Building a GDPR-Compliant Data Platform for SaaS with Databricks
Goal
Build a data mesh architecture for a SaaS equity management platform that enables internal analytics, external self-service access, and real-time embedded analytics while maintaining GDPR compliance.
Core Requirements
- Decentralized data ownership with domain-driven architecture
- Real-time and batch processing capabilities
- Multi-tenant isolation with row/column-level security
- Self-service access via APIs, SQL, and BI tools
- Full data lineage and metadata management
Data Domains
Organize data into business-aligned domains:
- Core Business Entities - EquityPlans, Grants, Products
- Users & Identity - Users, Permissions, Sessions
- Operational Activities - ClientOps, Billing, Support
- Usage & Engagement - Events, Interactions, Feature Usage
- Outcomes & Performance - KPIs, Adoption, Forecasts
- External Integrations - Valuation APIs, Compliance Feeds
Each domain owns its datasets, publishes data contracts, and maintains SLAs.
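As an illustration, here is a hedged sketch of the kind of data contract a domain team might publish for one of its datasets; the contract shape, field names, and catalog paths are assumptions, not a fixed standard:
# Illustrative data contract for the Core Business Entities domain.
# All names (catalog path, owner, SLA fields) are placeholder assumptions.
equity_grants_contract = {
    "domain": "core_business_entities",
    "dataset": "main.core.equity_grants",
    "owner": "core-entities-team",
    "schema": {                    # published, versioned columns
        "tenant_id": "STRING NOT NULL",
        "grant_id": "STRING NOT NULL",
        "user_id": "STRING",
        "grant_date": "DATE",
    },
    "pii_columns": ["user_id"],    # feeds the GDPR tooling described later
    "sla": {"freshness_minutes": 15, "availability": "99.9%"},
}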
Architecture Stack
- Ingestion: Kafka + Databricks Delta Live Tables
- Processing: Databricks (Spark, Delta Lake)
- Governance: Unity Catalog + DataHub
- Orchestration: Databricks Workflows
- Serving: Databricks SQL + REST APIs + Embedded BI
Multi-Tenancy Implementation
Partition all tables by tenant_id:
CREATE TABLE equity_grants (
  tenant_id STRING NOT NULL,
  grant_id STRING,
  user_id STRING,
  grant_date DATE,
  ...
) PARTITIONED BY (tenant_id);
Enforce row-level security with a Unity Catalog row filter, which is defined as a SQL UDF and then bound to the table:
-- Boolean filter function; current_user_tenant() is a custom helper (sketched below)
CREATE OR REPLACE FUNCTION tenant_filter(tenant_id STRING)
RETURN tenant_id = current_user_tenant();
ALTER TABLE equity_grants SET ROW FILTER tenant_filter ON (tenant_id);
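current_user_tenant() is not a Databricks built-in; one hedged way to implement it is a SQL UDF that looks the caller up in a governed user-to-tenant mapping table (the user_tenant_map table is an assumption, and in practice the function would live in a dedicated governed schema):
# Hypothetical helper behind the row filter: map the calling identity to a tenant.
# Assumes a mapping table user_tenant_map(user_email STRING, tenant_id STRING).
spark.sql("""
    CREATE OR REPLACE FUNCTION current_user_tenant()
    RETURNS STRING
    RETURN (
        SELECT tenant_id
        FROM user_tenant_map
        WHERE user_email = current_user()
    )
""")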
Streaming Ingestion
Configure Kafka topics per domain:
# Delta Live Tables pipeline
import dlt

@dlt.table(comment="Raw equity events ingested from Kafka")
def equity_events_bronze():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "equity.grants,equity.exercises")
        .option("startingOffsets", "latest")
        .load()
    )
Transform with Delta Live Tables:
from pyspark.sql.functions import from_json

# "schema" is the StructType for grant event payloads, defined elsewhere in the pipeline
@dlt.table
@dlt.expect_or_drop("valid_tenant", "tenant_id IS NOT NULL")
def equity_grants_silver():
    return (
        dlt.read_stream("equity_events_bronze")
        .selectExpr("CAST(value AS STRING) AS json_data")
        .select(from_json("json_data", schema).alias("data"))
        .select("data.*")
    )
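Continuing the same pipeline module (imports as above), a hedged sketch of a gold-layer aggregate built on the silver table; the per-tenant daily grant count is just an illustrative metric:
# Illustrative gold-layer materialization: daily grant counts per tenant.
@dlt.table
def equity_grants_gold():
    return (
        dlt.read("equity_grants_silver")
        .groupBy("tenant_id", "grant_date")
        .count()
    )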
Data Access Layers
API Access
# FastAPI endpoint with tenant isolation
from fastapi import Depends, FastAPI

app = FastAPI()

@app.get("/api/v1/grants")
async def get_grants(current_user: User = Depends(get_current_user)):
    # The tenant comes from the authenticated session, never from the request,
    # and is passed as a bound parameter (assuming the databricks_sql helper
    # forwards parameters to the SQL connector) rather than interpolated.
    query = "SELECT * FROM equity_grants WHERE tenant_id = :tenant_id"
    return databricks_sql.execute(query, {"tenant_id": current_user.tenant_id})
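The User model and get_current_user dependency referenced above are not shown; a hedged sketch, assuming a JWT bearer token whose claim names (sub, tenant_id) and signing setup are illustrative:
# Hypothetical auth dependency: resolve the caller and tenant from a JWT.
# Claim names, the signing-key handling, and the User model are assumptions.
import jwt  # PyJWT
from fastapi import Header, HTTPException
from pydantic import BaseModel

class User(BaseModel):
    user_id: str
    tenant_id: str

JWT_SECRET = "replace-with-configured-secret"

async def get_current_user(authorization: str = Header(...)) -> User:
    try:
        token = authorization.removeprefix("Bearer ").strip()
        claims = jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
        return User(user_id=claims["sub"], tenant_id=claims["tenant_id"])
    except (jwt.PyJWTError, KeyError):
        raise HTTPException(status_code=401, detail="Invalid or missing token")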
Direct SQL Access
Configure Unity Catalog permissions:
GRANT SELECT ON TABLE equity_grants TO `client_admin`;
GRANT SELECT ON TABLE equity_grants TO `internal_analyst`;
-- Column masking for PII
CREATE FUNCTION mask_email(email STRING)
RETURNS STRING
RETURN CONCAT(SUBSTRING(email, 1, 2), '***@***.com');
ALTER TABLE users ALTER COLUMN email SET MASK mask_email;
Embedded BI
// Embed Power BI report with tenant context
// (the report id and embedUrl, returned by the backend alongside the token, are omitted here)
const embedConfig = {
  type: 'report',
  tokenType: models.TokenType.Embed,
  accessToken: token,
  filters: [{
    $schema: "http://powerbi.com/product/schema#basic",
    target: {
      table: "equity_grants",
      column: "tenant_id"
    },
    operator: "In",
    values: [currentUser.tenantId]
  }]
};
// Client-side filters are a UX default, not a security boundary: tenant isolation
// must also be enforced server-side, e.g. via row-level security tied to the embed token.
GDPR Compliance
Subject Access Request
def extract_user_data(user_id: str, tenant_id: str):
    # Walk every GDPR-scoped table and yield (table, rows) for the data subject.
    tables = unity_catalog.list_tables(schema="gdpr_scope")
    for table in tables:
        query = f"""
            SELECT * FROM {table}
            WHERE tenant_id = :tenant_id
              AND user_id = :user_id
        """
        yield table, databricks_sql.execute(
            query, {"tenant_id": tenant_id, "user_id": user_id}
        )
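A hedged sketch of how the extracted records might be packaged for delivery to the data subject; the JSON layout and row conversion assume the SQL helper returns dict-like rows:
# Illustrative SAR packaging: one JSON document keyed by source table.
import json

def build_sar_package(user_id: str, tenant_id: str, out_path: str) -> None:
    package = {}
    for table, rows in extract_user_data(user_id, tenant_id):
        package[table] = [dict(row) for row in rows]  # assumes dict-like rows
    with open(out_path, "w") as f:
        json.dump(package, f, indent=2, default=str)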
Data Erasure
def erase_user_data(user_id: str, tenant_id: str):
    # Soft-delete and null out PII columns in each GDPR-scoped table.
    # "pii_fields" stands in for the table's actual PII columns.
    for table in gdpr_tables:
        spark.sql(f"""
            UPDATE {table}
            SET deleted_at = current_timestamp(),
                pii_fields = NULL
            WHERE tenant_id = '{tenant_id}'
              AND user_id = '{user_id}'
        """)
    # Audit trail
    log_gdpr_action("ERASURE", user_id, tenant_id)
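An UPDATE only removes the values from the current table version; earlier Delta versions still hold the PII and remain reachable via time travel until they are vacuumed. A hedged sketch of the physical cleanup step (the retention window is a policy decision; Delta's default retention check expects at least 7 days):
# Physically remove superseded data files so erased PII cannot be recovered
# through Delta time travel. 168 hours is Delta's default retention threshold.
def vacuum_gdpr_tables(retention_hours: int = 168):
    for table in gdpr_tables:
        spark.sql(f"VACUUM {table} RETAIN {retention_hours} HOURS")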
PII Annotation
ALTER TABLE users
ALTER COLUMN email SET TAGS ('pii' = 'email', 'retention' = '7y');
ALTER TABLE users
ALTER COLUMN ssn SET TAGS ('pii' = 'sensitive', 'retention' = '7y');
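These tags can drive the SAR and erasure jobs instead of hand-maintained column lists. A hedged sketch that enumerates PII columns from Unity Catalog's information schema (assuming the catalog is named main):
# List every column tagged as PII so GDPR jobs stay in sync with the catalog.
pii_columns = spark.sql("""
    SELECT table_name, column_name, tag_value AS pii_type
    FROM main.information_schema.column_tags
    WHERE tag_name = 'pii'
""").collect()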
Metadata Management with DataHub
Deploy DataHub for lineage tracking:
# docker-compose.yml (abridged; the full DataHub quickstart also runs its
# MySQL, Elasticsearch, and Kafka dependencies)
version: '3'
services:
  datahub-gms:
    image: acryldata/datahub-gms:latest
    environment:
      - DATAHUB_ANALYTICS_ENABLED=true
  datahub-frontend:
    image: acryldata/datahub-frontend-react:latest
Ingest Unity Catalog metadata:
pip install 'acryl-datahub[unity-catalog]'
datahub ingest -c databricks-recipe.yml
# databricks-recipe.yml
source:
  type: unity-catalog
  config:
    workspace_url: "https://workspace.databricks.com"
    token: "${DATABRICKS_TOKEN}"
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
Access Control by Persona
-- Internal Data Scientist: read access across the whole production catalog
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG main TO `data_scientist`;
-- Client Admin (HR)
GRANT SELECT ON TABLE equity_grants TO `client_admin`;
GRANT SELECT ON TABLE equity_exercises TO `client_admin`;
-- End User (Employee): scoped view keyed on the caller's identity
-- (current_user() returns the workspace identity, typically an email)
CREATE VIEW user_equity_view AS
SELECT * FROM equity_grants
WHERE user_id = current_user();
GRANT SELECT ON VIEW user_equity_view TO `end_user`;
Monitoring & Audit
-- Create audit log table (partitioned by a generated date column,
-- since Delta partition columns must be real columns, not expressions)
CREATE TABLE audit_log (
  timestamp TIMESTAMP,
  user_id STRING,
  tenant_id STRING,
  action STRING,
  table_name STRING,
  row_count BIGINT,
  query_text STRING,
  event_date DATE GENERATED ALWAYS AS (CAST(timestamp AS DATE))
) PARTITIONED BY (event_date);
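The log_gdpr_action() helper called from the erasure job above can append to this table; a hedged sketch (the default table_name and row_count values are placeholders, and event_date is computed by Delta on write):
# Append a GDPR action record to audit_log.
from datetime import datetime, timezone

def log_gdpr_action(action: str, user_id: str, tenant_id: str,
                    table_name: str = "gdpr_scope", row_count: int = 0):
    spark.createDataFrame(
        [(datetime.now(timezone.utc), user_id, tenant_id, action,
          table_name, row_count, None)],
        schema="timestamp TIMESTAMP, user_id STRING, tenant_id STRING, "
               "action STRING, table_name STRING, row_count BIGINT, "
               "query_text STRING",
    ).write.mode("append").saveAsTable("audit_log")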
-- Capture query activity from the system tables
CREATE OR REPLACE VIEW query_audit AS
SELECT
  start_time,
  executed_by,
  statement_text
FROM system.query.history
WHERE statement_type = 'SELECT';
Deployment
# Initialize Databricks workspace
databricks workspace import_dir ./pipelines /Workspace/pipelines

# Deploy Delta Live Tables (name, storage, and target are defined in pipeline.json; see sketch below)
databricks pipelines create --settings pipeline.json

# Set up Unity Catalog (legacy CLI syntax)
databricks unity-catalog catalogs create \
  --name production \
  --comment "Production data catalog"
This architecture provides a scalable, governed data platform that balances self-service access with strict compliance requirements.