Building a GDPR-Compliant Data Platform for SaaS with Databricks
Goal
Build a data mesh architecture for a SaaS equity management platform that enables internal analytics, external self-service access, and real-time embedded analytics while maintaining GDPR compliance.
Core Requirements
- Decentralized data ownership with domain-driven architecture
- Real-time and batch processing capabilities
- Multi-tenant isolation with row/column-level security
- Self-service access via APIs, SQL, and BI tools
- Full data lineage and metadata management
Data Domains
Organize data into business-aligned domains:
- Core Business Entities - EquityPlans, Grants, Products
- Users & Identity - Users, Permissions, Sessions
- Operational Activities - ClientOps, Billing, Support
- Usage & Engagement - Events, Interactions, Feature Usage
- Outcomes & Performance - KPIs, Adoption, Forecasts
- External Integrations - Valuation APIs, Compliance Feeds
Each domain owns its datasets, publishes data contracts, and maintains SLAs.
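As an illustration, here is a hedged sketch of the kind of data contract a domain team might publish for one of its datasets; the contract shape, field names, and catalog paths are assumptions, not a fixed standard:
# Illustrative data contract for the Core Business Entities domain.
# All names (catalog path, owner, SLA fields) are placeholder assumptions.
equity_grants_contract = {
    "domain": "core_business_entities",
    "dataset": "main.core.equity_grants",
    "owner": "core-entities-team",
    "schema": {                    # published, versioned columns
        "tenant_id": "STRING NOT NULL",
        "grant_id": "STRING NOT NULL",
        "user_id": "STRING",
        "grant_date": "DATE",
    },
    "pii_columns": ["user_id"],    # feeds the GDPR tooling described later
    "sla": {"freshness_minutes": 15, "availability": "99.9%"},
}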
Architecture Stack
- Ingestion: Kafka + Databricks Delta Live Tables
- Processing: Databricks (Spark, Delta Lake)
- Governance: Unity Catalog + DataHub
- Orchestration: Databricks Workflows
- Serving: Databricks SQL + REST APIs + Embedded BI
Multi-Tenancy Implementation
Partition all tables by tenant_id:
CREATE TABLE equity_grants (
  tenant_id STRING NOT NULL,
  grant_id STRING,
  user_id STRING,
  grant_date DATE,
  ...
) PARTITIONED BY (tenant_id);
Enforce row-level security with a Unity Catalog row filter, which is defined as a SQL UDF and then bound to the table:
-- Boolean filter function; current_user_tenant() is a custom helper (sketched below)
CREATE OR REPLACE FUNCTION tenant_filter(tenant_id STRING)
RETURN tenant_id = current_user_tenant();
ALTER TABLE equity_grants SET ROW FILTER tenant_filter ON (tenant_id);
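current_user_tenant() is not a Databricks built-in; one hedged way to implement it is a SQL UDF that looks the caller up in a governed user-to-tenant mapping table (the user_tenant_map table is an assumption, and in practice the function would live in a dedicated governed schema):
# Hypothetical helper behind the row filter: map the calling identity to a tenant.
# Assumes a mapping table user_tenant_map(user_email STRING, tenant_id STRING).
spark.sql("""
    CREATE OR REPLACE FUNCTION current_user_tenant()
    RETURNS STRING
    RETURN (
        SELECT tenant_id
        FROM user_tenant_map
        WHERE user_email = current_user()
    )
""")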
Streaming Ingestion
Configure Kafka topics per domain:
# Delta Live Tables pipeline
import dlt

@dlt.table(comment="Raw equity events ingested from Kafka")
def equity_events_bronze():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "equity.grants,equity.exercises")
        .option("startingOffsets", "latest")
        .load()
    )
Transform with Delta Live Tables:
from pyspark.sql.functions import from_json

# "schema" is the StructType for grant event payloads, defined elsewhere in the pipeline
@dlt.table
@dlt.expect_or_drop("valid_tenant", "tenant_id IS NOT NULL")
def equity_grants_silver():
    return (
        dlt.read_stream("equity_events_bronze")
        .selectExpr("CAST(value AS STRING) AS json_data")
        .select(from_json("json_data", schema).alias("data"))
        .select("data.*")
    )
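Continuing the same pipeline module (imports as above), a hedged sketch of a gold-layer aggregate built on the silver table; the per-tenant daily grant count is just an illustrative metric:
# Illustrative gold-layer materialization: daily grant counts per tenant.
@dlt.table
def equity_grants_gold():
    return (
        dlt.read("equity_grants_silver")
        .groupBy("tenant_id", "grant_date")
        .count()
    )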
Data Access Layers
API Access
# FastAPI endpoint with tenant isolation
from fastapi import Depends, FastAPI

app = FastAPI()

@app.get("/api/v1/grants")
async def get_grants(current_user: User = Depends(get_current_user)):
    # The tenant comes from the authenticated session, never from the request,
    # and is passed as a bound parameter (assuming the databricks_sql helper
    # forwards parameters to the SQL connector) rather than interpolated.
    query = "SELECT * FROM equity_grants WHERE tenant_id = :tenant_id"
    return databricks_sql.execute(query, {"tenant_id": current_user.tenant_id})
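The User model and get_current_user dependency referenced above are not shown; a hedged sketch, assuming a JWT bearer token whose claim names (sub, tenant_id) and signing setup are illustrative:
# Hypothetical auth dependency: resolve the caller and tenant from a JWT.
# Claim names, the signing-key handling, and the User model are assumptions.
import jwt  # PyJWT
from fastapi import Header, HTTPException
from pydantic import BaseModel

class User(BaseModel):
    user_id: str
    tenant_id: str

JWT_SECRET = "replace-with-configured-secret"

async def get_current_user(authorization: str = Header(...)) -> User:
    try:
        token = authorization.removeprefix("Bearer ").strip()
        claims = jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
        return User(user_id=claims["sub"], tenant_id=claims["tenant_id"])
    except (jwt.PyJWTError, KeyError):
        raise HTTPException(status_code=401, detail="Invalid or missing token")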
Direct SQL Access
Configure Unity Catalog permissions:
GRANT SELECT ON TABLE equity_grants TO `client_admin`;
GRANT SELECT ON TABLE equity_grants TO `internal_analyst`;
-- Column masking for PII
CREATE FUNCTION mask_email(email STRING)
RETURNS STRING
RETURN CONCAT(SUBSTRING(email, 1, 2), '***@***.com');
ALTER TABLE users ALTER COLUMN email SET MASK mask_email;
Embedded BI
// Embed Power BI report with tenant context
// (the report id and embedUrl, returned by the backend alongside the token, are omitted here)
const embedConfig = {
  type: 'report',
  tokenType: models.TokenType.Embed,
  accessToken: token,
  filters: [{
    $schema: "http://powerbi.com/product/schema#basic",
    target: {
      table: "equity_grants",
      column: "tenant_id"
    },
    operator: "In",
    values: [currentUser.tenantId]
  }]
};
// Client-side filters are a UX default, not a security boundary: tenant isolation
// must also be enforced server-side, e.g. via row-level security tied to the embed token.
GDPR Compliance
Subject Access Request
def extract_user_data(user_id: str, tenant_id: str):
    # Walk every GDPR-scoped table and yield (table, rows) for the data subject.
    tables = unity_catalog.list_tables(schema="gdpr_scope")
    for table in tables:
        query = f"""
            SELECT * FROM {table}
            WHERE tenant_id = :tenant_id
              AND user_id = :user_id
        """
        yield table, databricks_sql.execute(
            query, {"tenant_id": tenant_id, "user_id": user_id}
        )
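A hedged sketch of how the extracted records might be packaged for delivery to the data subject; the JSON layout and row conversion assume the SQL helper returns dict-like rows:
# Illustrative SAR packaging: one JSON document keyed by source table.
import json

def build_sar_package(user_id: str, tenant_id: str, out_path: str) -> None:
    package = {}
    for table, rows in extract_user_data(user_id, tenant_id):
        package[table] = [dict(row) for row in rows]  # assumes dict-like rows
    with open(out_path, "w") as f:
        json.dump(package, f, indent=2, default=str)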
Data Erasure
def erase_user_data(user_id: str, tenant_id: str):
    # Soft-delete and null out PII columns in each GDPR-scoped table.
    # "pii_fields" stands in for the table's actual PII columns.
    for table in gdpr_tables:
        spark.sql(f"""
            UPDATE {table}
            SET deleted_at = current_timestamp(),
                pii_fields = NULL
            WHERE tenant_id = '{tenant_id}'
              AND user_id = '{user_id}'
        """)
    # Audit trail
    log_gdpr_action("ERASURE", user_id, tenant_id)
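An UPDATE only removes the values from the current table version; earlier Delta versions still hold the PII and remain reachable via time travel until they are vacuumed. A hedged sketch of the physical cleanup step (the retention window is a policy decision; Delta's default retention check expects at least 7 days):
# Physically remove superseded data files so erased PII cannot be recovered
# through Delta time travel. 168 hours is Delta's default retention threshold.
def vacuum_gdpr_tables(retention_hours: int = 168):
    for table in gdpr_tables:
        spark.sql(f"VACUUM {table} RETAIN {retention_hours} HOURS")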
PII Annotation
ALTER TABLE users
ALTER COLUMN email SET TAGS ('pii' = 'email', 'retention' = '7y');
ALTER TABLE users
ALTER COLUMN ssn SET TAGS ('pii' = 'sensitive', 'retention' = '7y');
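These tags can drive the SAR and erasure jobs instead of hand-maintained column lists. A hedged sketch that enumerates PII columns from Unity Catalog's information schema (assuming the catalog is named main):
# List every column tagged as PII so GDPR jobs stay in sync with the catalog.
pii_columns = spark.sql("""
    SELECT table_name, column_name, tag_value AS pii_type
    FROM main.information_schema.column_tags
    WHERE tag_name = 'pii'
""").collect()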
Metadata Management with DataHub
Deploy DataHub for lineage tracking:
# docker-compose.yml (abridged; the full DataHub quickstart also runs its
# MySQL, Elasticsearch, and Kafka dependencies)
version: '3'
services:
  datahub-gms:
    image: acryldata/datahub-gms:latest
    environment:
      - DATAHUB_ANALYTICS_ENABLED=true
  datahub-frontend:
    image: acryldata/datahub-frontend-react:latest
Ingest Unity Catalog metadata:
pip install 'acryl-datahub[unity-catalog]'
datahub ingest -c databricks-recipe.yml
# databricks-recipe.yml
source:
  type: unity-catalog
  config:
    workspace_url: "https://workspace.databricks.com"
    token: "${DATABRICKS_TOKEN}"
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
Access Control by Persona
-- Internal Data Scientist: read access across the whole production catalog
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG main TO `data_scientist`;
-- Client Admin (HR)
GRANT SELECT ON TABLE equity_grants TO `client_admin`;
GRANT SELECT ON TABLE equity_exercises TO `client_admin`;
-- End User (Employee): scoped view keyed on the caller's identity
-- (current_user() returns the workspace identity, typically an email)
CREATE VIEW user_equity_view AS
SELECT * FROM equity_grants
WHERE user_id = current_user();
GRANT SELECT ON VIEW user_equity_view TO `end_user`;
Monitoring & Audit
-- Create audit log table (partitioned by a generated date column,
-- since Delta partition columns must be real columns, not expressions)
CREATE TABLE audit_log (
  timestamp TIMESTAMP,
  user_id STRING,
  tenant_id STRING,
  action STRING,
  table_name STRING,
  row_count BIGINT,
  query_text STRING,
  event_date DATE GENERATED ALWAYS AS (CAST(timestamp AS DATE))
) PARTITIONED BY (event_date);
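The log_gdpr_action() helper called from the erasure job above can append to this table; a hedged sketch (the default table_name and row_count values are placeholders, and event_date is computed by Delta on write):
# Append a GDPR action record to audit_log.
from datetime import datetime, timezone

def log_gdpr_action(action: str, user_id: str, tenant_id: str,
                    table_name: str = "gdpr_scope", row_count: int = 0):
    spark.createDataFrame(
        [(datetime.now(timezone.utc), user_id, tenant_id, action,
          table_name, row_count, None)],
        schema="timestamp TIMESTAMP, user_id STRING, tenant_id STRING, "
               "action STRING, table_name STRING, row_count BIGINT, "
               "query_text STRING",
    ).write.mode("append").saveAsTable("audit_log")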
-- Capture query activity from the system tables
CREATE OR REPLACE VIEW query_audit AS
SELECT
  start_time,
  executed_by,
  statement_text
FROM system.query.history
WHERE statement_type = 'SELECT';
Deployment
# Initialize Databricks workspace
databricks workspace import_dir ./pipelines /Workspace/pipelines

# Deploy Delta Live Tables (name, storage, and target are defined in pipeline.json; see sketch below)
databricks pipelines create --settings pipeline.json

# Set up Unity Catalog (legacy CLI syntax)
databricks unity-catalog catalogs create \
  --name production \
  --comment "Production data catalog"
This architecture provides a scalable, governed data platform that balances self-service access with strict compliance requirements.