Databricks PHI Detection
Learn how to detect PHI (Protected Health Information) in Databricks environments. Follow step-by-step guidance for HIPAA compliance.
Why It Matters
The core goal is to identify every location where Protected Health Information (PHI) is stored within your Databricks environment, so you can remediate unintended exposures before they become breaches. Scanning for PHI in Databricks is a priority for healthcare organizations subject to HIPAA: it helps you demonstrate that you have discovered and accounted for all sensitive patient data, which reduces the risk of unauthorized access and data exposure.
A thorough scan delivers immediate visibility, laying the foundation for automated policy enforcement and ongoing compliance.
Prerequisites
Permissions & Roles
- Databricks admin or service principal
- Read access to the catalogs, schemas, and tables in scope (in Unity Catalog terms, the USE CATALOG, USE SCHEMA, and SELECT privileges)
- Ability to install Databricks CLI or Terraform
External Tools
- Databricks CLI
- Cyera DSPM account
- API credentials
Prior Setup
- Databricks workspace provisioned
- Unity Catalog enabled
- CLI authenticated
- Networking rules configured
Introducing Cyera
Cyera is a Data Security Posture Management (DSPM) platform that discovers, classifies, and continuously monitors your sensitive data across cloud services. Using AI-powered Named Entity Recognition (NER) models, Cyera automatically identifies PHI patterns in your Databricks environment, including patient names, medical record numbers, diagnosis codes, and treatment information. This helps you catch accidental exposures early and keep evidence ready for HIPAA audits.
Step-by-Step Guide
Ensure Unity Catalog is enabled in your account and create a service principal with the minimum required privileges for PHI discovery.
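As a rough illustration of what "minimum required privileges" can look like, the sketch below generates read-only Unity Catalog GRANT statements for a service principal. The helper name `build_grants`, the catalog and schema names, and the application ID are hypothetical; the exact privileges the Cyera connector needs should be confirmed against Cyera's own documentation.

```python
def quote(identifier: str) -> str:
    """Backtick-quote a SQL identifier, escaping embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def build_grants(principal: str, catalog: str, schemas: list[str]) -> list[str]:
    """Return read-only GRANT statements for one catalog and selected schemas.

    `principal` is assumed to be the service principal's application ID.
    """
    who = quote(principal)
    statements = [f"GRANT USE CATALOG ON CATALOG {quote(catalog)} TO {who}"]
    for schema in schemas:
        target = f"{quote(catalog)}.{quote(schema)}"
        # USE SCHEMA lets the principal see the schema; SELECT at schema
        # level is inherited by the tables inside it.
        statements.append(f"GRANT USE SCHEMA ON SCHEMA {target} TO {who}")
        statements.append(f"GRANT SELECT ON SCHEMA {target} TO {who}")
    return statements


# Hypothetical names, for illustration only.
for statement in build_grants("1234-app-id", "clinical", ["ehr", "claims"]):
    print(statement + ";")
```

Granting per schema rather than on the whole catalog keeps the scan scope explicit, so anything outside the listed schemas stays unreadable to the scanner.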
In the Cyera portal, navigate to Integrations → DSPM → Add new. Select Databricks, provide your host URL and service principal details, then define the scan scope with healthcare-specific data classification rules.
Configure webhooks or streaming exports to push PHI detection results into your healthcare SIEM or Security Operations Center. Link findings to your existing compliance management systems and to the EHR platforms the data originates from, such as Epic or Cerner.
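The forwarding step might look something like the sketch below, which turns one detection result into an HTTP POST for a SIEM ingestion endpoint. The finding fields, the URL, and the bearer-token scheme are all assumptions made for illustration; Cyera's actual webhook payload and your SIEM's ingestion API will differ.

```python
import json
import urllib.request


def build_siem_request(finding: dict, siem_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request describing one PHI finding.

    Only metadata is forwarded: the table, column, and classification,
    never the PHI values themselves.
    """
    body = json.dumps(
        {
            "source": "phi-scan",
            "table": finding["table"],
            "column": finding["column"],
            "classification": finding["classification"],
            "confidence": finding["confidence"],
        }
    ).encode("utf-8")
    return urllib.request.Request(
        siem_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical finding and endpoint, for illustration only.
example = {
    "table": "clinical.ehr.encounters",
    "column": "patient_name",
    "classification": "patient_name",
    "confidence": 0.93,
}
request = build_siem_request(example, "https://siem.example.com/ingest", "REDACTED")
# urllib.request.urlopen(request) would send it; omitted here on purpose.
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any data leaves the environment.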
Review the initial PHI detection report, prioritize tables with large volumes of patient data, and adjust detection rules to reduce false positives while maintaining HIPAA compliance. Schedule recurring scans to maintain continuous visibility.
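One way to approach the prioritization step is to rank tables by a simple exposure score. The sketch below assumes the report can be exported into records with a row count and a count of PHI columns; the `TableFinding` shape and the scoring formula are illustrative choices, not part of Cyera's report format.

```python
from dataclasses import dataclass


@dataclass
class TableFinding:
    table: str        # fully qualified name, e.g. catalog.schema.table
    row_count: int    # approximate number of rows in the table
    phi_columns: int  # number of columns flagged as containing PHI


def prioritize(findings: list[TableFinding]) -> list[TableFinding]:
    """Order findings so the largest potential exposures come first.

    The score (rows x flagged columns) is a crude proxy for how many
    PHI values a table holds.
    """
    return sorted(findings, key=lambda f: f.row_count * f.phi_columns, reverse=True)


# Hypothetical report rows, for illustration only.
report = [
    TableFinding("clinical.ehr.lab_results", row_count=1_000, phi_columns=2),
    TableFinding("clinical.ehr.encounters", row_count=500_000, phi_columns=1),
    TableFinding("clinical.staging.tmp_upload", row_count=10, phi_columns=5),
]
for finding in prioritize(report):
    print(finding.table, finding.row_count * finding.phi_columns)
```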
Architecture & Workflow
- Databricks Unity Catalog: Source of metadata for healthcare tables and files
- Cyera Connector: Pulls metadata and samples data for PHI classification
- Cyera AI Engine: Applies NER models and healthcare-specific detection rules
- HIPAA Reporting: Compliance dashboards, alerts, and audit trails
Data Flow Summary
Databricks Unity Catalog → Cyera Connector → Cyera AI Engine → HIPAA Reporting
Best Practices & Tips
Performance Considerations
- Start with incremental scans for large EHR datasets
- Use intelligent sampling for massive patient tables
- Optimize scan schedules during low-activity periods
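The incremental and sampling ideas above can be sketched as follows. This assumes you already have a last-modified timestamp per table (for Delta tables, `DESCRIBE DETAIL` exposes a `lastModified` column) and that sampling is done with Databricks SQL's `TABLESAMPLE` clause; the function names and table metadata shape are hypothetical.

```python
from datetime import datetime, timezone


def select_for_incremental_scan(tables: list[dict], last_scan: datetime) -> list[str]:
    """Return names of tables modified after the previous scan finished."""
    return [t["name"] for t in tables if t["last_modified"] > last_scan]


def sample_query(full_table_name: str, percent: float) -> str:
    """Build a query that reads only a percentage of a large table."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return f"SELECT * FROM {full_table_name} TABLESAMPLE ({percent:g} PERCENT)"


# Hypothetical metadata, for illustration only.
last_scan = datetime(2024, 1, 10, tzinfo=timezone.utc)
tables = [
    {"name": "clinical.ehr.encounters", "last_modified": datetime(2024, 1, 12, tzinfo=timezone.utc)},
    {"name": "clinical.ehr.archive_2015", "last_modified": datetime(2023, 6, 1, tzinfo=timezone.utc)},
]
for name in select_for_incremental_scan(tables, last_scan):
    print(sample_query(name, 1))
```

Sampling trades completeness for speed, so a table that samples clean may still contain PHI in unsampled rows; a periodic full scan remains worthwhile.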
Tuning PHI Detection Rules
- Maintain allowlists for synthetic/test patient data
- Adjust confidence thresholds for medical terminology
- Configure region-specific healthcare identifiers
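The three tuning ideas above can be sketched as a post-processing filter over detection results. The schema allowlist, the per-classification thresholds, and the finding shape are hypothetical values chosen for illustration. The UK NHS number check is included as one example of a region-specific identifier that can be validated by checksum (modulus 11) rather than by pattern alone.

```python
DEFAULT_THRESHOLD = 0.8
# Hypothetical per-classification thresholds: stricter for noisy categories.
THRESHOLDS = {"diagnosis_code": 0.9, "patient_name": 0.7}
# Hypothetical schemas known to hold only synthetic or test patients.
ALLOWLISTED_SCHEMAS = {"synthetic", "test_fixtures"}


def keep_finding(finding: dict) -> bool:
    """Drop findings from allowlisted schemas or below their threshold.

    Assumes `finding["table"]` has the form catalog.schema.table.
    """
    schema = finding["table"].split(".")[1]
    if schema in ALLOWLISTED_SCHEMAS:
        return False
    threshold = THRESHOLDS.get(finding["classification"], DEFAULT_THRESHOLD)
    return finding["confidence"] >= threshold


def is_valid_nhs_number(value: str) -> bool:
    """Validate a UK NHS number using its modulus 11 check digit."""
    cleaned = value.replace(" ", "").replace("-", "")
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(c) for c in cleaned]
    # Weights 10 down to 2 apply to the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check value of 10 means the number is invalid.
    return check != 10 and check == digits[9]
```

Checksum validation of this kind tends to cut false positives sharply, because most random 10-digit strings fail the check.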
Common Pitfalls
- Overlooking encrypted PHI in blob storage
- Over-scanning development/staging environments
- Neglecting to audit service principal access logs
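For the last pitfall, a periodic review of what the scanning service principal actually accessed can be a simple query. The sketch below builds one against the `system.access.audit` system table, assuming audit system tables are enabled in your account and that the service principal appears in `user_identity.email` under its application ID; verify both assumptions in your own workspace before relying on it.

```python
def service_principal_audit_query(application_id: str, days: int = 7) -> str:
    """Build a SQL query listing recent audit events for one service principal."""
    if days < 1:
        raise ValueError("days must be at least 1")
    safe_id = application_id.replace("'", "''")  # escape single quotes
    return (
        "SELECT event_time, service_name, action_name, source_ip_address "
        "FROM system.access.audit "
        f"WHERE user_identity.email = '{safe_id}' "
        f"AND event_date >= date_sub(current_date(), {int(days)}) "
        "ORDER BY event_time DESC"
    )


# Hypothetical application ID, for illustration only.
print(service_principal_audit_query("1234-app-id", days=30))
```

Run the resulting query in a Databricks SQL editor or notebook and look for actions or source IP addresses that do not match the scanner's expected behavior.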