Databricks PHI Detection

Learn how to detect PHI (Protected Health Information) in Databricks environments. Follow step-by-step guidance for HIPAA compliance.

Why It Matters

The goal is to identify every location where Protected Health Information (PHI) is stored in your Databricks environment, so that you can remediate unintended exposures before they become breaches. For healthcare organizations subject to HIPAA, scanning Databricks for PHI helps demonstrate that sensitive patient data has been discovered and accounted for, which reduces the risk of unauthorized access and data exposure.

Primary Risk: Data exposure of Protected Health Information

Relevant Regulation: HIPAA (Health Insurance Portability and Accountability Act)

A thorough scan shows where PHI actually lives, which lays the foundation for automated policy enforcement and ongoing compliance.

Prerequisites

Permissions & Roles

  • Databricks admin or service principal
  • Read privileges on the catalogs, schemas, and tables in scope (in Unity Catalog: USE CATALOG, USE SCHEMA, and SELECT)
  • Ability to install Databricks CLI or Terraform

External Tools

  • Databricks CLI
  • Cyera DSPM account
  • API credentials

Prior Setup

  • Databricks workspace provisioned
  • Unity Catalog enabled
  • CLI authenticated
  • Networking rules configured

Introducing Cyera

Cyera is a Data Security Posture Management (DSPM) platform that discovers, classifies, and continuously monitors your sensitive data across cloud services. Using AI-powered Named Entity Recognition (NER) models, Cyera automatically identifies PHI in your Databricks environment, including patient names, medical record numbers, diagnosis codes, and treatment information. This helps you catch accidental exposures early and support HIPAA audit requirements.

Step-by-Step Guide

1. Configure your Databricks workspace

Ensure Unity Catalog is enabled in your account and create a service principal with the minimum privileges required for PHI discovery. Then authenticate the Databricks CLI, which prompts for your workspace host and a token:

databricks configure --token
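
The "minimum required privileges" in this step can be sketched as a small helper that generates read-only Unity Catalog GRANT statements for the scanning service principal. This is an illustration under assumptions, not Cyera's documented setup: the function name, the principal, and the catalog and schema names are hypothetical, and your deployment may need additional or different privileges.

```python
def build_read_only_grants(principal: str, catalog: str, schemas: list[str]) -> list[str]:
    """Return Unity Catalog GRANT statements giving read-only access.

    Assumption: the scanner only needs USE CATALOG, USE SCHEMA, and SELECT.
    """
    def quote(identifier: str) -> str:
        # Backtick-quote identifiers, escaping any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    grants = [f"GRANT USE CATALOG ON CATALOG {quote(catalog)} TO {quote(principal)};"]
    for schema in schemas:
        target = f"{quote(catalog)}.{quote(schema)}"
        grants.append(f"GRANT USE SCHEMA ON SCHEMA {target} TO {quote(principal)};")
        # SELECT granted on a schema is inherited by the tables inside it.
        grants.append(f"GRANT SELECT ON SCHEMA {target} TO {quote(principal)};")
    return grants


# Hypothetical principal and object names, for illustration only.
for statement in build_read_only_grants("phi-scanner", "clinical", ["ehr", "claims"]):
    print(statement)
```

Granting at the schema level keeps the statement list short. Granting SELECT per table is the stricter alternative when only part of a schema is in scope.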

2. Enable PHI scanning workflows

In the Cyera portal, navigate to Integrations → DSPM → Add new. Select Databricks, provide your host URL and service principal details, then define the scan scope with healthcare-specific data classification rules.

3. Integrate with healthcare systems

Configure webhooks or streaming exports to push PHI detection results into your SIEM or Security Operations Center. Link findings to your existing compliance management systems and to EHR platforms such as Epic or Cerner.
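
As a rough sketch of the webhook option, the snippet below builds a JSON event for one finding and wraps it in an authenticated POST request using only the Python standard library. The event fields, function names, URL, and token are all assumptions made for illustration. Cyera's actual export format and your SIEM's ingestion schema will differ.

```python
import json
import urllib.request


def build_finding_event(table, column, phi_type, confidence):
    """Describe where PHI was found without copying the PHI value itself."""
    return {
        "source": "databricks",
        "table": table,
        "column": column,
        "phi_type": phi_type,
        "confidence": confidence,
    }


def make_webhook_request(url, event, token):
    """Build (but do not send) an authenticated JSON POST request."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request, timeout=10) as response:
#       response.read()
```

The event carries only the location and type of the finding. Keeping raw PHI values out of the payload avoids creating a second copy of patient data in the SIEM.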

4. Validate results and tune PHI policies

Review the initial PHI detection report, prioritize tables with large volumes of patient data, and adjust detection rules to reduce false positives while maintaining HIPAA compliance. Schedule recurring scans to maintain continuous visibility.
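
One simple way to prioritize the initial report is to rank tables by row count and then by the number of columns flagged as PHI. The record layout below (`table`, `row_count`, `phi_columns`) is a hypothetical stand-in for whatever the detection report actually exports.

```python
def prioritize_tables(findings):
    """Rank tables by row count, then by number of PHI columns, largest first."""
    return sorted(
        findings,
        key=lambda finding: (finding["row_count"], finding["phi_columns"]),
        reverse=True,
    )


# Hypothetical report rows, for illustration only.
report = [
    {"table": "clinical.ehr.codes", "row_count": 12_000, "phi_columns": 1},
    {"table": "clinical.ehr.visits", "row_count": 4_500_000, "phi_columns": 3},
    {"table": "clinical.claims.lines", "row_count": 4_500_000, "phi_columns": 5},
]
for row in prioritize_tables(report):
    print(row["table"])
```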

Architecture & Workflow

Databricks Unity Catalog

Source of metadata for healthcare tables and files

Cyera Connector

Pulls metadata and samples data for PHI classification

Cyera AI Engine

Applies NER models and healthcare-specific detection rules

HIPAA Reporting

Compliance dashboards, alerts, and audit trails

Data Flow Summary

Enumerate Healthcare Data → Send to Cyera → Apply PHI Detection → Generate Compliance Reports
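
The flow above can be mimicked end to end in a few lines. Note the substitution: this sketch uses a handful of regular expressions as a stand-in for Cyera's NER models, which are not public, and the patterns cover only a few simplified identifier shapes. All names and the input layout are assumptions.

```python
import re
from collections import Counter

# Illustrative patterns only; real PHI detection needs far broader coverage.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    # Simplified ICD-10-like shape: letter, two digits, dot, 1-4 digits.
    "icd10": re.compile(r"\b[A-Z]\d{2}\.\d{1,4}\b"),
}


def detect_phi(sampled_values):
    """Count how many sampled values match each PHI pattern."""
    counts = Counter()
    for value in sampled_values:
        for label, pattern in PHI_PATTERNS.items():
            if pattern.search(str(value)):
                counts[label] += 1
    return counts


def scan(tables):
    """Enumerate tables and columns, detect PHI in samples, and build a report.

    `tables` maps a table name to {column name: list of sampled values}.
    """
    report = []
    for table, columns in tables.items():
        for column, values in columns.items():
            for label, matches in detect_phi(values).items():
                report.append({
                    "table": table,
                    "column": column,
                    "phi_type": label,
                    "matches": matches,
                    "sampled": len(values),
                })
    return report
```

The report records match counts per column rather than the matched values, so the output can be shared with compliance teams without exposing the data it describes.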

Best Practices & Tips

Performance Considerations

  • Start with incremental scans for large EHR datasets
  • Use intelligent sampling for massive patient tables
  • Optimize scan schedules during low-activity periods
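
A minimal sketch of the first two tips, assuming you track the time of the last completed scan and have each table's last-modified timestamp available (for example from the `last_altered` column of Unity Catalog's `information_schema.tables`). The function names and the record layout are hypothetical. `TABLESAMPLE` is Databricks SQL syntax for reading a fraction of a table.

```python
from datetime import datetime, timezone


def tables_needing_scan(tables, last_scan):
    """Return names of tables altered after the previous scan finished."""
    return [t["name"] for t in tables if t["last_altered"] > last_scan]


def sample_query(table_name, percent):
    """Build a query that reads roughly `percent` percent of a table.

    Assumes `table_name` comes from trusted catalog metadata, not user input.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return f"SELECT * FROM {table_name} TABLESAMPLE ({percent} PERCENT)"


# Hypothetical metadata, for illustration only.
last_scan = datetime(2024, 1, 10, tzinfo=timezone.utc)
catalog_tables = [
    {"name": "clinical.ehr.visits", "last_altered": datetime(2024, 1, 12, tzinfo=timezone.utc)},
    {"name": "clinical.ehr.codes", "last_altered": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]
for name in tables_needing_scan(catalog_tables, last_scan):
    print(sample_query(name, 1))
```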

Tuning PHI Detection Rules

  • Maintain allowlists for synthetic/test patient data
  • Adjust confidence thresholds for medical terminology
  • Configure region-specific healthcare identifiers
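
The first two tuning ideas can be illustrated with a small post-processing filter: drop findings from allowlisted synthetic or test tables, then keep only findings that meet a per-type confidence threshold. The finding fields, glob patterns, and threshold values are assumptions for the sake of the example, not Cyera settings.

```python
from fnmatch import fnmatchcase


def filter_findings(findings, allowlist, thresholds, default_threshold=0.8):
    """Drop allowlisted tables, then apply per-type confidence thresholds."""
    kept = []
    for finding in findings:
        # Skip tables known to hold synthetic or test patient data.
        if any(fnmatchcase(finding["table"], pattern) for pattern in allowlist):
            continue
        threshold = thresholds.get(finding["phi_type"], default_threshold)
        if finding["confidence"] >= threshold:
            kept.append(finding)
    return kept


# Hypothetical findings and settings, for illustration only.
findings = [
    {"table": "clinical.test.synthetic_patients", "phi_type": "name", "confidence": 0.99},
    {"table": "clinical.ehr.notes", "phi_type": "diagnosis", "confidence": 0.70},
    {"table": "clinical.ehr.notes", "phi_type": "mrn", "confidence": 0.85},
]
kept = filter_findings(
    findings,
    allowlist=["*.synthetic_*"],
    thresholds={"diagnosis": 0.9},  # stricter for ambiguous medical terms
)
for finding in kept:
    print(finding["table"], finding["phi_type"])
```

Raising the threshold for a noisy category such as medical terminology trades recall for fewer false positives, so it is worth reviewing a sample of the dropped findings before relying on it.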

Common Pitfalls

  • Overlooking encrypted PHI in blob storage
  • Over-scanning development/staging environments
  • Neglecting to audit service principal access logs