Databricks

Customers on an Enterprise or Growth plan can access Data Pipeline as an add-on package. See our pricing page for more details.

The Databricks export destination is currently in beta.

Export your Mixpanel data to Databricks using Unity Catalog Managed Volumes. This integration supports all Databricks clouds (AWS, GCP, and Azure).

Design

Mixpanel exports data to customer’s Databricks workspace using Unity Catalog Managed Volumes. We first load the data into a single-column raw (VARIANT type) data table, then create a view to expose all properties as typed columns.

Supported Features

  • Cross-cloud Support: Works with Databricks on AWS, GCP, or Azure
  • Date Partitioning: Raw tables are clustered by event_date for efficient date-based queries
  • Static IP Addresses: Supports IP allowlisting for secure connections

IP Allowlist

Mixpanel Data Pipelines supports static IP addresses for Databricks connections when IP restrictions are configured on your Databricks workspace. If you are using network policies to restrict access to your instance, you might need to add the following IP addresses to the allowed list:

US

34.31.112.201
35.184.21.33
35.225.176.74

EU

34.147.68.192
35.204.164.122
35.204.177.251

Prerequisites

Before setting up the integration, ensure you have:

  • A Databricks workspace with Unity Catalog enabled
  • Databricks Runtime 15.3+ (for VARIANT type support)
  • A SQL Warehouse (Serverless recommended for best performance and cost) or all-purpose compute cluster
  • Admin permissions in your Databricks workspace to create Service Principals

Set Export Permissions

Step 1: Create a Service Principal

A Service Principal is a Databricks identity that Mixpanel will use to access your workspace.

  1. In your Databricks workspace, navigate to SettingsIdentity and accessService principals
  2. Click Add service principal
  3. Enter a name (e.g., mixpanel-export)
  4. Click Add
  5. Note the Application ID (Client ID) - you’ll need this later

Step 2: Generate OAuth Secret

  1. Click on the Service Principal you just created
  2. Navigate to the Secrets tab
  3. Click Generate secret
  4. Copy the Secret value immediately - it won’t be shown again
  5. Store it securely - you’ll need it for Mixpanel configuration

Step 3: Create Catalog and Schema

Create a dedicated catalog and schema for Mixpanel data (or use existing ones).

CREATE CATALOG IF NOT EXISTS mixpanel_export
COMMENT 'Mixpanel data exports';
 
-- Create schema
CREATE SCHEMA IF NOT EXISTS mixpanel_export.json_pipelines;

Step 4: Grant Permissions to Service Principal

Grant the Service Principal required permissions to operate within the catalog.

GRANT USE CATALOG ON CATALOG mixpanel_export
  TO `<service-principal-application-id>`;
 
GRANT USE SCHEMA ON SCHEMA mixpanel_export.json_pipelines
  TO `<service-principal-application-id>`;
 
GRANT CREATE TABLE ON SCHEMA mixpanel_export.json_pipelines
  TO `<service-principal-application-id>`;
 
GRANT CREATE VOLUME ON SCHEMA mixpanel_export.json_pipelines
  TO `<service-principal-application-id>`;

For Metastore Privilege Version 1.1+ only, additionally grant:

GRANT CREATE VIEW ON SCHEMA mixpanel_export.json_pipelines
  TO `<service-principal-application-id>`;

To check your metastore privilege version:

DESCRIBE METASTORE;

Look for PrivilegeModelVersion:

  • Version 1.0: CREATE TABLE covers both tables and views
  • Version 1.1+: Separate CREATE VIEW privilege required

Why These Permissions?

  • USE CATALOG: Required to access the catalog
  • USE SCHEMA: Required to access objects in the schema
  • CREATE TABLE: Create raw tables to store event data
  • CREATE VIEW: Create views with typed columns (metastore v1.1+ only)
  • CREATE VOLUME: Create temporary volumes for uploading files

Note: The Service Principal automatically becomes the owner of all objects it creates (tables, views, volumes), which grants it full permissions (SELECT, MODIFY, READ FILES, WRITE FILES, etc.) on those objects. No additional grants needed.

Step 5: Grant Compute Access

The Service Principal needs permission to use compute resources to execute queries. You can use either a SQL Warehouse or an All-Purpose Compute cluster.

  1. In your Databricks workspace, navigate to SQL Warehouses
  2. Click on your SQL Warehouse (or create one if needed)
  3. Go to the Permissions tab
  4. Click Add or Grant permissions
  5. Search for your Service Principal by Application ID (from Step 1)
  6. Select permission level: Can use (minimum required)
  7. Click Save

Why this is needed: Compute resources have their own access control separate from data permissions. Even with all catalog/schema/table permissions, the Service Principal cannot execute queries without compute access.

Step 6: Get Compute Connection Details

  1. In the same SQL Warehouse, go to the Connection details tab
  2. Note the following values:
    • Server hostname: e.g., abc123.cloud.databricks.com
    • HTTP Path: e.g., /sql/1.0/warehouses/xyz789

Recommendation: Use a Serverless SQL Warehouse for:

  • Fast startup (~3 seconds)
  • Auto-scaling
  • Pay-per-use pricing
  • No idle cluster costs

Step 7: Configure Mixpanel Integration

Refer to Step 2: Creating the Pipeline to create data pipeline via UI. You’ll need to provide:

  • Server Hostname (from Step 6)
  • HTTP Path (from Step 6)
  • Catalog (from Step 3, e.g., mixpanel_export)
  • Schema (from Step 3, e.g., json_pipelines)
  • Service Principal ID (Application ID from Step 1)
  • Service Principal Secret (from Step 2)

Partitioning

Raw tables are clustered by the event_date column, which is computed in your project’s timezone during data load. This clustering significantly improves query performance when filtering by date.

Data Schema

Mixpanel creates a raw table and a view with typed columns:

Raw Table Columns (mp_master_event_raw):

  • DATA (VARIANT) - Contains the complete event JSON
  • event_date (DATE) - Computed from event time in your project’s timezone

View Columns (mp_master_event):

  • user_id (STRING)
  • time (TIMESTAMP)
  • properties (VARIANT) - All event properties as semi-structured data
  • insert_id (STRING)
  • event_name (STRING)
  • distinct_id (STRING)
  • device_id (STRING)
  • event_date (DATE)

Queries

Databricks supports a VARIANT type that can store semi-structured JSON data. Use the :: syntax to extract and cast properties from VARIANT columns.

Basic event query

SELECT count(*)
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_name = 'sign up';

Query nested properties

SELECT
    event_name,
    time,
    distinct_id,
    properties:button_name::STRING AS button_name,
    properties:product:category::STRING AS product_category,
    properties:price::DOUBLE AS price
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_name = 'Purchase'
  AND properties:product:category::STRING = 'Electronics';

Efficient date filtering

Use the event_date column for best performance:

SELECT
    event_name,
    time,
    distinct_id,
    properties:button_name::STRING AS button_name
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND event_name = 'Purchase'
ORDER BY time DESC;

Getting the number of events per day

SELECT
    event_date,
    count(*) as event_count
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_date >= '2024-01-01' AND event_date <= '2024-01-31'
GROUP BY event_date
ORDER BY event_date;

Costs

  • Delta tables: Billed by your cloud provider (AWS S3, GCP GCS, or Azure ADLS) via Databricks
  • Managed volumes: Temporary storage cleaned up after each export
  • Compute: SQL Warehouse usage during COPY INTO operations

Was this page useful?