Authentication overview

Databricks offers several authentication methods to secure access to its platform, APIs, and underlying cloud resources. The choice of method typically depends on the client (interactive user, automated script, or application), the cloud provider (AWS, Azure, GCP), and the desired level of granularity and automation. Authentication mechanisms ensure that only authorized entities can interact with Databricks workspaces, clusters, notebooks, and data.

The Databricks platform integrates with the identity providers (IdPs) native to the major cloud environments, which allows centralized identity management. Programmatic access, such as calling the Databricks REST API or using the Databricks SDKs, relies on dedicated credentials such as personal access tokens or OAuth 2.0 tokens. Choosing the appropriate authentication flow is critical for secure and efficient operation within the Databricks Lakehouse Platform.

Supported authentication methods

Databricks supports a range of authentication methods, each designed for specific use cases and security requirements. The primary methods include:

  • Personal Access Tokens (PATs): Simple for individual users and scripts, offering a quick way to authenticate to the Databricks REST API or the Databricks CLI. PATs are long-lived credentials and require careful management.
  • OAuth 2.0: Recommended for applications and automated tools that require short-lived, refreshable tokens. This method is a more secure alternative to PATs for service-to-service communication and follows the OAuth 2.0 authorization framework (RFC 6749).
  • Azure Active Directory (AAD) Service Principals: For Databricks on Azure, service principals let applications and services authenticate without relying on user credentials. They are managed in Azure Active Directory (since renamed Microsoft Entra ID) and integrate with Azure's identity and access management system, allowing for role-based access control (RBAC).
  • Google Cloud Service Accounts: On Databricks on Google Cloud, service accounts provide a similar function to Azure AD service principals, allowing applications to authenticate using keys or workload identity federation. This leverages Google Cloud's authentication mechanisms.
  • AWS IAM Roles/Users: For Databricks on AWS, authentication can leverage AWS Identity and Access Management (IAM) roles and users, particularly for cross-account access or when integrating with other AWS services. Assumed roles use temporary credentials issued by AWS STS (Security Token Service), whereas IAM users rely on long-lived access keys.
  • Machine-to-Machine (M2M) Authentication: Databricks offers specific M2M authentication flows, often building on OAuth 2.0 or cloud-native service identities, for automated workflows and integrations.

The following table summarizes the key authentication methods:

| Method | When to Use | Security Level |
|---|---|---|
| Personal Access Tokens (PATs) | Individual user scripts, CLI access, quick prototyping | Medium (requires careful handling, long-lived) |
| OAuth 2.0 (User-bound) | Interactive applications, user-initiated workflows | High (short-lived, refreshable, user consent) |
| OAuth 2.0 (Client Credentials) | Service-to-service communication, automated applications | High (short-lived, no user interaction, client secret/certificate) |
| Azure AD Service Principals | Applications, automation on Azure; RBAC integration | High (integrated with Azure IAM, managed secrets) |
| Google Cloud Service Accounts | Applications, automation on GCP; IAM integration | High (integrated with Google Cloud IAM, key management) |
| AWS IAM Roles/Users | Applications, automation on AWS; cross-account access | High (integrated with AWS IAM, temporary credentials) |

Getting your credentials

The process for obtaining credentials varies by authentication method:

Personal Access Tokens (PATs)

  1. Log in to your Databricks workspace.
  2. Navigate to User Settings > Developer > Access Tokens.
  3. Click "Generate new token."
  4. Provide a comment and an optional lifetime for the token.
  5. Copy the generated token immediately, as it will not be shown again. Store it securely. For detailed steps, refer to the Databricks PAT documentation.
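
Tokens can also be requested programmatically once you already hold a valid credential. The following is a minimal sketch, assuming the Token API endpoint POST /api/2.0/token/create with `comment` and `lifetime_seconds` fields; the helper name `build_token_create_request` and the 7-day default are illustrative and not taken from the steps above. Check the endpoint against the current REST API reference before relying on it.

```python
import json
import urllib.request


def build_token_create_request(host, bearer_token, comment, lifetime_days=7):
    """Build (but do not send) a request asking the workspace for a new PAT.

    Assumption: the Token API accepts POST /api/2.0/token/create with a
    comment and a lifetime expressed in seconds.
    """
    if lifetime_days <= 0:
        raise ValueError("lifetime_days must be positive")
    payload = {
        "comment": comment,
        "lifetime_seconds": lifetime_days * 24 * 60 * 60,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/token/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request (for example with `urllib.request.urlopen(request, timeout=30)`) should return a JSON body containing the new secret; as with the UI flow, treat that value as shown once and store it securely.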

OAuth 2.0

Implementing OAuth 2.0 involves registering an application within Databricks (or your cloud IdP) and configuring the necessary redirect URIs and scopes. The flow typically involves:

  1. Registering an OAuth Client: In the Databricks account console or workspace, register an OAuth application. This provides a client ID and client secret.
  2. Authorization Request: Your application redirects the user to Databricks for authorization, requesting specific scopes.
  3. Authorization Grant: Upon user approval, Databricks redirects the user back to your application with an authorization code.
  4. Token Exchange: Your application exchanges the authorization code for an access token (and optionally a refresh token) by making a request to the Databricks token endpoint, using the client ID and secret.
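
Steps 2 and 4 above can be sketched as follows. This is an illustration under assumptions, not the definitive Databricks flow: the `/oidc/v1/authorize` path, the scope names used in the usage note, and the use of PKCE (RFC 7636) with a `state` value are assumptions layered on top of the steps described here, and all function names are hypothetical.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode


def make_pkce_pair():
    """Return (code_verifier, code_challenge) using the S256 method of RFC 7636."""
    verifier = secrets.token_urlsafe(64)
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    # Base64url without padding, as the PKCE specification requires.
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


def build_authorization_url(host, client_id, redirect_uri, scopes, state, code_challenge):
    """Step 2: the URL the user's browser is redirected to for approval."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,  # echoed back by the server; compare it to detect CSRF
        "code_challenge": code_challenge,
        "code_challenge_method": "S256",
    }
    # Assumed workspace-level authorization endpoint path.
    return f"{host.rstrip('/')}/oidc/v1/authorize?{urlencode(params)}"


def build_code_exchange_form(code, redirect_uri, code_verifier):
    """Step 4: form fields posted to the token endpoint with the returned code."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "code_verifier": code_verifier,
    }
```

A caller would generate the pair once per login attempt, keep the verifier and `state` server-side, send the user to the URL built with scopes such as `["all-apis", "offline_access"]`, and later post the exchange form together with the client ID and secret.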

For machine-to-machine (client credentials) flows, the application directly exchanges its client ID and secret for an access token without user interaction. Consult the Databricks OAuth documentation for specific implementation details.
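
A minimal sketch of the client credentials exchange, using only the standard library. The `/oidc/v1/token` path, the `all-apis` scope, and HTTP Basic client authentication are assumptions to confirm against the Databricks OAuth documentation; the function names are hypothetical.

```python
import base64
import json
import urllib.request
from urllib.parse import urlencode


def build_client_credentials_request(host, client_id, client_secret, scope="all-apis"):
    """Build the token request for a machine-to-machine (client credentials) flow."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    form = urlencode({"grant_type": "client_credentials", "scope": scope})
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/oidc/v1/token",  # assumed token endpoint path
        data=form.encode("ascii"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request, timeout=30):
    """Send the request and return (access_token, expires_in_seconds)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["access_token"], body.get("expires_in")
```

The returned access token goes into the `Authorization: Bearer` header of subsequent API calls. Because no refresh token is involved in this flow, the application simply repeats the exchange when the token nears expiry.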

Cloud-specific Service Principals/Accounts/IAM

For cloud-native authentication methods:

  • Azure AD Service Principals: Create an application registration in Azure Active Directory, generate a client secret or upload a certificate, and assign appropriate roles in Databricks. Refer to Microsoft's guide on Azure Databricks service principal authentication.
  • Google Cloud Service Accounts: Create a service account in your Google Cloud project, generate a JSON key file, and grant necessary permissions within Databricks.
  • AWS IAM Roles/Users: Configure IAM roles or users in AWS with policies that grant access to Databricks resources. Utilize AWS CLI or SDKs to assume roles or obtain temporary credentials.
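
As one illustration of the cloud-native route, an Azure service principal can exchange its client secret for a token at the Microsoft identity platform token endpoint. This sketch assumes the v2.0 endpoint and the well-known Azure Databricks resource ID `2ff814a6-3304-4ab8-85cb-cd0e6f879c1d` as the scope; verify both against Microsoft's guide before use. The helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: application ID that identifies the Azure Databricks resource.
AZURE_DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def build_entra_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for a service principal client credentials request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{AZURE_DATABRICKS_RESOURCE_ID}/.default",
    }
    return url, urlencode(form).encode("ascii")
```

Posting `form_body` to `url` with a form-encoded content type should yield a JSON response whose `access_token` can be used as a bearer token against the workspace, provided the service principal has been added to Databricks with suitable roles.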

Authenticated request example

The following Python example makes an authenticated request to the Databricks REST API using a Personal Access Token (PAT). It uses the requests library to list all clusters in a Databricks workspace.


import requests
import os

# Replace the placeholder with your Databricks workspace URL
databricks_host = "https://<your-workspace-url>"
# Read the PAT from an environment variable rather than hardcoding it
databricks_pat = os.environ.get("DATABRICKS_PAT")

if not databricks_pat:
    raise ValueError("DATABRICKS_PAT environment variable not set.")

headers = {
    "Authorization": f"Bearer {databricks_pat}",
    "Content-Type": "application/json"
}

# Set a timeout so the call cannot hang indefinitely on network problems
response = requests.get(f"{databricks_host}/api/2.0/clusters/list", headers=headers, timeout=30)

if response.status_code == 200:
    clusters = response.json()
    print("Clusters in workspace:")
    for cluster in clusters.get("clusters", []):
        print(f"- {cluster['cluster_name']} (ID: {cluster['cluster_id']})")
else:
    print(f"Error: {response.status_code} - {response.text}")

For OAuth 2.0, first obtain an access token, then pass it in the Authorization: Bearer header exactly as in the PAT example. Only the token acquisition step is more involved.
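
Because OAuth access tokens are short-lived, callers usually cache the token and fetch a new one shortly before it expires. The class below is a hedged sketch of that pattern; `CachedToken`, its `fetch` callable, and the 60-second safety margin are illustrative choices rather than part of any Databricks SDK.

```python
import time


class CachedToken:
    """Cache an access token and refresh it shortly before it expires."""

    def __init__(self, fetch, skew_seconds=60, clock=time.monotonic):
        self._fetch = fetch        # callable returning (access_token, expires_in_seconds)
        self._skew = skew_seconds  # refresh this many seconds before expiry
        self._clock = clock        # injectable clock, which keeps the class testable
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, calling fetch only when the cached one is stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = now + expires_in
        return self._token

    def auth_header(self):
        """Header dict usable in place of the PAT header in the example above."""
        return {"Authorization": f"Bearer {self.get()}"}
```

With this in place, `headers = cache.auth_header()` replaces the static PAT header, and the surrounding request code stays unchanged.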

Security best practices

Adhering to security best practices is essential when managing authentication for Databricks:

  • Least Privilege: Grant only the minimum necessary permissions to users, service principals, or tokens. Regularly review and adjust permissions.
  • Rotate Credentials: Implement a regular rotation schedule for all credentials, especially Personal Access Tokens and client secrets. For PATs, consider setting a short lifetime when generating them.
  • Secure Storage: Never hardcode credentials in code. Use secure methods like environment variables, cloud secret managers (e.g., AWS Secrets Manager, Azure Key Vault, Google Secret Manager), or Databricks secret scopes to store and retrieve sensitive information.
  • Multi-Factor Authentication (MFA): Enable MFA for all user accounts accessing Databricks through your identity provider.
  • IP Access Lists: Configure IP access lists to restrict access to your Databricks workspace to trusted networks only.
  • Audit Logging: Regularly review Databricks audit logs to monitor authentication attempts, access patterns, and resource modifications. This helps detect and respond to suspicious activity.
  • SCIM Provisioning: Utilize System for Cross-domain Identity Management (SCIM) with your identity provider to automate user and group provisioning and de-provisioning. This ensures that access is automatically revoked when an employee leaves the organization.
  • Use OAuth 2.0 for Applications: For programmatic access by applications or services, prefer OAuth 2.0 over PATs due to its support for short-lived, refreshable tokens and clearer separation of concerns.
  • Monitor for Anomalies: Implement monitoring and alerting for unusual authentication patterns, such as multiple failed login attempts, access from unexpected locations, or token usage outside of normal operating hours.
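
The credential rotation practice above can be partly automated by periodically auditing existing tokens. The sketch below assumes token metadata shaped like the entries returned by the Token API list call, with `token_id`, `creation_time`, and `expiry_time` in epoch milliseconds and `-1` meaning no expiry. That shape, the function name, and the 90-day threshold are assumptions to adapt to your environment.

```python
from datetime import datetime, timedelta, timezone


def tokens_needing_rotation(token_infos, max_age_days=90, now=None):
    """Return (token_id, reason) pairs for tokens that should be rotated.

    A token is flagged when it never expires or when it was created more
    than max_age_days ago.
    """
    now = now or datetime.now(timezone.utc)
    flagged = []
    for info in token_infos:
        created = datetime.fromtimestamp(info["creation_time"] / 1000, tz=timezone.utc)
        if info.get("expiry_time", -1) == -1:
            flagged.append((info["token_id"], "no expiry set"))
        elif now - created > timedelta(days=max_age_days):
            flagged.append((info["token_id"], f"older than {max_age_days} days"))
    return flagged
```

Running such a check on a schedule and alerting on a non-empty result gives an early signal before long-lived credentials become a liability.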