Skip to main content

Command Palette

Search for a command to run...

Terraform Stops at the Metastore: Managing Unity Catalog with Jinja2 and CI/CD

Updated
18 min read
N
Data Platform Engineer / de-facto Architect at a large manufacturing company. Building enterprise-grade Databricks platforms on Azure under real organizational constraints. Writing about governance design, IaC, and architectural decision-making.

The Question Most Teams Skip

Databricks Terraform provider supports databricks_catalog, databricks_schema, and databricks_grant. You can manage Unity Catalog entirely in Terraform. So why didn't I?

Because can and should are different questions, and conflating them is how you end up with data engineers filing infra tickets every time they need a new schema — and permission changes blocked on a team that does not understand Unity Catalog's privilege model.

This is Part 2 of a series on CI/CD for a mock Databricks data platform on Azure. Part 1 covered the infrastructure layer: Azure resources, Databricks Workspace, and Metastore setup via Terraform. This post covers the Platform Layer — Catalog, Schema, and Permissions — and why I drew the tool boundary exactly where I did.


Where Terraform's Ownership Ends

The core principle: tool ownership should mirror team ownership.

In any organization working with Databricks on Azure, there are at least two distinct concerns:

  • An infrastructure team that manages Azure resources, networking, RBAC, and the Databricks Workspace

  • A data platform team that manages the data layer — catalogs, schemas, permissions, and eventually jobs and pipelines

Terraform is the right tool for infrastructure. It has state management, drift detection, and plan/apply workflow that makes sense for long-lived infrastructure objects. But ask a data engineer to write a databricks_grant Terraform resource, understand what terraform import means when a schema already exists, or navigate a Terraform plan review — and you are handing them the wrong tool for their job.

The decision, codified in ADR-001:

Resource Tool Reason
Azure Resources Terraform Network + security belong to infra team
Metastore Terraform One-time setup; state management is critical
Catalog / Schema Jinja2 + SQL Data engineers can own it; no Terraform expertise needed
Permissions / Grants Jinja2 + SQL Same lifecycle as the objects they govern
Tables SQL DDL (planned) Data layer ownership stays with data team

The trade-off for moving to Jinja2 + SQL: no drift detection. Terraform plan shows exactly what will change before it happens. SQL CREATE IF NOT EXISTS is idempotent — re-running is always safe — but it will not detect deletions. If someone drops a schema manually, the platform layer has no equivalent of terraform plan to surface that discrepancy. Drift is detected only at apply time, when the missing object would be recreated. This is an accepted limitation. The trade-off is that data engineers gain autonomy without needing to learn Terraform.


Three Approaches — Two of Them Wrong for This Context

Before settling on Jinja2 + SQL, I evaluated three options. In a different organizational context — a mature platform team fluent in Terraform, for instance — all three might have merit. In this one, a small team with mixed tooling maturity where the data platform team should not need Terraform knowledge to ship a schema change, two of them were wrong.

Option 1: Full Terraform for everything

This works technically. databricks_catalog, databricks_schema, and databricks_grant are mature resources. The problem is operational.

Every time a data engineer needs a new schema, they must open a Terraform file, add a resource block, and submit a PR through the infra team's review process. Every permission change follows the same path. The infra team has different priorities and a different release cadence. The data team is blocked.

Even in a solo or small-team context where one person owns both layers, the mental model mismatch remains: Terraform's plan/apply cycle is optimized for infrastructure that changes infrequently, not for schema and permission changes that happen routinely.

Rejected because it misaligns tool ownership with team ownership.

Option 2: Raw SQL notebooks without parametrization

Write a SQL notebook, hardcode catalog and schema names, run manually or via a job.

Problems surface immediately. Three environments mean three separate notebooks or f-string interpolation everywhere. f-strings are not a templating system — they offer no separation between template and rendering logic, no way to review the final SQL before execution, and no meaningful diff in code review.

Rejected because it does not scale beyond trivial DDL.

Option 3: Jinja2 + SQL with YAML config (the decision)

Jinja2 is a proper templating engine. The rendered SQL is reviewable before execution. Data engineers can read and modify .sql.j2 files without understanding Python beyond the basics.

But the implementation went further than a single parametrized template. The platform layer uses YAML config files as the source of truth and dedicated Jinja2 templates per concern — one for catalog creation, one for schema creation, one for permissions. The notebook is a generic runner that reads configs and calls templates in dependency order.

Why the separation? A monolithic template parametrized with a variable couples the template structure to that variable's semantics. When those semantics change, the template changes too. Separating config from template makes each part independently editable by the right person.


The Implementation

One Catalog, Not Three

An early design sketch had environment-specific catalogs: mock_dev, mock_staging, mock_prod. A Jinja2 {{ env }} variable would render the catalog name based on a CI parameter.

The actual implementation landed differently: a single catalog named mock.

The reason is scope clarity. The mock platform demonstrates architectural patterns — the config-driven deployment model, the group-based permission structure, the CI/CD execution policy. Adding environment-differentiated catalogs would add operational complexity without validating any additional architectural principle.

platform/configs/catalog_schema.yaml:

catalog:
  name: mock
  comment: "Mock data platform catalog"

schemas:
  - name: bronze
    comment: "Raw ingested data"
  - name: silver
    comment: "Cleaned and validated data"
  - name: gold
    comment: "Business-ready aggregations"

The dev and prod targets in the Asset Bundle are not about separate catalogs. They control deployment mode: mode: development prefixes the job name with the deploying principal's username (useful in shared workspaces for disambiguation); mode: production produces a clean job name. Both targets deploy to the same workspace and the same mock catalog.

In a production multi-workspace design — which the README documents as the target architecture — you would have separate workspaces per environment, each running the same platform layer templates against their own catalog. The catalog name would move into the config, and the same templates would deploy against each workspace. The single-catalog simplification in the mock does not invalidate the approach. It defers multi-environment catalog management to when there are actually multiple workspaces to deploy against.

The Configs: Declarative Intent

grants.yaml declares the entire access model:

catalog_grants:
  - group: data_platform_admins
    privileges: [USE_CATALOG, CREATE_SCHEMA]
  - group: data_engineers
    privileges: [USE_CATALOG]
  - group: data_consumers
    privileges: [USE_CATALOG]

schema_grants:
  bronze:
    - group: data_platform_admins
      privileges: [USE_SCHEMA, SELECT, MODIFY, CREATE_TABLE, CREATE_FUNCTION]
    - group: data_engineers
      privileges: [USE_SCHEMA, SELECT, MODIFY, CREATE_TABLE]
  silver:
    - group: data_platform_admins
      privileges: [USE_SCHEMA, SELECT, MODIFY, CREATE_TABLE, CREATE_FUNCTION]
    - group: data_engineers
      privileges: [USE_SCHEMA, SELECT, MODIFY, CREATE_TABLE]
  gold:
    - group: data_platform_admins
      privileges: [USE_SCHEMA, SELECT, MODIFY, CREATE_TABLE, CREATE_FUNCTION]
    - group: data_engineers
      privileges: [USE_SCHEMA, SELECT, MODIFY]
    - group: data_consumers
      privileges: [USE_SCHEMA, SELECT]

GRANTs are at Schema level, not Catalog level. Within a single catalog, bronze (raw, potentially PII-bearing ingestion data) and gold (curated, business-ready aggregations) carry different access requirements. Catalog-level grants would over-provision bronze for groups that should only read gold.

The differences between schemas are intentional: data_engineers have CREATE_TABLE on bronze and silver but not gold. data_consumers do not appear in bronze or silver at all. These are not omissions — they are the access model.

The Templates: One Concern Each

create_catalog.sql.j2:

-- ADR-001: Catalog is managed in Platform Layer, NOT Terraform
-- ADR-003: Idempotent by construction

CREATE CATALOG IF NOT EXISTS {{ catalog.name }}
  MANAGED LOCATION '{{ managed_location }}'
  COMMENT '{{ catalog.comment }}';

create_schema.sql.j2:

-- ADR-001: Schema is managed in Platform Layer, NOT Terraform
-- ADR-003: Idempotent by construction

{% for schema in schemas %}
CREATE SCHEMA IF NOT EXISTS {{ catalog.name }}.{{ schema.name }}
  COMMENT '{{ schema.comment }}';
{% endfor %}

grant_permissions.sql.j2:

-- ADR-005: Schema-level GRANT -- group-based, declarative
-- ADR-003: GRANT re-execution is safe (idempotent per Databricks specification)

-- Catalog-level grants
{% for grant in catalog_grants %}
GRANT {{ grant.privileges | join(', ') }} ON CATALOG `{{ catalog.name }}` TO `{{ grant.group }}`;
{% endfor %}

-- Schema-level grants
{% for schema_name, grants in schema_grants.items() %}
{% for grant in grants %}
GRANT {{ grant.privileges | join(', ') }} ON SCHEMA `{{ catalog.name }}`.`{{ schema_name }}` TO `{{ grant.group }}`;
{% endfor %}
{% endfor %}

Each template is readable without knowing the rendering context. A data engineer can modify grants.yaml, preview the SQL the template would generate, and open a PR — without touching the notebook or the CI workflow.

The Notebook: A Generic Runner

setup_platform.py receives two parameters and handles the rest:

dbutils.widgets.text("storage_account_name", "")
dbutils.widgets.text("uc_root_container", "")

storage_account_name = dbutils.widgets.get("storage_account_name")
uc_root_container = dbutils.widgets.get("uc_root_container")

No env widget. The catalog name comes from catalog_schema.yaml. The only runtime inputs are the storage account parameters, which change across workspace recreations and are read from Terraform output at deploy time.

Path resolution uses the same principal-agnostic approach as Part 1:

notebook_path = (
    dbutils.notebook.entry_point
    .getDbutils().notebook().getContext().notebookPath().get()
)
bundle_root = os.path.dirname(os.path.dirname(notebook_path))
configs_dir = f"/Workspace{bundle_root}/configs"
templates_dir = f"/Workspace{bundle_root}/templates"

Hardcoding paths would tie the notebook to a specific service principal username and deployment prefix. After a workspace destroy and recreate, the service principal ID changes and a hardcoded path silently breaks. notebookPath() is principal-agnostic and requires zero configuration.

The render_and_execute helper strips comment lines, splits on ;, and executes each statement via spark.sql(). The key behavioral flag is warn_on_principal_missing:

except Exception as e:
    if warn_on_principal_missing and "PRINCIPAL_DOES_NOT_EXIST" in str(e):
        print(f"WARNING: principal not found -- skipping grant.\n"
              f"  Create the group as an account-level group first.\n"
              f"  Details: {e}\n")
    else:
        raise

Unity Catalog GRANT statements require account-level groups. If a group does not yet exist at the account level, GRANT raises PRINCIPAL_DOES_NOT_EXIST. Rather than failing the job, the notebook emits a warning and continues. The job succeeds; the skipped grants are re-applied after the groups are created. Since all GRANT statements are idempotent, re-running is always safe.

Templates are called in dependency order:

render_and_execute("create_catalog.sql.j2", {"catalog": catalog, "managed_location": managed_location})
render_and_execute("create_schema.sql.j2", {"catalog": catalog, "schemas": schemas})
render_and_execute("grant_permissions.sql.j2", {
    "catalog": catalog,
    "catalog_grants": grants_config["catalog_grants"],
    "schema_grants": grants_config["schema_grants"],
}, warn_on_principal_missing=True)

Both Jinja2 and PyYAML are available in Databricks Runtime 11+. PyYAML is declared explicitly as a PyPI library in databricks.yml regardless — making dependencies visible in the bundle configuration guards against silent breakage if pre-installed versions change across runtime upgrades.

The Bundle: Two Targets

platform/databricks.yml:

bundle:
  name: mock-platform-catalog

variables:
  storage_account_name:
    description: "ADLS Gen2 storage account name (from workload-azure TF output)"
    default: ""
  uc_root_container:
    description: "ADLS container for UC metastore root (from workload-azure TF output)"
    default: ""

targets:
  dev:
    mode: development
  prod:
    mode: production

No staging target. With a single mock catalog, a third bundle target adds no value. The dev and prod targets differ only in deployment mode semantics — job name prefix for disambiguation in dev, clean name in prod.

The cluster is single-node — DDL statements execute serially with no distributed computation:

new_cluster:
  num_workers: 0
  spark_conf:
    spark.master: "local[*, 4]"
    spark.databricks.cluster.profile: "singleNode"
  custom_tags:
    ResourceClass: SingleNode

Permissions Are Platform Work Too

Unity Catalog permissions belong in the Platform Layer, not in Terraform.

The reasoning in ADR-005: Catalog, Schema, and their GRANTs share the same lifecycle — created, modified, and retired together by the data platform team. Placing GRANTs in Terraform while the objects they govern are managed in SQL splits responsibility for the same resource across two tools and two teams. That breaks the boundary principle ADR-001 established.

The permission model has three groups:

groups:
  - name: data_platform_admins
  - name: data_engineers
  - name: data_consumers

data_platform_admins: Full access across all schemas. Responsible for catalog operations, grant management, audit, and troubleshooting.

data_engineers: Full DDL access on bronze and silver (including CREATE_TABLE); data modification rights on gold (no CREATE_TABLE). The intent: ETL jobs write to gold via CI/CD jobs running as a service principal; engineers do not create tables there directly.

data_consumers: USE_SCHEMA and SELECT on gold only. Not present in bronze or silver.

Permissions are assigned to groups only — never to individual users. Individual grants fail at scale in three ways:

  • Offboarding: removing a user from the Entra ID group revokes all Databricks access automatically. Individual grants require manual cleanup per resource.

  • Audit: "Who has read access to the gold schema?" is answered by listing data_consumers members — one query. With individual grants, the answer requires enumerating all grant records across every resource.

  • Onboarding: adding a new team member to the correct Entra ID group grants all necessary access in one operation. Individual grants require updating every resource separately.

Group structure and GRANT statements are declared in the repository and managed via CI/CD. Group membership — who belongs to each group — is managed outside the repository. In the mock environment, membership is managed via Databricks CLI or Account Console. In production, Entra ID Native Sync propagates group membership automatically when users authenticate with their Microsoft account. Native Sync is preferred over SCIM because it supports nested groups and requires no external SCIM application configuration.


The CI/CD Workflow

workload-catalog.yaml has two triggers:

on:
  push:
    branches: [ "main" ]
    paths: [ "platform/**" ]
  workflow_dispatch: {}

push to main deploys to prod. workflow_dispatch (with no inputs) always targets dev. There is no target selection dropdown — removed when the staging target was eliminated.

- name: Set bundle target
  id: target
  run: |
    if [ "${{ github.event_name }}" = "push" ]; then
      echo "TARGET=prod" >> $GITHUB_OUTPUT
    else
      echo "TARGET=dev" >> $GITHUB_OUTPUT
    fi

There is no pull_request: trigger. Azure federated credentials are configured only for push and workflow_dispatch OIDC subjects — a PR event generates a different subject that is not in the allow-list. Beyond the technical constraint: databricks bundle validate checks syntax but does not preview SQL execution or workspace object state. A PR check that validates YAML structure but not DDL semantics would create false confidence rather than useful signal.

The Preflight Check

Before deploying, the workflow verifies the workspace is Unity Catalog-ready:

- name: Preflight -- verify UC external location exists
  run: |
    if ! RAW=$(databricks external-locations list --output json 2>&1); then
      echo "::error::Preflight failed: $RAW"
      echo "  workload-dbx must be applied before workload-catalog."
      echo "  See docs/runbooks/destroy-recreate.md for the required deployment order."
      exit 1
    fi
    COUNT=\((echo "\)RAW" | jq '[.[] | select(.name == "uc-root-location")] | length' 2>/dev/null || echo "0")
    if [ "$COUNT" -eq 0 ]; then
      echo "::error::UC external location 'uc-root-location' not found."
      echo "  workload-dbx must be applied before workload-catalog."
      exit 1
    fi
    echo "Preflight passed: external location 'uc-root-location' found."

This guard exists because workload-catalog depends on workload-dbx — the Metastore and its External Location must exist before catalog DDL can run. Without the preflight, a missing Metastore surfaces as AnalysisException: METASTORE_DOES_NOT_EXIST five minutes into the bundle run, after the cluster has already started.

The exit code is captured explicitly. An earlier version had a shell bug where a non-zero CLI exit caused an empty COUNT, [ "" -eq 0 ] returned exit 2 (integer expression expected), and the if context treated that as false — printing "Preflight passed" when the Metastore was absent. The fix captures stdout and stderr together and checks the CLI exit code before parsing the output.

Storage parameters are read from Terraform state at deploy time — the same pattern as Part 1:

- name: Capture workload-azure outputs
  id: azout
  run: |
    echo "STORAGE_ACCOUNT_NAME=\((terraform -chdir=infra/workload-azure output -raw storage_account_name)" >> \)GITHUB_OUTPUT
    echo "UC_ROOT_CONTAINER=\((terraform -chdir=infra/workload-azure output -raw uc_root_container)" >> \)GITHUB_OUTPUT

If the storage account changes after a workspace recreate, the catalog workflow picks it up automatically.

Deploy and run are sequential and explicit:

- name: Bundle deploy
  working-directory: platform
  run: |
    databricks bundle deploy --target ${{ steps.target.outputs.TARGET }} \
      --var "storage_account_name=${{ steps.azout.outputs.STORAGE_ACCOUNT_NAME }}" \
      --var "uc_root_container=${{ steps.azout.outputs.UC_ROOT_CONTAINER }}"

- name: Bundle run -- setup_platform
  working-directory: platform
  run: |
    databricks bundle run setup_platform \
      --target ${{ steps.target.outputs.TARGET }} \
      --var "storage_account_name=${{ steps.azout.outputs.STORAGE_ACCOUNT_NAME }}" \
      --var "uc_root_container=${{ steps.azout.outputs.UC_ROOT_CONTAINER }}" \
      --no-wait=false

--no-wait=false blocks the workflow until the job completes. Without it, CI exits immediately after triggering the run, and a failed DDL execution shows green.


What Cannot Be Automated

Two categories of manual prerequisites survive every destroy-recreate cycle.

Step 1 — SP Grants (run after workload-dbx, before workload-catalog)

A human user with the metastore admin role must run these two grants in a Databricks SQL warehouse or notebook:

-- Required for workload-dbx to create the External Location
GRANT CREATE EXTERNAL LOCATION ON METASTORE TO '<SP_client_id>';

-- Required for workload-catalog to create Catalogs via the Jinja2 DDL notebook
GRANT CREATE CATALOG ON METASTORE TO '<SP_client_id>';

<SP_client_id> is the value of the AZURE_CLIENT_ID GitHub repository secret.

The service principal running CI cannot grant itself account-level Unity Catalog privileges — that requires a metastore admin. This is a security boundary, not a design limitation. A service principal capable of self-granting metastore-scope privileges would be a significant escalation risk.

Verify after running:

SHOW GRANTS ON METASTORE;
-- SP should appear with CREATE EXTERNAL LOCATION and CREATE CATALOG

Step 2 — Account-Level Groups (before or after workload-catalog)

Unity Catalog GRANT statements require account-level groups. Workspace-local groups — created via the workspace SCIM API — are not visible to UC GRANT. They live in a different scope.

The recommended method is the Account Console GUI (Databricks Account Console → User Management → Groups → Add Group). Create:

  • data_platform_admins

  • data_engineers

  • data_consumers

If you prefer the CLI, use the databricks account subcommand — the workspace-level databricks groups command targets the wrong API and creates workspace-local groups, not account-level ones:

databricks account groups create --display-name data_platform_admins --profile account
databricks account groups create --display-name data_engineers --profile account
databricks account groups create --display-name data_consumers --profile account

Note: the Databricks CLI does not implement account-level member addition. To add a member via CLI, you must call the Account SCIM API directly with a Databricks-scoped token — see the platform runbook for the full curl command.

Groups can be created before or after workload-catalog runs. The setup_platform job attempts all GRANT statements and emits a WARNING (not an error) for any group that does not yet exist at the account level. Re-run workload-catalog after creating the groups — all GRANT statements are idempotent.

The Destroy-Recreate Pattern

For cost management, the mock platform is routinely torn down between active sessions. Destroy order is mandatory (full procedure in docs/runbooks/destroy-recreate.md):

1. Destroy workload-dbx first
   Removes: UC Metastore, Storage Credential, External Location
2. Destroy workload-azure second
   Removes: Databricks Workspace, ADLS, Access Connector

Destroying Azure before Databricks orphans Unity Catalog account-scope objects. The next workload-dbx apply fails with a "Storage Credential already exists" error that requires manual cleanup in the Account Console.

The recreate order for the catalog layer specifically: workload-azure -> workload-dbx -> Step 1 SP grants -> workload-catalog (Step 2 groups can follow at any point). The preflight guard enforces the workload ordering — workload-catalog will fail fast and explicitly if triggered before workload-dbx completes.


What This Unlocks

A data engineer can add a new schema by editing two files — catalog_schema.yaml (add the schema entry) and grants.yaml (add the access policy). They open a PR, merge it, and CI deploys the changes. No infra team involvement. No Terraform plan to review. No state file to reason about. The new schema exists, with the correct permissions applied, after the merge.

Permissions are managed the same way — as code, in the same PR, reviewable, auditable, idempotent. "Who has access to the gold schema?" is answered by reading grants.yaml. What actually ran in production is in Databricks system tables (system.access.audit).

The infra team retains full control of Azure resources and the Metastore. The data platform team moves independently within their layer. The boundary is enforced in tooling, not in process.

That is the point. And because the templates are plain text with deterministic output, verifying a grants.yaml change before it runs in CI — rendering the template locally and asserting the SQL — is a testable concern worth its own treatment. That is coming in the Quality Engineering series.


Series Navigation

The mock platform repository is at github.com/nobhri/azure-dbx-mock-platform. ADRs referenced throughout: ADR-001 and ADR-005.