Why I Built a Mock Data Platform — and What the CI/CD Taught Me

There is a gap between "I know Terraform" and "I've designed a platform." The first is a skill. The second is a pattern of thinking — about ownership boundaries, failure modes, team capabilities, and the decisions you have to make before you write a single line of HCL.

I built azure-dbx-mock-platform to close that gap in my own portfolio. Not as a tutorial, not as a getting-started guide. As a design journal: a record of the architectural decisions I made, the alternatives I rejected, and what broke along the way.

This is Part 1 of a 3-part series. It covers the infrastructure foundation: Azure provisioning, Unity Catalog Metastore, and the CI/CD pipeline that ties them together. Parts 2 and 3 will cover catalog/schema management with Jinja2 and job authoring with Asset Bundles.

The Architecture in One Diagram

The platform is organized into four independent layers, each with its own Terraform state, its own workflow, and its own blast radius:

Bootstrap
  └── creates the tfstate backend (Storage Account + containers)

Guardrails
  └── subscription-level budget alert

Workload-Azure
  └── Resource Group, ADLS Gen2, Access Connector, Databricks Workspace

Workload-Databricks
  └── Unity Catalog Metastore, Storage Credential, External Location

Within each workspace, the layer separation continues:

+------------------------------------------+
| Azure Layer           (Terraform)        |
| VNet · Storage · RBAC · Workspace        |
+------------------------------------------+
| Databricks Account Layer  (Terraform)    |
| Metastore · Storage Credential           |
+------------------------------------------+
| Catalog / Schema Layer  (Jinja2 + SQL)   |
| Environment-parametrized DDL             |
+------------------------------------------+
| Job / Workflow Layer  (Asset Bundles)    |
| Idempotent ETL jobs                      |
+------------------------------------------+

Each layer has a different rate of change, different team ownership, and different failure blast radius. That last point is the one most people underweight.

Why four layers and not one?

A single Terraform workspace managing everything from the VNet to the catalog schemas creates hidden coupling that eventually breaks — usually in production. When a data engineer needs to add a schema, they shouldn't be touching the same state file as the VNet. When a destroy operation fails on a catalog object, it shouldn't put the entire Azure infrastructure in a locked state.

The four-layer separation enforces this at the tool level. A bug in workload-dbx can't corrupt workload-azure state. A destroy of the Databricks layer is a separate, intentional operation from destroying Azure resources. Blast radius is bounded by design, not by convention.

Terraform owns infra and metastore; catalog and schema are deliberately delegated to Jinja2 + SQL. The reason: a data engineer adding a schema shouldn't require Terraform expertise or infra team review. Tool ownership follows team ownership — the principle behind ADR-001, and the central theme of Part 2.

The production gaps this architecture knowingly accepts — no private networking, no multi-environment separation, no IP allowlist — are documented in Production Considerations below.

Authentication: Zero Stored Credentials

The single most important decision in this platform's CI/CD design: no stored secrets anywhere.

Every GitHub Actions workflow authenticates to Azure via OIDC federated identity. No service principal passwords. No client secrets. No credentials that rotate, leak, or get forgotten in .env files.

How it works:

GitHub Actions runner
  → presents OIDC token (signed by GitHub, scoped to this repo + branch)
  → Azure Entra ID validates the token against a configured federated credential
  → issues a short-lived access token for the Service Principal
  → Terraform uses this token via the azurerm provider

In the workflow file, it looks like this:

# .github/workflows/workload-azure.yaml
permissions:
  id-token: write   # required for OIDC
  contents: read

jobs:
  tf:
    steps:
      - name: Azure login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

AZURE_CLIENT_ID is not a secret in the traditional sense — it's a non-sensitive identifier. The actual authentication is cryptographic, not credential-based.

ADR-002: Why not Service Principal secrets?

Secrets expire and have to be rotated. They get forgotten in GitHub Actions environment variables set months ago. They leak in CI logs when someone adds echo $ARM_CLIENT_SECRET for debugging. OIDC has no secret to manage, integrates natively with Entra ID, and limits each credential's scope to the repository and the branch or event named in its subject.

The OIDC subject gotcha

OIDC federated credentials in Azure require you to explicitly configure which GitHub Actions subjects are allowed. A subject is a string like:

repo:nobhri/azure-dbx-mock-platform:ref:refs/heads/main
repo:nobhri/azure-dbx-mock-platform:pull_request

The gotcha: pull_request events carry a different OIDC subject than push to main. If you configure your federated credential for refs/heads/main only, a PR-time terraform plan silently fails to authenticate. The workflow appears to run, but the plan is operating without valid Azure credentials.

This cost me a debugging session. The fix is to add a separate federated credential for the pull_request subject. A plan that silently runs without valid credentials produces output that looks correct but is meaningless — which makes it worse than a plan that fails loudly. It appears again in the failures section for exactly that reason.
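One way to keep both subjects visible is to generate the federated-credential definitions from a single place. The following is a minimal sketch, not the repository's actual setup: the function name `fic_params` and the credential names `gha-main` and `gha-pr` are hypothetical, while the issuer and audience are the standard values for GitHub Actions OIDC against Entra ID. The Azure CLI call appears only as a comment.

```shell
# Print the JSON body for one federated credential.
# $1 = credential name (hypothetical), $2 = OIDC subject to trust
fic_params() {
  printf '{"name":"%s","issuer":"https://token.actions.githubusercontent.com","subject":"%s","audiences":["api://AzureADTokenExchange"]}\n' "$1" "$2"
}

REPO="nobhri/azure-dbx-mock-platform"

# One credential per subject: push to main and pull_request events differ.
fic_params "gha-main" "repo:${REPO}:ref:refs/heads/main"
fic_params "gha-pr"   "repo:${REPO}:pull_request"

# Each JSON body could then be passed to the Azure CLI, for example:
#   az ad app federated-credential create --id "$APP_ID" --parameters "$(fic_params ...)"
# ($APP_ID is a placeholder for the app registration's ID.)
```

Listing the two subjects side by side makes it harder to forget that a PR-time plan needs its own credential.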

Dual provider pattern for Databricks

The Databricks Terraform provider requires two separate configurations for Unity Catalog operations: account scope and workspace scope.

# infra/workload-dbx/providers.tf

# Account-scope: for Metastore creation and workspace assignment
provider "databricks" {
  alias           = "account"
  host            = "https://accounts.azuredatabricks.net"
  account_id      = var.databricks_account_id
  auth_type       = "azure-cli"
  azure_tenant_id = var.azure_tenant_id
}

# Workspace-scope: for Storage Credential and External Location
provider "databricks" {
  alias                       = "workspace"
  azure_workspace_resource_id = var.azure_workspace_resource_id
  auth_type                   = "azure-cli"
  azure_tenant_id             = var.azure_tenant_id
}

Each resource in workload-dbx explicitly declares which provider it uses. The metastore and its workspace assignment use databricks.account. Storage credentials and external locations use databricks.workspace. Mixing these up produces cryptic errors about missing permissions — the account-scope provider can't see workspace-scope resources, and vice versa.

State Isolation

Three separate tfstate files. Three separate blob containers. One dedicated Storage Account created by Bootstrap.

# infra/bootstrap/main.tf

# Storage account for tfstate
resource "azurerm_storage_account" "tfstate" {
  name                     = var.tfstate_sa_name
  resource_group_name      = azurerm_resource_group.tfstate.name
  location                 = azurerm_resource_group.tfstate.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  min_tls_version          = "TLS1_2"
}

# Separate containers per layer
resource "azurerm_storage_container" "guardrails" {
  name                  = var.guardrails_container
  storage_account_name  = azurerm_storage_account.tfstate.name
  container_access_type = "private"
}

resource "azurerm_storage_container" "workload" {
  name                  = var.workload_container
  storage_account_name  = azurerm_storage_account.tfstate.name
  container_access_type = "private"
}

Bootstrap itself uses local (ephemeral) state on the runner — a deliberate chicken-and-egg decision. You can't store Bootstrap's own state in a backend that doesn't exist yet. The trade-off: Bootstrap is not idempotent in the strict sense. Re-running it requires the storage resources to already exist or to be manually reconciled. This is acceptable because Bootstrap runs exactly once, manually, via workflow_dispatch.

Concurrency control

The azurerm backend locks state by taking a lease on the state blob. If a workflow is cancelled mid-apply, the lease can remain, blocking future runs. Every workflow has two safeguards:

# Workflow-level: prevents parallel runs of the same workflow
concurrency:
  group: tf-workload-azure
  cancel-in-progress: true

# Step-level: breaks stale leases before init
- name: Preflight — break stale lease if any
  run: |
    az storage blob show ... --query "properties.leaseState" -o tsv \
      | grep -q leased && az storage blob lease break ... || true

# And on cancellation:
- name: Break lease if cancelled
  if: cancelled()
  run: az storage blob lease break ...

The concurrency key prevents two runs of the same workflow from racing. The preflight step handles leases left by external processes or previous failed runs. Both are necessary.

The Pipeline, Layer by Layer

bootstrap.yaml — one-time, manual only

bootstrap.yaml has exactly one trigger: workflow_dispatch. No path filters, no push triggers. This workflow exists to be run once, by a human, deliberately. Adding an automatic trigger would be actively dangerous — re-bootstrapping the tfstate backend against live state is how you lose it.

As described under State Isolation, Bootstrap keeps local (ephemeral) state on the runner because the remote backend doesn't exist yet. The trade-off is that Bootstrap isn't re-runnable in the strict idempotent sense, which is acceptable because it runs exactly once.

guardrails.yaml — a non-obvious requirement in budget automation

The guardrails layer sets a subscription-level budget alert. The non-obvious detail: Azure Budget requires start_date to not be in the past. A static date baked into the workflow file would cause terraform apply to fail the moment that date passed. The fix is to calculate it at runtime:

- name: Set budget dates
  run: |
    echo "BUDGET_START=$(date -u +'%Y-%m-01T00:00:00Z')" >> $GITHUB_ENV
    echo "BUDGET_END=$(($(date -u +'%Y') + 1))-01-01T00:00:00Z" >> $GITHUB_ENV

A small detail, but the kind that breaks a pipeline that has worked for months if you miss it.
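Pulling the date arithmetic into small functions makes it checkable outside a workflow run. This is a sketch under assumptions: the function names `budget_start` and `budget_end` are hypothetical, and the year and month are passed in as arguments so that fixed inputs can be tested.

```shell
# Budget start: first day of the given month, in the timestamp format the workflow uses.
# $1 = four-digit year, $2 = two-digit month
budget_start() {
  printf '%s-%s-01T00:00:00Z\n' "$1" "$2"
}

# Budget end: January 1 of the year after the given one.
# $1 = four-digit year
budget_end() {
  printf '%s-01-01T00:00:00Z\n' "$(( $1 + 1 ))"
}

# Derive both values from the current UTC date.
year="$(date -u +'%Y')"
month="$(date -u +'%m')"
echo "BUDGET_START=$(budget_start "$year" "$month")"
echo "BUDGET_END=$(budget_end "$year")"
```

In a workflow step, the two `echo` lines would be appended to `$GITHUB_ENV` rather than printed.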

workload-azure.yaml — three decisions that matter

The Azure layer provisions Resource Group, ADLS Gen2, Access Connector, and Databricks Workspace. Three decisions worth flagging:

is_hns_enabled = true is required for Unity Catalog. ADLS Gen2 with hierarchical namespace disabled looks identical from the outside but fails when UC tries to manage its root storage. This flag cannot be changed after the storage account is created — forgetting it means re-creating the account.

SystemAssigned Managed Identity on the Access Connector ties the identity lifecycle to the resource itself. Simpler than UserAssigned: no separate identity to manage, no RBAC assignments that outlive the connector.

RBAC in Terraform: the Storage Blob Data Contributor role assignment from Access Connector to the storage account is declared as a Terraform resource. No manual portal assignment, no undocumented permission that silently disappears on re-create.

workload-dbx.yaml — cross-layer state reference

The most architecturally interesting pattern in this pipeline: workload-dbx reads outputs from the workload-azure state file before running its own Terraform. This is how workspace resource IDs and storage account names flow between independent layers without being hardcoded or duplicated:

- name: Init workload-azure backend to read outputs
  run: terraform -chdir=infra/workload-azure init -backend-config=...

- name: Capture workload-azure outputs
  id: azout
  run: |
    echo "WORKSPACE_RESOURCE_ID=$(terraform -chdir=infra/workload-azure output -raw workspace_resource_id)" >> $GITHUB_OUTPUT
    echo "ACCESS_CONNECTOR_ID=$(terraform -chdir=infra/workload-azure output -raw access_connector_id)" >> $GITHUB_OUTPUT

Two independent state files, no shared variables file, no hardcoded values. The Databricks layer consumes what the Azure layer produced, with the state file as the contract between them.

One more thing visible in workload-dbx/main.tf: the catalog and schema resources are entirely commented out. This isn't an incomplete implementation — it's a deliberate boundary. Catalog/schema management belongs to Jinja2 + SQL (ADR-001), and the commented code is the explicit record of where that boundary was drawn.

What Broke (and What I Learned)

This is the section that differentiates a portfolio from a tutorial. Real systems fail in specific ways.

Variable mismatch hell

When workload-dbx was first connected to CI, it produced five simultaneous Terraform errors:

Error: Missing required argument
  on providers.tf line 23, in provider "databricks":
  23:   account_id = var.databricks_account_id

Error: Missing required argument
  ...

Five separate variables, all failing at once. The root cause: the -var flags in the workflow didn't match the variable names defined in variables.tf. Variable names had drifted during refactoring, but the workflow file wasn't updated in sync. Caught by code review, fixed in a single PR.

The lesson is not "be more careful." It's a CI design question. Misaligned -var flags fail at plan time, not apply time, so a CI pipeline that runs plan only on merge (and not on PRs) turns what should be a code review comment into a production incident. This is why every workflow in this platform runs terraform plan on every pull request, not as documentation, but as the actual gate.
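This kind of drift can also be caught before Terraform runs at all, by comparing the names passed with -var against the names declared in variables.tf. The sketch below is illustrative and not part of the repository: the function names are hypothetical, and it assumes flags are written as `-var name=value`, `-var "name=value"`, or `-var="name=value"`.

```shell
# Names declared as `variable "name"` in a Terraform file.
declared_vars() {
  sed -n 's/^[[:space:]]*variable[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Names passed via -var flags in a workflow file.
passed_vars() {
  grep -o -e '-var[= ]"\{0,1\}[A-Za-z_][A-Za-z0-9_]*=' "$1" \
    | sed 's/^-var[= ]"\{0,1\}//; s/=$//' | sort -u
}

# Print names the workflow passes that variables.tf does not declare.
# $1 = path to variables.tf, $2 = path to the workflow file
# Exit status is non-zero when at least one such name exists.
undeclared_vars() {
  declared="$(declared_vars "$1")"
  status=0
  for name in $(passed_vars "$2"); do
    if ! printf '%s\n' "$declared" | grep -qx "$name"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `undeclared_vars infra/workload-dbx/variables.tf .github/workflows/workload-dbx.yaml` could run as an early PR step. It complements the PR-time plan rather than replacing it, since it only checks names, not values or types.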

Hardcoded metastore UUID

In an early version of workload-dbx/main.tf, the storage root path included a hardcoded metastore UUID:

storage_root = "abfss://uc-root@mystorage.dfs.core.windows.net/a1b2c3d4-..."

This is a silent failure waiting to happen. The metastore UUID is environment-specific. In a multi-environment setup (dev/staging/prod), each environment would need a different UUID. Hardcoded means: works in one environment, silently wrong in others.

Fixed by making it a variable (var.metastore_id) passed via secrets.METASTORE_ID. The value is now explicit, environment-specific, and auditable. (A metastore UUID isn't genuinely sensitive — GitHub Variable would be the more precise choice over Secret. This was a convenience decision, not a security one.)

"Fixed" isn't fixed until it runs

The PR was merged. The issue was closed. The next CI run:

Error: Provider produced unexpected result
  The metastore storage root path is invalid or empty.

METASTORE_ID had been added as a GitHub Secret reference in the code — but the actual secret value was never populated in the repository settings. A non-existent secret silently resolves to an empty string at runtime, with no warning.

Closing a ticket is not evidence that the system works. The CI log is the only evidence.

This failure mode is specific to environment-specific configuration: the code change is correct, but the environment isn't. Unit tests and code review can't catch it. Only running the pipeline catches it.
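Because an unpopulated secret arrives as an empty string, a preflight check can at least turn the silent case into a loud one. A minimal sketch, assuming the secret is mapped to an environment variable of the same name in the step's env block; the function name `require_env` is hypothetical.

```shell
# Fail with a clear message if any named environment variable is unset or empty.
# Arguments are variable names, not values, so nothing sensitive is printed.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Required value $name is empty or unset" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A step that runs `require_env METASTORE_ID || exit 1` before terraform plan would stop with a message naming the missing value, instead of surfacing later as an invalid storage root path.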

Destroy order matters

Unity Catalog account-scope objects — specifically uc-mi-credential (Storage Credential) and uc-root-location (External Location) — survive Databricks workspace deletion. They are attached to the Databricks account, not to any specific workspace.

If you destroy workload-azure before workload-dbx, the workspace is gone but these UC objects remain. On the next workload-dbx apply, Terraform tries to create them again — and fails because they already exist in the account, in a partially-attached state.

Always destroy workload-dbx first. This is now documented in the README's Known Issues section. The correct sequence:

1. Destroy workload-dbx  (UC objects cleaned up cleanly)
2. Destroy workload-azure (Azure resources removed)
3. Destroy guardrails    (optional, budget alerts only)
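The sequence can be written down as a dry run so the order lives in one place. This is a sketch only: it prints the commands rather than executing them, the `infra/guardrails` path is an assumption modelled on the other layer directories, and the real workflows also need their backend configuration and -var flags.

```shell
# Layers in the order they should be destroyed: Databricks first, then Azure.
DESTROY_ORDER="workload-dbx workload-azure guardrails"

# Print the destroy commands in order without running them.
plan_destroy() {
  for layer in $DESTROY_ORDER; do
    echo "terraform -chdir=infra/${layer} destroy"
  done
}

plan_destroy
```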

Terraform state ≠ reality

The most humbling entry on this list.

During a destroy operation, terraform destroy on the metastore failed with:

Error: cannot destroy metastore: not empty: 1 catalog, 1 storage credential

But the catalog resource had been commented out of main.tf earlier and dropped from state. Was the catalog actually there? Was it an orphan from a previous apply before the commenting-out? Was the error message inaccurate about what was actually blocking the destroy?

I don't know. force_destroy = true on the metastore resource resolved it:

resource "databricks_metastore" "this" {
  force_destroy = true
  ...
}

But I cannot fully explain why the error occurred, because the post-cleanup state has no record of what the pre-cleanup state actually contained. Honest engineering means admitting when you can't confirm the root cause. What I know: force_destroy = true is now my default for metastores in environments I expect to destroy and recreate. What I don't know: whether the catalog object was genuinely present or whether the Databricks API error message was misleading.

`inputs.destroy` type confusion

GitHub Actions workflow_dispatch boolean inputs should be real booleans — and at runtime, they are. But the condition syntax matters:

# This never fires — string comparison against a boolean value
if: github.ref == 'refs/heads/main' && inputs.destroy != 'true'

# This works
if: github.ref == 'refs/heads/main' && inputs.destroy != true

The destroy step condition went through three PRs before settling. Each fix introduced a new subtle breakage:

PR 1: Changed != 'true' to == false — but == false doesn't match when the input is absent
PR 2: Changed to != true — correct, but the destroy step condition needed the same treatment
PR 3: Aligned both conditions to use boolean comparison consistently

Three PRs for one checkbox. This is what "simple" looks like in practice.

Cost surprise

Terraform destroy failures + lingering resources = money. The budget alert caught it at ¥4,000-6,000/month from resources left running after a failed destroy sequence. The guardrails layer existed precisely for this scenario.

The budget didn't prevent the cost — it just made it visible. Prevention required fixing the destroy order, adding force_destroy, and confirming that CI actually completed cleanly after each run.

Design Decisions I'd Make Differently

Taskfile integration

The intent was to wrap all Terraform operations in go-task commands, creating a consistent interface between local development and CI. CI calls task apply:workload-azure, and the Taskfile translates that into the right terraform invocation with the right backend config.

In a multi-team setup, this abstraction layer has real value: a data engineer can run task apply:workload-azure without knowing the exact backend config flags Terraform expects. For a single-person MVP, that consistency layer is overhead without a team to be consistent for. I made the deliberate choice to keep CI workflows calling Terraform directly, and to defer Taskfile integration until the team size justifies the abstraction cost.

`terraform init -upgrade` on every run

The workload-dbx workflow includes -upgrade on terraform init:

terraform -chdir=infra/workload-dbx init -upgrade \
  -backend-config=...

This re-resolves provider versions on every CI run. Useful during active development (avoids stale providers), but it adds latency and creates implicit coupling to Terraform registry availability on every apply. It's still in the code as of this writing. The planned fix: remove -upgrade from the default init and expose it as an optional workflow_dispatch boolean input — so a deliberate provider bump remains possible without making it the default on every run.
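The planned change could look roughly like the following. It is a sketch of the idea, not the current workflow: `init_args` and the `UPGRADE_PROVIDERS` variable are hypothetical names, with the variable standing in for whatever the workflow_dispatch input would be mapped to.

```shell
# Build the arguments for `terraform init`.
# $1 = "true" to include -upgrade; anything else (or nothing) leaves it out.
init_args() {
  if [ "${1:-false}" = "true" ]; then
    echo "init -upgrade"
  else
    echo "init"
  fi
}

# UPGRADE_PROVIDERS is a hypothetical variable set from the workflow input.
# The command is printed here; backend flags are omitted for brevity.
echo "terraform -chdir=infra/workload-dbx $(init_args "${UPGRADE_PROVIDERS:-false}")"
```

With this shape, push and pull_request runs, where the input is absent, fall through to a plain init.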

Single workflow for plan + apply

Currently terraform plan (on PR) and terraform apply (on merge) are steps within the same workflow job. Separating them — plan in one job, apply in another requiring explicit approval — would create a cleaner review gate: engineers approve the plan before apply runs, not just the code change.

This is a low-priority improvement for a solo project. It would matter on a team.

Production Considerations

This mock platform intentionally omits several things a production deployment requires. Not because they're unknown, but because they add cost and complexity that doesn't serve the portfolio purpose.

No private networking. GitHub-hosted runners operate from dynamic IPs. The workspace allows public endpoint access. In production, you'd choose between a static IP + allowlist or a self-hosted runner inside the VNet. Neither is free. See the README's network isolation section for the full decision tree.

No IP allowlist on the Databricks workspace. Related to the above. Public endpoint + no allowlist = accessible from anywhere. Acceptable for a mock environment. Unacceptable for production data.

No multi-environment. A single workspace, a single catalog, no dev/staging/prod separation. The README's architecture diagram shows the target state (three platform workspaces + one consumer workspace). This series documents the MVP, not the target.

No monitoring beyond budget alerts. Azure Monitor, Databricks audit logs, Unity Catalog lineage — all absent. The budget alert was the minimum viable guardrail against runaway costs.

What's Next

Part 2: Catalog/Schema Management with Jinja2

ADR-001 says Terraform owns the metastore, and Jinja2 + SQL owns the catalog and schema. Part 2 explains what that looks like in practice: how you parametrize DDL for multiple environments without Terraform, and why the Terraform-managed catalog resources in this codebase are all commented out.

Part 3: Job Authoring with Asset Bundles

Databricks Asset Bundles are the tool the data engineering team uses to deploy jobs. Part 3 covers how they fit into the CI/CD pipeline established in Parts 1 and 2, and what "idempotent ETL" actually means in a Unity Catalog environment.

Repository: https://github.com/nobhri/azure-dbx-mock-platform/tree/blog/cicd-part1

Nobuaki Hirai — Data Platform Architect / Data Engineer
