Why I Built a Mock Data Platform — and What the CI/CD Taught Me
A design journal for Terraform + Databricks + GitHub Actions: Part 1 — Workspace & Metastore
There is a gap between "I know Terraform" and "I've designed a platform." The first is a skill. The second is a pattern of thinking — about ownership boundaries, failure modes, team capabilities, and the decisions you have to make before you write a single line of HCL.
I built azure-dbx-mock-platform to close that gap in my own portfolio. Not as a tutorial, not as a getting-started guide. As a design journal: a record of the architectural decisions I made, the alternatives I rejected, and what broke along the way.
This is Part 1 of a 3-part series. It covers the infrastructure foundation: Azure provisioning, Unity Catalog Metastore, and the CI/CD pipeline that ties them together. Parts 2 and 3 will cover catalog/schema management with Jinja2 and job authoring with Asset Bundles.
The Architecture in One Diagram
The platform is organized into four independent layers, each with its own Terraform state, its own workflow, and its own blast radius:
Bootstrap
└── creates the tfstate backend (Storage Account + containers)
Guardrails
└── subscription-level budget alert
Workload-Azure
└── Resource Group, ADLS Gen2, Access Connector, Databricks Workspace
Workload-Databricks
└── Unity Catalog Metastore, Storage Credential, External Location
Within each workspace, the layer separation continues:
+------------------------------------------+
| Azure Layer (Terraform) |
| VNet · Storage · RBAC · Workspace |
+------------------------------------------+
| Databricks Account Layer (Terraform) |
| Metastore · Storage Credential |
+------------------------------------------+
| Catalog / Schema Layer (Jinja2 + SQL) |
| Environment-parametrized DDL |
+------------------------------------------+
| Job / Workflow Layer (Asset Bundles) |
| Idempotent ETL jobs |
+------------------------------------------+
Each layer has a different rate of change, different team ownership, and different failure blast radius. That last point is the one most people underweight.
Why four layers and not one?
A single Terraform workspace managing everything from the VNet to the catalog schemas creates hidden coupling that eventually breaks — usually in production. When a data engineer needs to add a schema, they shouldn't be touching the same state file as the VNet. When a destroy operation fails on a catalog object, it shouldn't put the entire Azure infrastructure in a locked state.
The four-layer separation enforces this at the tool level. A bug in workload-dbx can't corrupt workload-azure state. A destroy of the Databricks layer is a separate, intentional operation from destroying Azure resources. Blast radius is bounded by design, not by convention.
Terraform owns infra and metastore; catalog and schema are deliberately delegated to Jinja2 + SQL. The reason: a data engineer adding a schema shouldn't require Terraform expertise or infra team review. Tool ownership follows team ownership — the principle behind ADR-001, and the central theme of Part 2.
The production gaps this architecture knowingly accepts — no private networking, no multi-environment separation, no IP allowlist — are documented in Production Considerations below.
Authentication: Zero Stored Credentials
The single most important decision in this platform's CI/CD design: no stored secrets anywhere.
Every GitHub Actions workflow authenticates to Azure via OIDC federated identity. No service principal passwords. No client secrets. No credentials that rotate, leak, or get forgotten in .env files.
How it works:
GitHub Actions runner
→ presents OIDC token (signed by GitHub, scoped to this repo + branch)
→ Microsoft Entra ID validates the token against a configured federated credential
→ issues a short-lived access token for the Service Principal
→ Terraform uses this token via the azurerm provider
In the workflow file, it looks like this:
# .github/workflows/workload-azure.yaml
permissions:
  id-token: write # required for OIDC
  contents: read
jobs:
  tf:
    steps:
      - name: Azure login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
AZURE_CLIENT_ID is not a secret in the traditional sense — it's a non-sensitive identifier. The actual authentication is cryptographic, not credential-based.
ADR-002: Why not Service Principal secrets?
Secrets rotate. They get forgotten in GitHub Actions environment variables set months ago. They leak in CI logs when someone adds echo $ARM_CLIENT_SECRET for debugging. OIDC has no secret to manage, integrates natively with Entra ID, and scope is limited to the specific workflow and branch combination.
The OIDC subject gotcha
OIDC federated credentials in Azure require you to explicitly configure which GitHub Actions subjects are allowed. A subject is a string like:
repo:nobhri/azure-dbx-mock-platform:ref:refs/heads/main
repo:nobhri/azure-dbx-mock-platform:pull_request
The gotcha: pull_request events carry a different OIDC subject than push to main. If you configure your federated credential for refs/heads/main only, a PR-time terraform plan silently fails to authenticate. The workflow appears to run, but the plan is operating without valid Azure credentials.
This cost me a debugging session. The fix is to add a separate federated credential for the pull_request subject. A plan that silently runs without valid credentials produces output that looks correct but is meaningless, which makes it worse than a plan that fails loudly. Silent failure is a theme that returns in the failures section below.
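The mismatch is easy to check for once the subject format is written down. The following is a minimal sketch, not code from the repository: oidc_subject and subject_is_configured are hypothetical helper names, and only the two subject shapes shown above (branch ref and pull_request) are covered. Jobs that declare a GitHub environment get a different subject again, which this sketch ignores.

```shell
# Hypothetical helper: build the OIDC subject claim for a workflow run,
# limited to the two cases discussed above (branch push, pull request).
oidc_subject() {
  local repo="$1" event="$2" ref="$3"
  if [ "$event" = "pull_request" ]; then
    printf 'repo:%s:pull_request\n' "$repo"
  else
    printf 'repo:%s:ref:%s\n' "$repo" "$ref"
  fi
}

# Hypothetical check: succeed only if the subject (first argument) appears in
# the list of subjects registered as federated credentials (remaining arguments).
subject_is_configured() {
  local subject="$1"
  shift
  local allowed
  for allowed in "$@"; do
    if [ "$allowed" = "$subject" ]; then
      return 0
    fi
  done
  return 1
}
```

In a workflow step this could be called with the runner's own GITHUB_REPOSITORY, GITHUB_EVENT_NAME and GITHUB_REF variables, failing the job early when the computed subject is not among the ones registered in Entra ID.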
Dual provider pattern for Databricks
The Databricks Terraform provider requires two separate configurations for Unity Catalog operations: account scope and workspace scope.
# infra/workload-dbx/providers.tf
# Account-scope: for Metastore creation and workspace assignment
provider "databricks" {
alias = "account"
host = "https://accounts.azuredatabricks.net"
account_id = var.databricks_account_id
auth_type = "azure-cli"
azure_tenant_id = var.azure_tenant_id
}
# Workspace-scope: for Storage Credential and External Location
provider "databricks" {
alias = "workspace"
azure_workspace_resource_id = var.azure_workspace_resource_id
auth_type = "azure-cli"
azure_tenant_id = var.azure_tenant_id
}
Each resource in workload-dbx explicitly declares which provider it uses. The metastore and its workspace assignment use databricks.account. Storage credentials and external locations use databricks.workspace. Mixing these up produces cryptic errors about missing permissions — the account-scope provider can't see workspace-scope resources, and vice versa.
State Isolation
Three separate tfstate files. Three separate blob containers. One dedicated Storage Account created by Bootstrap.
# infra/bootstrap/main.tf
# Storage account for tfstate
resource "azurerm_storage_account" "tfstate" {
name = var.tfstate_sa_name
resource_group_name = azurerm_resource_group.tfstate.name
location = azurerm_resource_group.tfstate.location
account_tier = "Standard"
account_replication_type = "LRS"
min_tls_version = "TLS1_2"
}
# Separate containers per layer
resource "azurerm_storage_container" "guardrails" {
name = var.guardrails_container
storage_account_name = azurerm_storage_account.tfstate.name
container_access_type = "private"
}
resource "azurerm_storage_container" "workload" {
name = var.workload_container
storage_account_name = azurerm_storage_account.tfstate.name
container_access_type = "private"
}
Bootstrap itself uses local (ephemeral) state on the runner, a deliberate answer to the chicken-and-egg problem: you can't store Bootstrap's own state in a backend that doesn't exist yet. The trade-off: Bootstrap is not idempotent in the strict sense. With no saved state, a re-run tries to create storage resources that already exist, and fails until they are imported or manually reconciled. This is acceptable because Bootstrap runs exactly once, manually, via workflow_dispatch.
Concurrency control
Terraform state uses blob leasing for locking. If a workflow is cancelled mid-apply, the lease can remain, blocking future runs. Every workflow has two safeguards:
# Workflow-level: prevents parallel runs of the same workflow
concurrency:
  group: tf-workload-azure
  cancel-in-progress: true
# Step-level: breaks stale leases before init
- name: Preflight — break stale lease if any
  run: |
    az storage blob show ... --query "properties.leaseState" -o tsv \
      | grep -q leased && az storage blob lease break ... || true
# And on cancellation:
- name: Break lease if cancelled
  if: cancelled()
  run: az storage blob lease break ...
The concurrency key prevents two runs of the same workflow from racing. The preflight step handles leases left by external processes or previous failed runs. Both are necessary.
The Pipeline, Layer by Layer
bootstrap.yaml — one-time, manual only
bootstrap.yaml has exactly one trigger: workflow_dispatch. No path filters, no push triggers. This workflow exists to be run once, by a human, deliberately. Adding an automatic trigger would be actively dangerous — re-bootstrapping the tfstate backend against live state is how you lose it.
Bootstrap's state is local and ephemeral, for the chicken-and-egg reason covered under State Isolation. That makes the workflow non-re-runnable in the strict idempotent sense, which is acceptable because it runs exactly once.
guardrails.yaml — a non-obvious requirement in budget automation
The guardrails layer sets a subscription-level budget alert. The non-obvious detail: Azure Budget requires start_date to be the first day of a month, and a start date in the past must still fall within the current time-grain period. A static date baked into the workflow file would cause terraform apply to fail once that date fell outside the period. The fix is to calculate it at runtime:
- name: Set budget dates
  run: |
    echo "BUDGET_START=$(date -u +'%Y-%m-01T00:00:00Z')" >> "$GITHUB_ENV"
    echo "BUDGET_END=$(($(date -u +'%Y') + 1))-01-01T00:00:00Z" >> "$GITHUB_ENV"
A small detail, but the kind that causes silent failures if you miss it.
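To make the date arithmetic easier to reason about, here is a minimal sketch of the same calculation as a function that takes the year and month explicitly. budget_window is a hypothetical name, not something in the repository; the window it prints (first day of the given month through January 1 of the following year) mirrors the workflow step.

```shell
# Hypothetical helper: print the two budget dates for a given UTC year and month.
# Start is the first day of that month; end is January 1 of the following year.
budget_window() {
  local year="$1"
  local month="${2#0}"   # drop a leading zero so printf does not read "08" as octal
  printf 'BUDGET_START=%04d-%02d-01T00:00:00Z\n' "$year" "$month"
  printf 'BUDGET_END=%04d-01-01T00:00:00Z\n' "$((year + 1))"
}

# Example: the window for the current UTC month.
budget_window "$(date -u +%Y)" "$(date -u +%m)"
```

In a workflow the output would be appended to "$GITHUB_ENV" rather than printed, since each line is already in the KEY=value form that file expects.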
workload-azure.yaml — three decisions that matter
The Azure layer provisions Resource Group, ADLS Gen2, Access Connector, and Databricks Workspace. Three decisions worth flagging:
is_hns_enabled = true is required for Unity Catalog. A storage account with hierarchical namespace disabled looks identical from the outside but fails when UC tries to manage its root storage. In Terraform, changing this flag on an existing storage account forces a replacement, so forgetting it means re-creating the account.
SystemAssigned Managed Identity on the Access Connector ties the identity lifecycle to the resource itself. Simpler than UserAssigned: no separate identity to manage, no RBAC assignments that outlive the connector.
RBAC in Terraform: the Storage Blob Data Contributor role assignment from Access Connector to the storage account is declared as a Terraform resource. No manual portal assignment, no undocumented permission that silently disappears on re-create.
workload-dbx.yaml — cross-layer state reference
The most architecturally interesting pattern in this pipeline: workload-dbx reads outputs from the workload-azure state file before running its own Terraform. This is how workspace resource IDs and storage account names flow between independent layers without being hardcoded or duplicated:
- name: Init workload-azure backend to read outputs
  run: terraform -chdir=infra/workload-azure init -backend-config=...
- name: Capture workload-azure outputs
  id: azout
  run: |
    echo "WORKSPACE_RESOURCE_ID=$(terraform -chdir=infra/workload-azure output -raw workspace_resource_id)" >> "$GITHUB_OUTPUT"
    echo "ACCESS_CONNECTOR_ID=$(terraform -chdir=infra/workload-azure output -raw access_connector_id)" >> "$GITHUB_OUTPUT"
Two independent state files, no shared variables file, no hardcoded values. The Databricks layer consumes what the Azure layer produced, with the state file as the contract between them.
One more thing visible in workload-dbx/main.tf: the catalog and schema resources are entirely commented out. This isn't an incomplete implementation — it's a deliberate boundary. Catalog/schema management belongs to Jinja2 + SQL (ADR-001), and the commented code is the explicit record of where that boundary was drawn.
What Broke (and What I Learned)
This is the section that differentiates a portfolio from a tutorial. Real systems fail in specific ways.
Variable mismatch hell
When workload-dbx was first connected to CI, it produced five simultaneous Terraform errors:
Error: Missing required argument
on providers.tf line 23, in provider "databricks":
23: account_id = var.databricks_account_id
Error: Missing required argument
...
Five separate variables, all failing at once. The root cause: the -var flags in the workflow didn't match the variable names defined in variables.tf. Variable names had drifted during refactoring, but the workflow file wasn't updated in sync. Caught by code review, fixed in a single PR.
The lesson is not "be more careful." It's a CI design question: Terraform errors from misaligned -var flags all fail at plan time, not apply time. A CI pipeline that runs plan only on merge — not on PRs — turns what should be a code review comment into a production incident. This is why every workflow in this platform runs terraform plan on every pull request, not as documentation, but as the actual gate.
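The drift itself can also be caught before Terraform runs. The sketch below is an assumption about how such a check might look, not part of the repository: declared_vars, passed_vars and check_var_flags are hypothetical names, and the parsing is deliberately naive (it expects one variable "name" declaration per line and -var flags written as -var name=value or -var=name=value).

```shell
# Names declared in a variables.tf-style file.
declared_vars() {
  sed -n 's/^[[:space:]]*variable[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Names passed through -var flags anywhere in a workflow file.
passed_vars() {
  grep -oE -- "-var[= ]['\"]?[A-Za-z_][A-Za-z0-9_-]*=" "$1" \
    | sed -E "s/^-var[= ]['\"]?//; s/=\$//" | sort -u
}

# Fail when the workflow passes a variable that variables.tf does not declare.
check_var_flags() {
  local tf_file="$1" workflow_file="$2" status=0 declared name
  declared="$(declared_vars "$tf_file")"
  for name in $(passed_vars "$workflow_file"); do
    if ! printf '%s\n' "$declared" | grep -qx -- "$name"; then
      echo "undeclared variable passed via -var: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

This only covers one direction (flags with no matching declaration). Required variables that are declared but never passed still need terraform plan to surface them, which is the argument for running plan on every pull request.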
Hardcoded metastore UUID
In an early version of workload-dbx/main.tf, the storage root path included a hardcoded metastore UUID:
storage_root = "abfss://uc-root@mystorage.dfs.core.windows.net/a1b2c3d4-..."
This is a silent failure waiting to happen. The metastore UUID is environment-specific. In a multi-environment setup (dev/staging/prod), each environment would need a different UUID. Hardcoded means: works in one environment, silently wrong in others.
Fixed by making it a variable (var.metastore_id) passed via secrets.METASTORE_ID. The value is now explicit, environment-specific, and auditable. (A metastore UUID isn't genuinely sensitive — GitHub Variable would be the more precise choice over Secret. This was a convenience decision, not a security one.)
"Fixed" isn't fixed until it runs
The PR was merged. The issue was closed. The next CI run:
Error: Provider produced unexpected result
The metastore storage root path is invalid or empty.
METASTORE_ID had been added as a GitHub Secret reference in the code — but the actual secret value was never populated in the repository settings. A non-existent secret silently resolves to an empty string at runtime, with no warning.
Closing a ticket is not evidence that the system works. The CI log is the only evidence.
This failure mode is specific to environment-specific configuration: the code change is correct, but the environment isn't. Unit tests and code review can't catch it. Only running the pipeline catches it.
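Because an unset secret arrives as an empty string, the pipeline can at least refuse to continue when that happens. A minimal sketch, assuming the secret is exposed to the step as an environment variable; require_nonempty is a hypothetical helper, not code from the repository.

```shell
# Hypothetical guard: fail fast when a value that should come from a secret is empty.
# Takes the name (for the message) and the value; the value itself is never printed.
require_nonempty() {
  local name="$1" value="$2"
  if [ -z "$value" ]; then
    echo "required setting $name is empty; check that it is populated in the repository settings" >&2
    return 1
  fi
}
```

Called as require_nonempty METASTORE_ID "$METASTORE_ID" before terraform runs, this turns the silent empty string into an explicit failure with the variable's name in the log.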
Destroy order matters
Unity Catalog metastore-level objects, specifically uc-mi-credential (Storage Credential) and uc-root-location (External Location), survive Databricks workspace deletion. They belong to the metastore, which lives at the Databricks account level, not to any specific workspace.
If you destroy workload-azure before workload-dbx, the workspace is gone but these UC objects remain. On the next workload-dbx apply, Terraform tries to create them again — and fails because they already exist in the account, in a partially-attached state.
Always destroy workload-dbx first. This is now documented in the README's Known Issues section. The correct sequence:
1. Destroy workload-dbx (UC objects cleaned up cleanly)
2. Destroy workload-azure (Azure resources removed)
3. Destroy guardrails (optional, budget alerts only)
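The ordering can be enforced rather than just documented. The following is a sketch under stated assumptions, not the repository's implementation: can_destroy_azure is a hypothetical guard, and the caller is assumed to supply the number of resources still tracked in the workload-dbx state.

```shell
# Hypothetical guard: allow destroying workload-azure only when the
# workload-dbx state no longer tracks any resources.
# The caller supplies the resource count; how it is obtained is an assumption.
can_destroy_azure() {
  local dbx_resource_count="$1"
  if [ "$dbx_resource_count" -gt 0 ]; then
    echo "workload-dbx still tracks $dbx_resource_count resource(s); destroy it first" >&2
    return 1
  fi
}
```

One plausible source for the count is the number of lines printed by terraform state list for the workload-dbx layer. The guard only knows what the state file knows, so it does not protect against objects that exist in the account but not in state.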
Terraform state ≠ reality
The most humbling entry on this list.
During a destroy operation, terraform destroy on the metastore failed with:
Error: cannot destroy metastore: not empty: 1 catalog, 1 storage credential
But the catalog resource had been commented out of main.tf earlier and dropped from state. Was the catalog actually there? Was it an orphan from a previous apply before the commenting-out? Was the error message inaccurate about what was actually blocking the destroy?
I don't know. force_destroy = true on the metastore resource resolved it:
resource "databricks_metastore" "this" {
force_destroy = true
...
}
But I cannot fully explain why the error occurred, because the post-cleanup state has no record of what the pre-cleanup state actually contained. Honest engineering means admitting when you can't confirm the root cause. What I know: force_destroy = true is now the default for metastores in environments where you expect to destroy and recreate. What I don't know: whether the catalog object was genuinely present or whether the Databricks API error message was misleading.
inputs.destroy type confusion
GitHub Actions workflow_dispatch boolean inputs should be real booleans — and at runtime, they are. But the condition syntax matters:
# This guard never trips: a boolean is never equal to the string 'true', so != is always true
if: github.ref == 'refs/heads/main' && inputs.destroy != 'true'
# This works
if: github.ref == 'refs/heads/main' && inputs.destroy != true
The destroy step condition went through three PRs before settling. Each fix introduced a new subtle breakage:
PR 1: Changed != 'true' to == false, but == false doesn't match when the input is absent
PR 2: Changed to != true, which is correct, but the destroy step condition needed the same treatment
PR 3: Aligned both conditions to use boolean comparison consistently
Three PRs for one checkbox. This is what "simple" looks like in practice.
Cost surprise
Terraform destroy failures + lingering resources = money. The budget alert caught it at ¥4,000-6,000/month from resources left running after a failed destroy sequence. The guardrails layer existed precisely for this scenario.
The budget didn't prevent the cost — it just made it visible. Prevention required fixing the destroy order, adding force_destroy, and confirming that CI actually completed cleanly after each run.
Design Decisions I'd Make Differently
Taskfile integration
The intent was to wrap all Terraform operations in go-task commands, creating a consistent interface between local development and CI. CI calls task apply:workload-azure, and the Taskfile translates that into the right terraform invocation with the right backend config.
In a multi-team setup, this abstraction layer has real value: a data engineer can run task apply:workload-azure without knowing the exact backend config flags Terraform expects. For a single-person MVP, that consistency layer is overhead without a team to be consistent for. I made the deliberate choice to keep CI workflows calling Terraform directly, and to defer Taskfile integration until the team size justifies the abstraction cost.
terraform init -upgrade on every run
The workload-dbx workflow includes -upgrade on terraform init:
terraform -chdir=infra/workload-dbx init -upgrade \
-backend-config=...
This re-resolves provider versions on every CI run. Useful during active development (avoids stale providers), but it adds latency and creates implicit coupling to Terraform registry availability on every apply. It's still in the code as of this writing. The planned fix: remove -upgrade from the default init and expose it as an optional workflow_dispatch boolean input — so a deliberate provider bump remains possible without making it the default on every run.
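A minimal sketch of what the opt-in could look like on the shell side, assuming the workflow_dispatch input reaches the step as the string "true" or "false". init_args is a hypothetical helper and the input name is not defined anywhere in the repository yet.

```shell
# Hypothetical helper: build the init arguments from a boolean-like input.
# "true" opts in to a provider upgrade; anything else runs a plain init.
init_args() {
  local upgrade="$1"
  if [ "$upgrade" = "true" ]; then
    echo "init -upgrade"
  else
    echo "init"
  fi
}
```

The result would be spliced into the existing terraform -chdir=infra/workload-dbx invocation in place of the hardcoded init -upgrade, so push-triggered runs (where the input is empty) fall through to a plain init.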
Single workflow for plan + apply
Currently terraform plan (on PR) and terraform apply (on merge) are steps within the same workflow job. Separating them — plan in one job, apply in another requiring explicit approval — would create a cleaner review gate: engineers approve the plan before apply runs, not just the code change.
This is a low-priority improvement for a solo project. It would matter on a team.
Production Considerations
This mock platform intentionally omits several things a production deployment requires. Not because they're unknown, but because they add cost and complexity that doesn't serve the portfolio purpose.
No private networking. GitHub-hosted runners operate from dynamic IPs. The workspace allows public endpoint access. In production, you'd choose between a static IP + allowlist or a self-hosted runner inside the VNet. Neither is free. See the README's network isolation section for the full decision tree.
No IP allowlist on the Databricks workspace. Related to the above. Public endpoint + no allowlist = accessible from anywhere. Acceptable for a mock environment. Unacceptable for production data.
No multi-environment. A single workspace, a single catalog, no dev/staging/prod separation. The README's architecture diagram shows the target state (three platform workspaces + one consumer workspace). This series documents the MVP, not the target.
No monitoring beyond budget alerts. Azure Monitor, Databricks audit logs, Unity Catalog lineage — all absent. The budget alert was the minimum viable guardrail against runaway costs.
What's Next
Part 2: Catalog/Schema Management with Jinja2
ADR-001 says Terraform owns the metastore, and Jinja2 + SQL owns the catalog and schema. Part 2 explains what that looks like in practice: how you parametrize DDL for multiple environments without Terraform, and why the Terraform-managed catalog resources in this codebase are all commented out.
Part 3: Job Authoring with Asset Bundles
Databricks Asset Bundles are the tool the data engineering team uses to deploy jobs. Part 3 covers how they fit into the CI/CD pipeline established in Parts 1 and 2, and what "idempotent ETL" actually means in a Unity Catalog environment.
Repository: https://github.com/nobhri/azure-dbx-mock-platform/tree/blog/cicd-part1
Nobuaki Hirai — Data Platform Architect / Data Engineer