Terraform Deep Intuition

An experienced engineer's guide to Terraform

1. One-Sentence Essence

Terraform is a state machine that drives a set of cloud APIs toward a declared target, and the state file is the machine.

Almost everything that delights and infuriates you about Terraform follows from that one sentence. The .tfstate file is not a cache, not an optimization, not a “performance feature.” It is the entire memory of the system. Terraform’s view of your cloud is whatever is written in that JSON file. Your .tf configuration is the desired state. Your cloud account is the real state. The state file is what Terraform thinks the real state is. Plans, applies, drifts, deletions, apparent miracles, and 3am incidents all live in the gap between those three.

If you remember nothing else, remember this: Terraform doesn’t see your cloud. It sees its state file, refreshed against the cloud API through whatever the providers choose to look at. Everything in this document is a footnote to that line.


2. The Problem It Solved

Before infrastructure-as-code, “managing infrastructure” looked like one of two things, both bad.

The first was the AWS Console (or the equivalent for Azure, GCP, vSphere, whatever). Someone — usually the most experienced engineer, often at 11pm on a Friday — would click their way through a dozen forms to spin up a VPC, three subnets, a load balancer, an autoscaling group, and the IAM roles to glue them together. The infrastructure existed. Nobody could reproduce it. When that engineer left, the institutional knowledge of “why is this security group named bobs-test-pls-dont-delete” left with them. Staging didn’t match production. Disaster recovery was a religion, not a procedure.

The second was the previous generation of automation: Chef, Puppet, Ansible. These tools were configuration management, not provisioning. They were great at “make this Linux box look like that Linux box” — installing packages, managing users, dropping config files. They were terrible at “create a Linux box, attach it to this load balancer, give it this IAM role, route DNS to it.” They expected the box to already exist. They worked at the wrong altitude.

A handful of provisioning tools existed (CloudFormation came out in 2011: AWS-only, JSON-only at the time, painful), but nothing cloud-agnostic, nothing that gave you a unified language for AWS, Azure, GCP, Cloudflare, Datadog, PagerDuty, and your internal vault. HashiCorp released Terraform in 2014 with a deceptively boring pitch: write down what you want, in one language, and we’ll figure out the API calls to get there. Across any cloud.

The key insight wasn’t the language. It was the execution plan. Terraform showed you, in advance, what it was about to do. That single feature — terraform plan — is more responsible for Terraform’s dominance than HCL, the provider ecosystem, or HashiCorp’s marketing combined. For the first time, an engineer could say “I’m going to change production” and back it up with a diff that another engineer could review. It made cloud infrastructure look like code. Code review, version control, pull requests — the whole software-engineering hygiene story suddenly applied to your VPCs.

That’s the thing Terraform solved. Not “automation” — automation existed. Reviewable, declarative, multi-cloud provisioning with a preview step before you press the button.


3. The Concepts You Need

Terraform has its own vocabulary. You cannot reason about it — or read its error messages, or understand a single Stack Overflow answer — without these terms. Learn them once, properly, and the rest of the document will land.

The Code

  • HCL (HashiCorp Configuration Language) — the DSL Terraform configurations are written in. JSON-compatible, intentionally limited. Looks like a structured key-value language with blocks, expressions, and functions. Not a real programming language: no classes, no real loops in the procedural sense, no exceptions. We’ll come back to this.
  • .tf file — a file written in HCL. Terraform loads every .tf file in the current directory and merges them into one configuration. The filename is convention only; main.tf, network.tf, xyzzy.tf are all the same to Terraform. Subdirectories are not loaded automatically — they have to be brought in as modules.
  • Resource — the central noun. A declaration that says “I want one of these things to exist.” resource "aws_instance" "web" { ... } declares one EC2 instance. The resource has a type (aws_instance) and a name (web). The combined address — aws_instance.web — is how you refer to it elsewhere in code and how Terraform tracks it in state.
  • Data source — a read-only lookup. data "aws_ami" "latest" { ... } asks the cloud for the latest AMI matching some filter. Data sources don’t create anything; they retrieve information so other resources can use it.
  • Variable — an input parameter to a configuration or module. Declared with variable "foo" {}. Set via terraform.tfvars, -var flag, environment variables (TF_VAR_foo), or defaults.
  • Output — a return value from a configuration or module. Declared with output "bar" {}. The way modules expose results to their callers, and the way you extract values for humans or scripts.
  • Local value — a named expression, scoped to one configuration. Declared with locals { ... }. Like a variable in a programming language; not an input.

The Plumbing

  • Provider — a plugin that knows how to talk to one external API. The AWS provider talks to AWS, the Kubernetes provider talks to a Kubernetes API server, the Datadog provider talks to Datadog. Providers are separate binaries downloaded by terraform init. There are 4,800+ of them. The provider, not Terraform Core, knows what an aws_instance is.
  • Terraform Core — the binary you run as terraform. It parses HCL, builds the dependency graph, manages state, and orchestrates calls to providers via gRPC. Core is cloud-agnostic by design; it knows nothing about AWS or Azure.
  • Module — a collection of .tf files in a directory, treated as a reusable unit. The directory you run terraform in is the root module. Modules can call other modules. Modules are how you avoid copy-pasting the same VPC code five times.
  • Backend — where Terraform stores its state. The default is local (a terraform.tfstate file in your working directory). Production-grade backends include s3 (locked via a DynamoDB table or, in recent releases, an S3-native lock file), gcs, azurerm, remote (HCP Terraform), and others.
  • State / state file / terraform.tfstate — the JSON document tracking everything Terraform manages. Contains every resource’s address (e.g. aws_instance.web), the real cloud ID it maps to (e.g. i-0abc123...), and a snapshot of all known attributes. The most dangerous file in your repo, except you should never commit it to your repo.
  • State lock — a mutex on the state file. Prevents two engineers (or two CI runs) from applying simultaneously and racing each other into corruption.
  • Workspace — a way to maintain multiple state files for the same configuration. Often misused. We’ll discuss this in detail.

The Operations

  • terraform init — downloads providers, sets up the backend, fetches modules. The first command in any project; safe to re-run.
  • terraform plan — refreshes state against the cloud, compares to configuration, and prints what it intends to do. The single most important command in the system.
  • terraform apply — executes the plan. Will re-run the plan and ask for confirmation unless given a saved plan file.
  • terraform destroy — deletes everything in state. Useful in dev/test; terrifying in prod.
  • terraform refresh (deprecated; now apply -refresh-only) — re-syncs the state file with the actual cloud, without applying any configuration changes.
  • terraform import — pulls an existing cloud resource into state. The escape hatch for “we created this thing manually and now Terraform needs to know about it.”
  • Plan / apply cycle — the core workflow. You write code, you plan, you read the plan carefully, you apply. Doing anything else (especially skipping the read) is how production breaks.

The Building Blocks

  • count — a meta-argument that creates N instances of a resource. count = 3 creates three. Indexed numerically: aws_instance.web[0], [1], [2]. We’ll see later why this is a footgun.
  • for_each — a meta-argument that creates one instance per key in a map or set. Indexed by key: aws_instance.web["api"], ["worker"]. Almost always preferable to count.
  • depends_on — explicit dependency. Forces Terraform to wait for resource A before doing anything with resource B. A last resort; usually dependencies are inferred from references.
  • lifecycle — meta-block that controls how a resource is replaced or destroyed. Includes create_before_destroy, prevent_destroy, ignore_changes, replace_triggered_by. The serious knobs.
  • moved block — declares “this resource used to have a different address.” Terraform updates state without destroying the resource. Introduced in 1.1; the safe way to refactor.
  • removed block — declares “stop managing this resource.” Optionally destroys it, optionally leaves it alone. Introduced in 1.7.
  • import block — declarative version of terraform import. Introduced in 1.5. With -generate-config-out it can scaffold the HCL for you.
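
The moved and import blocks get worked examples in Section 4; the removed block does not, so here is a minimal sketch of it. The address aws_instance.legacy is hypothetical, and the block assumes the matching resource block has already been deleted from the configuration.

```hcl
# Stop managing aws_instance.legacy (hypothetical address) without touching
# the real instance. Requires Terraform 1.7 or later.
removed {
  from = aws_instance.legacy

  lifecycle {
    # false: forget the resource in state, leave the cloud object running.
    # true:  plan a destroy of the cloud object as well.
    destroy = false
  }
}
```

With destroy = false the plan should report that the resource will be removed from state but not destroyed, which is the declarative counterpart of terraform state rm.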

The State of the World (2026)

  • HCP Terraform — HashiCorp’s managed service (formerly “Terraform Cloud”). Hosts state, runs apply, enforces policy, provides a UI.
  • OpenTofu — the community fork of Terraform, maintained by the Linux Foundation. Forked from Terraform 1.5.6 after HashiCorp re-licensed under BSL in 2023. As of 2026, fully production-ready, drop-in compatible for the vast majority of configurations. Has shipped features Terraform doesn’t have (native state encryption, provider iteration). When this document says “Terraform,” everything applies to OpenTofu unless noted.

If any of those terms still feel fuzzy, re-read this section. The rest of the document is built on it.


4. The Distilled Introduction

This is the section that replaces the tutorials. Setup through real-world workflow, in order, with explanation of why each piece exists, not just what to type.

Setup

Install Terraform (or OpenTofu; tofu is a drop-in replacement for the terraform binary). On macOS: brew tap hashicorp/tap, then brew install hashicorp/tap/terraform; the plain brew install terraform formula has been frozen at 1.5.7 since the license change. On Linux: download the binary from the official site or use your package manager. On Windows: just use WSL. There is no daemon, no service, no agent: Terraform is one statically-linked Go binary that you run from your shell. That’s it.

You’ll need credentials for the cloud you’re targeting. Terraform doesn’t manage credentials; the provider does, using whatever mechanism that cloud uses. For AWS, that’s ~/.aws/credentials, environment variables (AWS_ACCESS_KEY_ID etc.), an EC2 instance profile, or — if you’re being grown-up about it — short-lived SSO/OIDC credentials. The Terraform AWS provider reads them the same way the AWS CLI does.

Make a directory. Create a file called main.tf. That’s a Terraform project.

The first apply

# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-app-logs-9f8e7d6c"
}

That’s a complete Terraform configuration. Walk through what each block does:

The terraform block declares constraints on Terraform itself: minimum version, required providers, and (later) the backend configuration. Always pin provider versions. If you don’t, terraform init will pull whatever’s latest, and a year from now your CI will break for no reason you can understand. ~> 5.0 means “anything ≥ 5.0 and < 6.0” — patch and minor updates fine, no major upgrades. This single line of paranoia has saved more pipelines than any other Terraform best practice.

The provider block configures the AWS provider — region, optionally profile, assume-role config, default tags, and so on. You can have multiple provider blocks (multi-region, multi-account) using alias.
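
As a rough sketch of the alias mechanism just mentioned: the alias name us_east and the replica bucket below are hypothetical, chosen only to show how a resource opts into a non-default provider configuration.

```hcl
# Default AWS provider configuration: used by any resource that does not
# name a provider explicitly.
provider "aws" {
  region = "eu-west-1"
}

# Second configuration of the same provider, distinguished by its alias.
provider "aws" {
  alias  = "us_east" # hypothetical alias name
  region = "us-east-1"
}

# This bucket is created through the aliased configuration, so it lands
# in us-east-1 instead of eu-west-1.
resource "aws_s3_bucket" "logs_replica" {
  provider = aws.us_east
  bucket   = "my-app-logs-replica-9f8e7d6c" # hypothetical name
}
```

Resources without a provider argument keep using the unaliased block, so adding an alias does not change anything that already exists.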

The resource block declares one S3 bucket. aws_s3_bucket is the type (defined by the AWS provider). logs is the local name (only meaningful inside this configuration). bucket = "my-app-logs-9f8e7d6c" is the actual cloud-side name. S3 bucket names are globally unique, so you’ll need your own.

Now run, in order:

terraform init    # Download the AWS provider, set up the backend.
terraform plan    # Show what would happen. Read this carefully.
terraform apply   # Re-show the plan, ask for "yes", then create the bucket.

The plan will say Plan: 1 to add, 0 to change, 0 to destroy. Apply prints a + next to every attribute it intends to set. After apply, a file called terraform.tfstate appears next to your main.tf. Open it. It’s JSON. You can see your bucket in there with its real ID and every attribute Terraform knows about.

If you run terraform plan again, it’ll say No changes. That’s idempotency: Terraform compared its state to your configuration, found them in sync, and refused to do anything. Run terraform apply and it does nothing. Run it ten times. Same result. This is the property that makes IaC valuable.

Adding more resources, and references

Add an instance:

resource "aws_instance" "web" {
  ami           = "ami-0a1b2c3d4e5f67890"
  instance_type = "t3.micro"

  tags = {
    Name       = "web"
    LogsBucket = aws_s3_bucket.logs.id
  }
}

The interesting line is aws_s3_bucket.logs.id. That’s a reference — “use the id attribute of the resource at address aws_s3_bucket.logs.” Terraform parses this and infers a dependency: the bucket must be created before the instance. You don’t write depends_on. You let the dataflow declare it.

This is how you wire infrastructure together. Subnets reference VPCs. Instances reference subnets. Load balancers reference target groups. Security group rules reference security groups. Almost every real-world Terraform graph is a web of resource attribute references that look exactly like the dependency graph an experienced engineer would draw on a whiteboard. We’ll see later (Section 5) why this is the foundational trick.
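
A minimal sketch of such a chain, assuming a VPC, one subnet, and one instance (the resource names and CIDR ranges are illustrative, not taken from a real setup):

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Referencing aws_vpc.main.id makes the subnet depend on the VPC.
resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Referencing aws_subnet.app.id makes the instance depend on the subnet,
# and therefore transitively on the VPC.
resource "aws_instance" "app" {
  ami           = "ami-0a1b2c3d4e5f67890"
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.app.id
}
```

The creation order (VPC, then subnet, then instance) falls out of the two references; nothing in the file states it explicitly, and the blocks could appear in any order.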

Variables, locals, outputs

Hardcoding the AMI is wrong. Hardcoding anything environment-specific is wrong. Real configurations parameterize:

variable "environment" {
  description = "deployment env (dev, staging, prod)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod."
  }
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

locals {
  name_prefix = "myapp-${var.environment}"
  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = var.instance_type
  tags          = merge(local.common_tags, { Name = "${local.name_prefix}-web" })
}

output "web_public_ip" {
  value = aws_instance.web.public_ip
}

variable declarations expose inputs. locals are computed values used inside this configuration only. data looks something up. output declares values to surface after apply (or expose to a parent module). merge() is one of Terraform’s built-in functions, and "${...}" is HCL’s string interpolation syntax. var.environment, local.name_prefix, data.aws_ami.amazon_linux.id, aws_instance.web.public_ip are address references.

Set the variable from the command line: terraform apply -var environment=dev. Or from a terraform.tfvars file:

# terraform.tfvars
environment   = "dev"
instance_type = "t3.small"

Or from environment variables: TF_VAR_environment=dev terraform apply. Or per-environment files: terraform apply -var-file=prod.tfvars. Pick one convention and stick to it.

Loops: count and for_each

You almost never want to hand-write three identical resources. Use for_each:

variable "users" {
  type = map(object({
    role = string
  }))
  default = {
    alice = { role = "admin" }
    bob   = { role = "developer" }
    carol = { role = "developer" }
  }
}

resource "aws_iam_user" "this" {
  for_each = var.users
  name     = each.key
  tags     = { Role = each.value.role }
}

for_each iterates over a map (or a set of strings) and creates one resource per key. Inside the block, each.key and each.value reference the current iteration. The resulting addresses are aws_iam_user.this["alice"], aws_iam_user.this["bob"], etc.

count is the older, simpler alternative:

resource "aws_instance" "worker" {
  count         = 3
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  tags          = { Name = "worker-${count.index}" }
}

Addresses are numeric: aws_instance.worker[0], [1], [2]. Almost always prefer for_each to count. The reason will get its own gotcha section (Section 7), but the short version: removing an item from the middle of a count list shifts every subsequent index, and Terraform interprets that as “change or replace every shifted resource, then destroy the last one.” for_each keys by name, so removing bob from the map only destroys bob.

The one place count shines: conditional creation. count = var.create_thing ? 1 : 0 is the idiomatic way to say “only make this if a flag is set.”
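
A sketch of that conditional pattern, assuming a hypothetical create_bastion flag and reusing the data.aws_ami.amazon_linux lookup defined earlier:

```hcl
variable "create_bastion" {
  description = "Hypothetical flag: create the bastion host only when true."
  type        = bool
  default     = false
}

resource "aws_instance" "bastion" {
  # Zero instances when the flag is false, exactly one when it is true.
  count         = var.create_bastion ? 1 : 0
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
}

output "bastion_public_ip" {
  # The resource is a list of zero or one elements; one() turns that into
  # either null or the single value.
  value = one(aws_instance.bastion[*].public_ip)
}
```

Because the resource is now a list, every reference to it has to go through an index or a splat, which is the small price of the pattern.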

Modules

When the same five resources appear in three configurations, extract them. A module is a directory of .tf files with variable declarations as inputs and output declarations as outputs. You call it like this:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.1"

  name = "main"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}

resource "aws_instance" "web" {
  subnet_id = module.vpc.private_subnets[0]
  # ...
}

The source can be a local path (./modules/vpc), a git URL (git::https://github.com/foo/bar.git//modules/vpc?ref=v1.2.3), or a registry reference like above. Pin the version. Always. The terraform-aws-modules collection on the public registry is excellent for common AWS patterns; you can build a working VPC in 20 lines instead of 200.

References across modules go via outputs: module.vpc.private_subnets[0] reads an output declared in the VPC module.
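
For a sense of what sits on both sides of a module call, here is a small sketch of a local module and its caller. The module path ./modules/log_bucket and all names in it are hypothetical; the two halves would live in separate directories, as the comments indicate.

```hcl
# --- modules/log_bucket/main.tf (hypothetical module) ---

# Input: the caller must supply a bucket name.
variable "name" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.name
}

# Output: the only value the caller can read from this module.
output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}

# --- main.tf in the root module ---

module "logs" {
  source = "./modules/log_bucket"
  name   = "my-app-logs-9f8e7d6c"
}

# module.logs.bucket_arn resolves to the output declared above.
output "logs_bucket_arn" {
  value = module.logs.bucket_arn
}
```

The resource inside the module ends up at the address module.logs.aws_s3_bucket.this, but the caller cannot reference it directly; only declared outputs cross the module boundary.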

Remote state and team workflow

The local terraform.tfstate file is fine for a single developer hacking on a side project. The moment you have a teammate or a CI pipeline, you need a remote backend. The canonical AWS setup:

terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "envs/prod/network.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

Now the state lives in S3 (encrypted, and versioned if you enable versioning on the bucket), and DynamoDB acts as the lock. Two teammates running terraform apply simultaneously: the second one fails with Error acquiring the state lock and has to re-run once the first finishes (or, if started with -lock-timeout=5m, keeps retrying until the lock frees up). No race, no corruption.

The chicken-and-egg problem: how do you Terraform-create the S3 bucket that holds your Terraform state? You don’t, on day zero. You create it manually (or with a tiny bootstrap config using local state), then add the backend block, then run terraform init, which detects the new backend and offers to migrate the local state to S3. After that, the bucket lives in its own state file, ideally in a separate config from anything else.
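
One possible shape for that tiny bootstrap config, offered as a sketch rather than a recommended layout. The bucket and table names match the backend block above; the resource labels tf_state and tf_lock are made up for illustration.

```hcl
# Bootstrap config: applied once with local state, before any backend block
# points at these resources.
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-org-terraform-state"
}

# Versioning keeps earlier copies of the state object, so a bad write can
# be rolled back.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a table whose partition key is a string attribute
# named LockID.
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Adding prevent_destroy to the bucket would be a reasonable extra guard, since losing it means losing every state file stored in it.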

OpenTofu has built-in state encryption. With Terraform proper, you rely on backend encryption (S3 SSE) and treat the state file like a private key — because it contains every value Terraform has touched, including secrets passed through it.

The PR workflow

The mature workflow looks like this:

  1. Engineer opens a branch, makes changes to .tf files.
  2. Opens a PR. CI runs terraform plan and posts the output as a PR comment.
  3. Reviewer reads the plan. Not the code — the plan. The code is the intent; the plan is what will actually happen.
  4. PR approved. Merge to main triggers terraform apply in CI, or a human runs apply against a saved plan file.

The critical step is reading the plan. A beginner says “the code looks fine” and merges. An experienced engineer reads Plan: 12 to add, 4 to change, 1 to destroy and says “wait, what’s the destroy?” That destroy is almost never what you wanted, and more often than not it’s a refactor that should have used a moved block.

Refactoring without destroying

When you rename a resource, Terraform’s default behavior is brutal: the old address disappears from config, the new address appears, and the plan says “destroy old, create new.” For an EC2 instance, fine. For an RDS database with three years of customer data, catastrophic.

The fix is the moved block:

moved {
  from = aws_instance.web
  to   = aws_instance.app_server
}

resource "aws_instance" "app_server" {
  # ... same config as before ...
}

Plan now says aws_instance.web has moved to aws_instance.app_server. No changes. The state file is updated; the cloud is untouched. Same trick works for moving a resource into a module, or for migrating from count to for_each. Leave the moved block in for at least one release cycle so all environments have applied it; then you can remove it.
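
The count-to-for_each migration can be sketched the same way. This assumes the resource previously had count = 2, and the keys "api" and "batch" are hypothetical names chosen for the two existing instances.

```hcl
# Map each old positional address to its new keyed address.
moved {
  from = aws_instance.worker[0]
  to   = aws_instance.worker["api"]
}

moved {
  from = aws_instance.worker[1]
  to   = aws_instance.worker["batch"]
}

# The resource itself now iterates over names instead of counting.
resource "aws_instance" "worker" {
  for_each      = toset(["api", "batch"])
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  tags          = { Name = "worker-${each.key}" }
}
```

The plan for this change should list two moves and, because the Name tag now derives from the key rather than the index, an in-place tag update on each instance. Nothing should be destroyed; if anything is, stop and check the from/to pairs.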

The import block (Terraform 1.5+) does the inverse: declare an existing cloud resource and bring it under management.

import {
  to = aws_instance.legacy
  id = "i-0123456789abcdef0"
}

resource "aws_instance" "legacy" {
  # ... config that matches the existing instance ...
}

With terraform plan -generate-config-out=imported.tf, Terraform will scaffold the resource block for you, provided you keep only the import block and leave the resource block out. Don’t trust the scaffold; review it line by line. Adopting brownfield infrastructure into Terraform is one of the harder things you’ll do, and import is the safe-ish entry point.

terraform destroy and lifecycle

terraform destroy deletes everything in the current state. Use it freely in dev. In prod, never. To protect specific resources:

resource "aws_db_instance" "primary" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}

prevent_destroy = true makes Terraform refuse any plan that would destroy this resource. You will reach for this for databases, S3 buckets with data, and anything else where “recreate” means “lose everything.” The cost: when you legitimately do want to destroy it, you have to remove this line first, which is a small tax on a real action. One caveat: the guard lives inside the resource block, so deleting the whole block from the configuration removes the guard along with it.

Other lifecycle settings:

  • create_before_destroy = true — for resources whose replacement would otherwise cause downtime (load balancers, certain DNS records). Terraform creates the new one first, swaps references, then deletes the old.
  • ignore_changes = [tags] — tells Terraform to leave drift on these attributes alone. Useful when something else (an autoscaler, a tagging policy) writes to them.
  • replace_triggered_by = [aws_security_group.web.id] — force replacement of this resource when some other resource changes. Niche but precious.
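
A sketch of how these three settings might sit together, assuming a security group in the default VPC and the data.aws_ami.amazon_linux lookup from earlier. The pairing is illustrative, not a recommended pattern.

```hcl
resource "aws_security_group" "web" {
  # name_prefix instead of name, so the replacement group can exist
  # alongside the old one without a name clash.
  name_prefix = "web-"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_instance" "web" {
  ami                    = data.aws_ami.amazon_linux.id
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.web.id]

  lifecycle {
    # Tags written by something outside Terraform are left alone.
    ignore_changes = [tags]

    # If the security group is replaced (new id), replace the instance too.
    replace_triggered_by = [aws_security_group.web.id]
  }
}
```

Note that ignore_changes = [tags] also means Terraform stops enforcing the tags you set in config after creation, so it is worth scoping it as narrowly as the situation allows.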

What you now know

That’s the working surface. Read those ten subsections again until they’re comfortable. Everything else in this document (mental model, gotchas, judgment calls, war stories) assumes you can write a small Terraform configuration, plan it, apply it, and explain to a colleague what each step is doing. If that’s true, you’ve replaced the tutorial. Now we go deeper.


5. The Mental Model

Four ideas. Hold these in your head and Terraform stops being mysterious. Forget any of them and you’ll be confused for years.

Core Idea 1: The state file is the source of truth, not the cloud.

Every operation Terraform runs is a comparison between three things: your configuration (what you wrote), the state (what Terraform thinks exists), and the cloud (what actually exists). The state is the authoritative middleman. When you run plan, Terraform reads the state, asks the cloud “is what I have in state still accurate?” via a refresh, then diffs the refreshed state against the configuration.

Crucially, anything that isn’t in the state file might as well not exist as far as Terraform is concerned. Manually created an S3 bucket in the console? Terraform doesn’t see it. Let an autoscaler add ten instances? Terraform doesn’t track them. Run terraform plan? It comes back clean. Your cloud is full of resources Terraform has no idea about, and Terraform is correct — they’re not in its state.

This predicts:

  • Drift is structural, not a bug. Anything that changes a resource outside Terraform creates a delta between state and reality. Terraform will not notice until you next plan. At that point it either plans to re-impose what the configuration says (if the attribute is set in config) or simply updates state to match (if the attribute isn’t set in config, or ignore_changes covers it).
  • Losing the state file means losing control. It does not mean losing the infrastructure — your VPC and your databases keep running. But Terraform now has zero memory of them. You have to either rebuild the state by repeated terraform import (potentially hundreds of resources, hours of work, easy to get wrong), or accept that this infrastructure is no longer Terraform-managed.
  • State is more important than your code. Code is in git, reviewed, reproducible. State is a snapshot of unique mappings to real cloud resources, often containing secrets, and cannot be regenerated. Treat it with the paranoia of a private key. Versioned remote backend, encrypted at rest, locked down by IAM. Always.
  • Two state files cannot manage the same resource. State is a 1:1 mapping. If two configurations both think they own the same EC2 instance, they will fight, and the loser will plan to “fix” the winner’s changes on every apply. Splitting infrastructure across state files (which you should do — see Idea 4) requires drawing clean ownership lines.
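
One common way to draw that ownership line is for one configuration to own the resources and publish outputs, while another only reads them. The sketch below assumes the network state lives at the S3 key shown earlier and that the network configuration declares an output named private_subnet_ids; that output name is hypothetical.

```hcl
# Read-only view of another configuration's state. This configuration can
# see the network outputs but does not manage any network resource.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "my-org-terraform-state"
    key    = "envs/prod/network.tfstate"
    region = "eu-west-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0a1b2c3d4e5f67890"
  instance_type = "t3.micro"

  # private_subnet_ids is an assumed output of the network configuration.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}
```

The subnet has exactly one owner (the network state), and the app configuration can only consume its id, so the two states never compete over the same resource.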

Core Idea 2: Plan and apply are graph walks over a DAG.

Terraform doesn’t execute your .tf files top-to-bottom. It builds a Directed Acyclic Graph where every resource is a node and every reference is an edge, and then it walks the graph in topological order. Resources with no dependencies are created first. Resources that reference them are created next. Resources that don’t depend on each other are created in parallel — by default, up to 10 at a time, controlled by -parallelism=N.

This is why filename order doesn’t matter, and why aws_s3_bucket.logs.id referenced from an instance creates the bucket before the instance with no depends_on needed.

This predicts:

  • The order of resources in your code is irrelevant. What matters is what references what. Refactoring the layout of .tf files cannot change behavior.
  • Apply is not atomic. If you have 50 resources to create and the 30th fails, the first 29 still got created. Terraform records what succeeded in state and exits. You fix the failure and re-apply; the 29 are skipped (already in state) and the 30th is retried.
  • Cycles are fatal. If A depends on B and B depends on A, Terraform refuses to plan. You have to break the cycle structurally (often by splitting one resource into two — e.g. create the security group first, then add ingress rules separately). The error message will name the cycle; terraform graph -draw-cycles will draw it.
  • Parallelism is bounded by the slowest dependency chain. If one big chain takes 12 minutes, your apply takes 12 minutes regardless of parallelism. Wide, shallow graphs apply fast. Deep, narrow graphs apply slow. This is why a thousand-resource VPC config can apply in 90 seconds while a 30-resource RDS-with-snapshot-restore takes 25 minutes.
  • depends_on exists for invisible dependencies. When a dependency isn’t expressed via attribute reference (e.g. an IAM policy that needs to exist before some unrelated resource can be created, but no attribute crosses between them), depends_on adds the edge manually. Use it sparingly — every depends_on is a missed opportunity to express the dependency through data.
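
The security-group case mentioned in the cycle bullet can be sketched like this. Two groups that reference each other through inline rules would form a cycle; declaring the groups bare and the rule as its own resource keeps the graph acyclic. Group names and the port are hypothetical.

```hcl
# Both groups are created first, with no rules and no references to each
# other.
resource "aws_security_group" "app" {
  name = "app"
}

resource "aws_security_group" "db" {
  name = "db"
}

# The rule is a separate node that depends on both groups. The edges run
# rule -> app and rule -> db, so there is no cycle.
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```

A matching rule in the other direction can be added the same way without reintroducing the cycle, because neither group resource ever references the other.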

Core Idea 3: Resources have addresses, addresses are how state works, and changing an address means destroying the resource.

A resource’s address is aws_instance.web or module.vpc.aws_subnet.public[0] or module.app["api"].aws_iam_role.this. The address is a string in the state file that maps to a real cloud ID. Terraform does not track resources by their cloud ID. It tracks them by address.

If you change the address — by renaming a resource, moving it into a module, switching from count to for_each, or just typing a different name — Terraform sees the old address vanish (delete!) and a new address appear (create!), even if the underlying configuration is identical. This is why renaming a resource by accident destroys it.

This predicts:

  • Refactoring is dangerous by default. The moved block (1.1+) was created precisely to fix this. Without it, every rename is a destroy/create.
  • count is fragile because positions are addresses. aws_instance.workers[2] is bound to whatever was at index 2 when state was last written. Removing index 1 from the list shifts index 2 to index 1, and Terraform thinks [2] got destroyed and [1] got modified.
  • for_each is sturdy because keys are addresses. aws_instance.workers["api"] is bound to whatever was named api. Removing worker from the map only affects worker’s entry. Names don’t shift.
  • terraform import is “create an address and bind it to an existing cloud ID.” The HCL is your scaffolding; the binding lives in state.
  • terraform state rm deletes the binding without touching the cloud. The resource keeps running, but Terraform forgets it. Used during refactors when you want to move a resource between state files (paired with terraform import on the destination).
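
To make the count versus for_each contrast concrete, here is the same list of workers declared both ways. The variable and resource names are hypothetical, and the AMI lookup is the one defined earlier.

```hcl
variable "worker_names" {
  type    = list(string)
  default = ["api", "batch", "cron"]
}

# Addresses are positions: by_index[0], by_index[1], by_index[2].
# Dropping "batch" from the list makes "cron" the new index 1, so the plan
# updates by_index[1] in place (its tag changes to cron) and destroys
# by_index[2], which is the instance that was actually serving cron.
resource "aws_instance" "by_index" {
  count         = length(var.worker_names)
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  tags          = { Name = var.worker_names[count.index] }
}

# Addresses are keys: by_name["api"], by_name["batch"], by_name["cron"].
# Dropping "batch" destroys by_name["batch"] and touches nothing else.
resource "aws_instance" "by_name" {
  for_each      = toset(var.worker_names)
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  tags          = { Name = each.key }
}
```

In the count version the wrong machine gets destroyed even though the plan summary (one change, one destroy) looks modest, which is why the plan has to be read address by address.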

Core Idea 4: Terraform Core knows nothing about your cloud. Everything cloud-specific lives in providers.

When you write resource "aws_instance" "web", Terraform Core has no idea what an aws_instance is. It just knows there’s a provider called aws that claims to handle resources of type aws_instance. Core asks the provider four questions, in order:

  1. What’s the schema of this resource type? (used to validate config)
  2. What does the existing resource look like, given this state? (refresh)
  3. Given this config and this current state, what changes would you make? (plan)
  4. OK, do it. (apply)

The provider answers all four. Core just orchestrates. The communication between Core and providers is gRPC over a local socket. Providers run as separate processes: terraform init downloads the binaries, and Core launches them at the start of each plan or apply.

This predicts:

  • Provider bugs are common, and provider quality varies wildly. The major providers (AWS, Azure, GCP, Kubernetes) are excellent. Long-tail providers can be old, half-maintained, or wrong. When a terraform plan output looks insane, the provider is at fault more often than Core.
  • Provider versions matter as much as Terraform versions. A new AWS provider release can change behavior, add deprecation warnings, change defaults, or — rarely — break existing configs. Pin your provider versions. Update deliberately.
  • The provider chooses what counts as drift. If AWS’s API returns an attribute you didn’t set, the provider decides whether to surface that as a diff. This is why some attributes appear in plans and others don’t. It’s a provider-author decision.
  • You can write custom providers. If your company has an internal API and you want it managed by Terraform, you write a provider. It’s a real Go project — non-trivial — but it’s the supported path for “Terraform manages our internal stuff too.”
  • The plan/apply gap exists because plan can’t always know. Sometimes a provider can’t tell you what an attribute will be until apply (e.g. a generated ID, a computed ARN). The plan shows (known after apply). This is honest; it’s the provider admitting it can’t simulate the API call.

These four ideas — state is truth, graph drives execution, addresses bind state to reality, providers do the work — are the entire mental model. Every gotcha, every judgment call, every weird error message you’ll ever see in Terraform connects back to one of them.


6. The Architecture in Plain English

Walk through a terraform apply end to end, in narrative form. This is what’s actually happening when you press the button.

You type terraform apply in a directory containing .tf files. The Terraform binary starts up.

Step 1: load. Terraform parses every .tf file in the working directory using the HCL parser. There’s no main file, no entry point — it slurps everything alphabetically and merges it into one in-memory configuration object. It validates that resource types and provider names are spelled correctly, that variable types match, and that no two resources share an address.

Step 2: init the backend. It reads the terraform block, finds the backend declaration (or defaults to local), and connects. For an S3 backend, it makes an API call to fetch the current state file from s3://your-bucket/your-key.tfstate. For HCP Terraform, it authenticates and pulls state via API. The state file is now in memory as a JSON document.

Step 3: acquire the lock. Before doing anything else, Terraform asks the lock backend (DynamoDB for S3, the same database for HCP) for an exclusive lock on this state. If someone else is already running, the call fails: Error acquiring the state lock. If you’re alone, the lock is yours until you exit.
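
Steps 2 and 3 are both driven by one block of configuration. A minimal sketch of an S3 backend with DynamoDB locking follows; the bucket name and key are hypothetical placeholders, and newer Terraform releases may steer you toward a different locking option, so check the backend documentation for your version.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-tf-state"     # hypothetical bucket name
    key     = "prod/network.tfstate" # hypothetical object key for this state
    region  = "us-east-1"
    encrypt = true                   # server-side encryption at rest

    # Lock table used in step 3. The table needs a string
    # partition key named LockID.
    dynamodb_table = "terraform-state-lock"
  }
}
```

Backend blocks cannot reference variables, which is one reason per-environment values are often supplied at init time instead.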

Step 4: launch providers. Terraform reads required_providers, finds the binaries that terraform init previously downloaded into .terraform/providers/, and spawns one process per unique provider configuration. (Two aws providers with different alias and different region get two processes; two with the same config share one.) Each provider exposes a gRPC server on a local socket. Terraform Core opens a connection to each.
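
As an illustration of "one process per unique provider configuration," here is a sketch of two aws configurations that differ by alias and region. The bucket name is a made-up placeholder.

```hcl
# Default configuration: used by any aws resource that does not
# name a provider explicitly.
provider "aws" {
  region = "us-east-1"
}

# Second configuration, distinguished by its alias.
provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

# This bucket is created through the aliased configuration.
resource "aws_s3_bucket" "replica" {
  provider = aws.west
  bucket   = "example-replica-bucket" # hypothetical name
}
```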

Step 5: build the graph. This is where the magic happens. Terraform walks the parsed configuration and creates a node for every resource, data source, module call, and provider configuration. Then it scans expressions for references — every time it sees aws_subnet.public.id, it adds an edge from aws_subnet.public to whatever block referenced it. Explicit depends_on adds more edges. The result is a DAG.

If you ran terraform plan -out=plan.tfplan and then terraform graph -plan=plan.tfplan, you’d see this graph rendered as DOT. It’s worth doing once. The graph for a real infrastructure has the shape you’d expect: VPCs at the bottom, subnets on top of them, route tables and gateways at the same level, then security groups, then instances and load balancers, all connected by the references in your code.
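
To make the edges from Step 5 concrete, here is a small sketch with three implicit dependencies (created by references) and one explicit one (created by depends_on). The variable name ami_id is an assumption for illustration.

```hcl
variable "ami_id" {
  type = string # hypothetical input; supply a real AMI ID
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id # reference: subnet depends on the VPC
  cidr_block = "10.0.1.0/24"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id # reference: gateway depends on the VPC
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id # reference: instance depends on the subnet

  # No attribute of the gateway is used here, so the edge has to
  # be declared explicitly.
  depends_on = [aws_internet_gateway.main]
}
```

The subnet and the gateway have no edge between them, so the graph walk is free to create them in parallel once the VPC exists.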

Step 6: refresh. For every resource currently in state, Terraform asks the relevant provider: “go ask the cloud what this thing looks like right now.” The provider makes an API call (e.g. DescribeInstances for an EC2 instance), takes the response, normalizes it into the schema, and returns it. Terraform updates the in-memory state to match. This is when drift becomes visible. If somebody manually changed your security group, the refresh discovers it.

(You can skip refresh with -refresh=false if you’re in a hurry and trust state. Almost never the right move.)

Step 7: plan. Terraform walks the graph and, for each resource, asks the provider: “given this config and this current state, what’s the diff?” The provider returns a structured diff: which attributes change, which need replacement (i.e. destroy-and-create rather than in-place update), and which are computed and unknown until apply. Terraform aggregates all the diffs into the human-readable plan you read on screen. Resources to add (+), to change (~), to destroy (-), to replace (-/+).

Step 8: confirm. Apply prints the plan and prompts “Enter a value:”; only a literal yes proceeds. This is your last chance to bail out. Read the plan. Look for unexpected destroys. Look at the destroy count. Anyone running apply and not reading the output is one typo away from an outage.

Step 9: apply, in graph order. Terraform performs a parallel topological walk of the graph. Up to 10 nodes are processed concurrently by default (tunable with -parallelism=n). For each node, it asks the provider to make the API calls (Create, Update, or Delete). The provider returns the new state of the resource on success, or an error on failure. Terraform updates the in-memory state with the new attributes immediately after each successful operation, not just at the end.

Step 10: persist state. As resources change, Terraform pushes state updates back to the backend. With S3, that means PutObject calls at intervals during the run (remote backends are written periodically rather than after every single mutation) and a final write when the run ends. With HCP, it’s API calls. Why incremental? Because if apply crashes halfway through, the persisted state already reflects nearly everything that succeeded. You can re-run apply and it picks up where it left off. State persistence during apply is the difference between “we lost track of 14 resources” and “we’ll just continue.”

Step 11: release the lock. Terraform releases the DynamoDB lock and exits. State is in S3, with versioning if you set that up (you did set that up, right?). The next person to run apply will pull the new state and start from there.

A few details worth knowing:

Where does state actually live during apply? In memory in the Terraform process, with periodic writes to the backend. If you kill -9 Terraform mid-apply, you lose any state changes that hadn’t been flushed yet, and — worse — you may leave the lock in place, requiring a manual terraform force-unlock. This is why “stuck locks” are a thing.

What does a provider actually do? It’s a Go binary that implements a gRPC interface defined by HashiCorp. The interface has methods like GetSchema, PlanResourceChange, ApplyResourceChange, ReadResource. The provider author writes a Go function for each of these, calling the underlying SDK (e.g. AWS SDK for Go) inside. So when Terraform asks “create this S3 bucket,” the AWS provider’s ApplyResourceChange function runs, builds an s3.CreateBucketInput, calls s3.CreateBucket, parses the response, and returns the resulting attributes to Core. Same shape for every cloud, every resource type.

Where do data sources fit? Data sources are read-only “resources” that the provider implements with a ReadDataSource method. They’re normally read during plan, so their values are known by the time apply runs; the exception is a data source whose arguments depend on values not known until apply, in which case the read is deferred to apply. They don’t appear in destroy plans. They’re how you import information from elsewhere (the latest AMI, an existing subnet, a secret from Vault) without managing it.

What about .terraform/? That’s the working-directory cache. It contains the downloaded provider binaries and a copy of any modules sourced from non-local paths (git, registry). The whole directory is throwaway; rm -rf .terraform && terraform init reconstructs it. Never commit it. The dependency lock file .terraform.lock.hcl (provider version selections and checksums) is a different matter: it sits next to .terraform/ in the working directory, not inside it, and it should be committed.

That’s the entire architecture. There’s no daemon, no server (HCP Terraform aside), no shared cluster. Every apply is a self-contained process that reads state, talks to providers, talks to the cloud, and writes state back.


7. The Things That Bite You

Eight gotchas. Each one has caused real production incidents. Each one connects back to the mental model.

Gotcha 1: count and the index shift

You wrote this:

resource "aws_instance" "worker" {
  count = length(var.workers)
  ami   = var.workers[count.index].ami
  tags  = { Name = var.workers[count.index].name }
}

var.workers was [{name="a"}, {name="b"}, {name="c"}]. State has aws_instance.worker[0], [1], [2]. Now you remove b. Plan says: [1] will be modified (its Name tag goes from “b” to “c”), [2] will be destroyed. Two resources affected, not one.

The intuitive model says removing b should destroy b. The actual model says removing the middle of a list shifts every later index, and Terraform tracks by index. The address aws_instance.worker[1] is bound to whichever object happens to be at position 1 when state is written. Because position is the address (Idea 3), shifting positions destroys identity.

Fix: use for_each with a map keyed by name. Then aws_instance.worker["b"] is bound to whatever’s named b, and removing b only affects b.
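
A sketch of the same workers expressed with for_each. It assumes var.workers is reshaped from a list into a map keyed by worker name, which is a change to the input, not just to the resource block.

```hcl
variable "workers" {
  # Keyed by worker name, e.g. { a = { ami = "..." }, b = { ami = "..." } }
  type = map(object({
    ami = string
  }))
}

resource "aws_instance" "worker" {
  for_each = var.workers

  ami  = each.value.ami
  tags = { Name = each.key } # the map key is the identity
}

# Optional: when migrating an existing count-based resource, a moved
# block per instance tells Terraform the old index and the new key
# are the same object, so nothing is destroyed and recreated.
moved {
  from = aws_instance.worker[0]
  to   = aws_instance.worker["a"]
}
```

With this shape, deleting the b entry from the map produces a plan that touches only aws_instance.worker["b"].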

When you have to use count (e.g. conditional creation with count = var.create ? 1 : 0), make sure the underlying list is order-stable. Don’t sort it. Don’t filter it. Don’t use it to “iterate over a map” — that’s literally what for_each is for.

Gotcha 2: Manual changes silently disappear on the next apply

The classic 2am incident: production is broken, an engineer opens the console, changes a security group rule to fix it, the fix works, everyone goes home. A week later someone runs terraform apply for an unrelated change. The plan looks fine on a quick scan; they hit yes. Production breaks again: Terraform reverted the security group rule, because the rule wasn’t in config and the plan therefore treated it as drift to be removed.

This is Idea 1 biting you: state is the source of truth. The console change put the cloud out of sync with state. Refresh detected it. Plan said “I will fix this drift.” The “fix” was destroying the manual change.

There are exactly two correct responses to a manual change in Terraform-managed infrastructure:

  1. Update the Terraform code to match what you did manually, then apply. The code now reflects reality. Drift gone.
  2. Revert the manual change. The cloud is back in sync with code. Drift gone.

Anything else — leaving the manual change and “remembering to fix it later” — is a ticking bomb. The only way to defuse it permanently is to make the pipeline the only path that touches managed resources. Easier said than done at scale.

Gotcha 3: Sensitive values in state are still in state

variable "db_password" { sensitive = true } and output "password" { sensitive = true } redact values from console output. They do not encrypt the value at rest in the state file. The state file is JSON; the password is sitting there in plain text.

Idea 1 again: state contains every value Terraform has touched. If your config takes a database password as a variable and feeds it to aws_db_instance.master_password, that password lives in state forever after.

This is why you always:

  • Use a remote backend that encrypts at rest (S3 with SSE, GCS with CMEK, OpenTofu with state encryption).
  • Lock the backend down with IAM as if it were a key vault.
  • Source secrets from a real secret manager (Vault, AWS Secrets Manager, GCP Secret Manager) using data blocks, so you’re not even passing them through variables.
  • Never, ever commit a state file to git.

The error mode where this matters most: someone clones the repo to “investigate,” runs terraform init against a dev backend, and inadvertently pulls a state file with prod secrets onto their laptop. Audit logs in S3 will at least let you spot it.

Gotcha 4: terraform plan shows green, terraform apply errors out

Plans and applies are different operations. Plan asks providers “what would you do?” — the provider can answer based on schema validation and what it knows from state, but it can’t always know what the API will accept. Apply makes the actual API call.

So plan can pass and apply can fail because:

  • The IAM principal had read perms but not write perms.
  • The cloud provider’s API rejected the request (quota, service limit, validation rule).
  • A required attribute was (known after apply) and turned out to be incompatible with another resource.
  • A race condition with another tool that’s also touching the same resource.
  • The provider’s local validation didn’t catch what the API enforces.

When this happens, the apply has typically partially succeeded. Read the error carefully, fix the cause, re-run apply. Terraform will skip what already worked and retry the rest. Don’t panic and don’t terraform destroy.

Gotcha 5: Stuck locks

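You started an apply. Your laptop’s wifi died. Or you hit Ctrl-C twice when you got scared (a single Ctrl-C asks Terraform to stop gracefully and release the lock; a second one kills it on the spot). Or the CI runner crashed. Now there’s a lock entry in DynamoDB and nobody can apply anything.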

terraform plan and terraform apply will fail with Error acquiring the state lock. The lock entry includes the username and a timestamp, so you can verify it’s stale.

Fix: terraform force-unlock <LOCK_ID>, where LOCK_ID is in the error message. Done. But: only do this if you’re certain no apply is actually running. Force-unlocking a live apply means two applies can run concurrently, and that’s how you get a corrupted state file.

The defense: in CI, use terraform apply -lock-timeout=10m so transient races get a chance to resolve. And teach everyone the difference between “Terraform is hung” (let it finish) and “Terraform is dead” (force-unlock).

Gotcha 6: Provider version drift across machines

You committed your .tf files but not .terraform.lock.hcl. (Or you committed it but didn’t pin provider versions.) Engineer A runs terraform init in March, gets AWS provider 5.20. Engineer B runs terraform init in May, gets AWS provider 5.45. The 5.45 release changed how some attribute is normalized. Plan from A: clean. Plan from B: 17 resources to update. Neither engineer is wrong; they’re using different providers.

Fix:

  • terraform init generates .terraform.lock.hcl. Commit this file. It records the selected provider versions and their checksums.
  • Pin version = "5.20.0" (exact) or ~> 5.20.0 (patch updates only) in required_providers. Note that ~> 5.20 is looser: it allows any 5.x release from 5.20 upward. Be deliberate about updates.
  • In CI, run terraform init -lockfile=readonly so CI fails if the lockfile would change. That forces version updates to be explicit code changes.
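
For reference, a minimal sketch of a pinned provider requirement. The version numbers simply reuse the ones from the story above.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"

      # Pessimistic constraint on the patch level:
      # allows >= 5.20.0 and < 5.21.0.
      # "~> 5.20" would instead allow any 5.x from 5.20 upward.
      version = "~> 5.20.0"
    }
  }
}
```

The constraint bounds what terraform init -upgrade may select; the committed lock file records what was actually selected.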

Gotcha 7: Workspaces are not for environments

terraform workspace new prod looks like the obvious way to manage dev/staging/prod from the same code. It is a trap.

Workspaces are a thin layer that gives you N copies of the state file from the same configuration directory. They are useful for ephemeral, structurally identical things — a per-developer sandbox, a feature-branch preview environment. They are bad for prod/staging/dev because:

  • The configuration is the same, so you can’t easily express prod differences (different instance sizes, different counts, different feature flags) without terraform.workspace == "prod" ? ... : ... ternaries scattered everywhere. Code becomes unreadable.
  • A typo on terraform workspace select deploys staging changes to prod. There is no second confirmation.
  • You can’t run different provider versions, different backends, different state encryption between workspaces.
  • The blast radius of a terraform destroy is the entire workspace. With separate directories, you can’t even target the wrong env without changing CWD.

The mature pattern is separate directories per environment, each with its own backend config and terraform.tfvars, calling shared modules:

envs/
├── dev/
│   ├── backend.tf
│   └── main.tf       # module "app" { source = "../../modules/app"; size = "small" }
├── staging/
│   └── ...
└── prod/
    └── ...
modules/
└── app/
    └── ...

Some duplication. Worth it. Your prod state file should not be one typo away.
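
A sketch of what the two files under envs/dev/ might contain. The bucket, key, and the size input are illustrative assumptions; size only works if modules/app declares a matching variable.

```hcl
# envs/dev/backend.tf
terraform {
  backend "s3" {
    bucket = "example-tf-state-dev" # hypothetical: one bucket or key per environment
    key    = "app/terraform.tfstate"
    region = "us-east-1"
  }
}

# envs/dev/main.tf
module "app" {
  source = "../../modules/app"

  size = "small" # prod would pass a different value from its own directory
}
```

Each environment directory is its own root module with its own state, so an apply run from envs/dev has no path to the prod state file.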

Gotcha 8: terraform destroy and prevent_destroy interaction

You added lifecycle { prevent_destroy = true } to your prod RDS instance. Good. Six months later, someone runs terraform destroy to clean up an old environment, hits prevent_destroy, and instead of just skipping the database, the entire destroy operation fails. The error is raised while the destroy plan is being built, so nothing is deleted at all: not the protected database, and not the other resources you actually wanted gone.

prevent_destroy doesn’t gracefully exclude a resource from destroy. It hard-errors the entire run when it sees a destroy planned for that resource. Usually that’s what you want — fail loud, don’t delete. But it does mean terraform destroy against a config with any prevent_destroy resource needs surgery to actually run cleanly.

The fix: when you genuinely want to destroy, remove the prevent_destroy line in code, plan, apply (which is a no-op for that resource), then destroy. Two-step.

Better defense: don’t use terraform destroy against prod-shaped state files at all. Use targeted destroys (terraform destroy -target=...) for individual resources, and prefer removed blocks (1.7+) for the “stop managing this” case.
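
A sketch of both mechanisms side by side. The resource names and bucket name are hypothetical, and the removed block assumes the corresponding resource block has already been deleted from the configuration.

```hcl
# Any plan that would destroy this bucket fails with an error.
resource "aws_s3_bucket" "audit_logs" {
  bucket = "example-audit-logs" # hypothetical name

  lifecycle {
    prevent_destroy = true
  }
}

# "Stop managing this" (Terraform 1.7+): the object is dropped from
# state on the next apply but left untouched in the cloud.
removed {
  from = aws_s3_bucket.legacy_scratch

  lifecycle {
    destroy = false
  }
}
```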


8. The Judgment Calls

Eight tradeoffs that an experienced engineer navigates in their head while writing Terraform. None of these has a universal right answer; the question is always “which way do the constraints point?”

Judgment Call 1: How fine-grained should state files be?

Option A: monolithic state. All your infrastructure in one state file, one root module. Easy to write, easy to reason about, single source of truth.

Option B: split per environment, per layer, or per service. One state file for networking, one for shared services, one per microservice, etc.

The novice answer is A because it’s simpler. The right answer is almost always B, for these reasons:

  • Blast radius. A typo in a monolithic state can destroy unrelated production services. Split state confines the damage.
  • Plan time. Monolithic state of 1,500 resources can take 10–15 minutes to refresh. Split states refresh in seconds.
  • Locking contention. With one state, only one engineer can apply at a time across the entire org. Split states let networking changes run in parallel with application changes.
  • Permissions. A networking team and a database team can have different IAM access to different state files. With monolithic state, everybody who can apply can apply anything.

The signal: split when you have more than ~5 engineers writing Terraform regularly, or when applies start taking >2 minutes, or when the destroy plan output exceeds what fits on one screen. The cost of split state is some duplication (each state needs its own terraform block, backend config) and the need for cross-state references (typically via terraform_remote_state data source or — better — explicit data lookups against the cloud).
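
The two styles of cross-state reference can be sketched as follows. The bucket, key, output name vpc_id, and the Name tag value are all assumptions for illustration; the remote-state variant only works if the networking state actually exports an output called vpc_id.

```hcl
# Style 1: read another state file's outputs.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-tf-state" # hypothetical
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Style 2: look the object up in the cloud directly, with no
# coupling to how (or whether) another state file describes it.
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-vpc" # hypothetical tag value
  }
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
  # or: vpc_id = data.aws_vpc.shared.id
}
```

Style 1 requires read access to the whole upstream state file, including any secrets in it, which is part of why the direct lookup is often preferred.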

Default: split by environment first (dev/staging/prod always separate states), then by ownership boundary (networking, security, data, app), then by service. Resist the urge to split too fine — managing 200 state files is its own nightmare.

Judgment Call 2: count vs for_each

The default answer is for_each, always. The exceptions:

  • Conditional creation of a single resource: count = var.enabled ? 1 : 0 is the canonical pattern. for_each doesn’t have a clean idiom for this.
  • Strictly homogeneous resources where you genuinely don’t care about identity: a fixed pool of three identical workers, none of which has any meaningful name. Even here, prefer for_each = toset(["worker-a", "worker-b", "worker-c"]).

If your resources have any kind of identity that survives over time — names, environments, regions, IDs — use for_each. Saves you from index-shift incidents and makes state output far more readable.
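
The conditional-creation idiom, sketched with a hypothetical log group. Because count turns the resource into a list, downstream references need to handle the zero-element case.

```hcl
variable "enabled" {
  type    = bool
  default = true
}

resource "aws_cloudwatch_log_group" "debug" {
  count = var.enabled ? 1 : 0 # zero or one instance

  name = "/example/debug" # hypothetical name
}

output "debug_log_group_arn" {
  # one() returns the single element, or null when the list is empty.
  value = one(aws_cloudwatch_log_group.debug[*].arn)
}
```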

Judgment Call 3: How much logic to put in HCL vs externalize

HCL has loops, conditionals, functions, types. It can do a lot. But it’s not a real programming language and the moment you start nesting for expressions inside merge() calls inside ternaries, your code becomes unreadable.

Option A: cram logic into HCL. Dynamic blocks, complex for expressions, conditional resources. Single source of truth.

Option B: generate Terraform from a higher-level tool. Terragrunt, CDK for Terraform, Pulumi (different tool, same idea), in-house code generation.

Option C: just write more Terraform. Repeat yourself a little. Make the duplication boring instead of clever.

The experienced choice is usually C until the duplication exceeds about 3-4 instances, then a careful application of B. Cramming logic into HCL is rarely the right answer above a small threshold — at that point you’re writing programs in a language designed for configuration and getting all the bugs that go with that decision.

The signal: if you can’t read your own HCL six months later, it’s too clever. If three engineers write HCL and only one of them understands the shared modules, the modules are too clever.

Judgment Call 4: Public modules vs roll your own

The Terraform Registry has thousands of modules. The popular ones (terraform-aws-modules/vpc/aws, etc.) are excellent — battle-tested, parameterizable, well-documented.

Use a public module when: you need a standard piece of infrastructure (VPC, EKS cluster, RDS instance with sensible defaults), the public module covers your use case with parameters, and your security/compliance team is OK with the upstream maintainers.

Roll your own when: your needs are unusual, you need to enforce internal policy (specific tags, specific naming conventions, specific encryption), or the public module’s surface area is way bigger than you need. Rolling your own usually starts as “wrap the public module with our defaults” and gradually replaces it as needs diverge.

The mistake to avoid: writing a module that’s a thin wrapper around a single resource. module "my_s3_bucket" { source = "./modules/s3"; name = "foo" } calling a module that creates one S3 bucket adds nothing but indirection. A module earns its place when it bundles 3+ related resources that share lifecycle, naming, and configuration.

Judgment Call 5: HCP Terraform vs self-hosted state

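HCP Terraform (HashiCorp’s managed service): state hosting, remote runs, RBAC, audit logs, policy as code (Sentinel), private module registry. Pricing is usage-based and has changed over the years (at the time of writing it is metered by managed resources rather than by workspace or user), so check the current terms.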

Self-hosted state in S3 + DynamoDB + GitHub Actions / CI: cheaper, more control, you own the infrastructure. No built-in RBAC beyond IAM, no run UI, no policy enforcement except what you build.

For small teams (<10 engineers), self-hosted is fine. For larger orgs, HCP buys you governance: audit logs that auditors will accept, run history you can search, policy as code that can stop bad applies before they happen. The price gets uncomfortable as you scale.

The 2026 wrinkle: HashiCorp is now IBM. License is BSL. Some organizations now run OpenTofu + a third-party platform (Spacelift, env0, Scalr) instead of HCP. These platforms offer most of what HCP does, often at better prices, with the OpenTofu engine. If you’re starting fresh in 2026, this combination is a strong default.

Judgment Call 6: Drift detection: scheduled or on-demand?

Drift will happen. The question is whether you find it on Tuesday morning when CI runs detection, or on Friday at 4pm when someone’s plan suddenly shows 17 unexpected changes.

Option A: scheduled drift detection. Cron job (or HCP’s drift detection feature) runs terraform plan -detailed-exitcode against every state file daily/hourly. Non-zero exit code triggers an alert.

Option B: on-demand only. People discover drift when they next plan.

A is the right answer for any prod-shaped infrastructure. The cost is real (you’ll get false positives from auto-scalers, from in-place AWS service updates, from things that ignore_changes should cover) but the alternative is discovering drift only when it’s already mixed up with your real changes. The experienced move is to invest a few hours in clean drift detection that’s quiet by default and loud when it should be: ignore_changes on attributes that legitimately drift (autoscaler counts, RDS minor versions), alerting on anything else.
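
A sketch of the "quiet by default" half: an autoscaling group whose desired_capacity is set once and then left to the autoscaler. The variable names and sizes are assumptions for illustration.

```hcl
variable "subnet_ids" {
  type = list(string) # hypothetical input
}

variable "launch_template_id" {
  type = string # hypothetical input
}

resource "aws_autoscaling_group" "workers" {
  name                = "workers"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2 # initial value only
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling activity changes this attribute legitimately, so it
    # should not show up as drift in scheduled plans.
    ignore_changes = [desired_capacity]
  }
}
```

Keep the ignore list narrow. Every attribute added to it is one more thing a scheduled plan can no longer warn you about.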

Judgment Call 7: When to import existing infrastructure

You inherited a cloud account full of stuff that wasn’t built with Terraform. Three options:

Option A: import everything into Terraform. Comprehensive, rigorous, long. Six weeks per engineer per environment is not unrealistic for a complex account. Every resource needs HCL written for it (manually or via import block with -generate-config-out), every import has to be plan-tested to confirm zero drift, every quirk of how the resource was originally built has to be encoded.

Option B: import only what you actively manage; let the rest live as drift outside Terraform’s view. Less rigorous, faster. Acknowledges that some legacy resources are unlikely to ever be touched again. Risk: when something breaks and you have to change it, you’ll have to import it under pressure.

Option C: rebuild greenfield in Terraform alongside, then cut over. Most aggressive. Often the right call for “we want to migrate from old AWS account to new AWS account anyway, while we’re at it.”

The signal: import is for the resources you’ll actively manage going forward. If a resource hasn’t changed in a year and probably won’t, importing it is busywork. Document it in a “known unmanaged” list and move on. Import the things that will be touched.

Judgment Call 8: How to handle secrets

Secrets and Terraform have an awkward relationship because of Idea 1 — anything Terraform touches lives in state. There are three patterns, in increasing order of safety:

Pattern A: variables. The secret is passed as a variable, possibly from TF_VAR_* env vars. Lives in state. Acceptable only if state is genuinely locked down.

Pattern B: data lookups. The secret lives in a real secret manager (Vault, AWS Secrets Manager, GCP Secret Manager). Terraform reads it via a data block. The value still ends up in state (because Terraform “knows” it), but the source of truth is elsewhere — rotation can happen without Terraform.

Pattern C: references that aren’t values. Pass an ARN or a secret-manager-key into Terraform, not the secret itself. Have the consuming resource (e.g. a Lambda function, an ECS task definition) fetch the secret at runtime by ARN. Terraform never sees the value. This is the right answer and the most awkward to set up.

A mature setup uses Pattern C for production secrets, Pattern B for read-only lookups (where the consuming resource genuinely needs the value, like an RDS master password at creation time), and Pattern A for nothing.
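
A sketch of Pattern C using an ECS task definition. The variable names, image, and container name are hypothetical; the point is that only the secret’s ARN passes through Terraform, and ECS resolves the value when the container starts, using the execution role.

```hcl
variable "db_password_secret_arn" {
  type = string # ARN of an existing Secrets Manager secret (assumed to exist)
}

variable "execution_role_arn" {
  type = string # role that is allowed to read that secret (assumed to exist)
}

resource "aws_ecs_task_definition" "app" {
  family             = "app"
  execution_role_arn = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "example/app:1.0" # hypothetical image
      memory    = 512
      essential = true

      # Injected as an environment variable at container start.
      # State records the ARN, never the secret value.
      secrets = [
        {
          name      = "DB_PASSWORD"
          valueFrom = var.db_password_secret_arn
        }
      ]
    }
  ])
}
```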


9. The Commands That Actually Matter

You’ve seen most of these in context. This is the quick-reference, grouped by task, with the flags experienced users actually reach for and why.

Setup and initialization

terraform init                          # download providers, init backend, fetch modules
terraform init -upgrade                 # also upgrade providers/modules within version constraints
terraform init -reconfigure             # ignore existing backend config and re-init
terraform init -backend-config=dev.hcl  # partial backend config; useful for per-env backends

-reconfigure is what you reach for when you’re switching backends and want Terraform to disregard the previously initialized backend settings entirely; it does not carry existing state across. If you do want the state moved (e.g. from local to S3), use terraform init -migrate-state instead. -backend-config= lets you keep the backend block in code mostly empty and pass per-environment values from a file or CLI arg.

The plan/apply loop

terraform plan                          # show what would happen
terraform plan -out=tfplan              # save the plan to a file
terraform plan -detailed-exitcode       # exit 0 = no changes, 2 = changes pending, 1 = error
terraform plan -target=aws_instance.web # plan only this resource (last resort)
terraform plan -refresh=false           # skip the refresh step (faster, riskier)
terraform plan -var-file=prod.tfvars    # use this var file
terraform apply tfplan                  # apply the saved plan exactly
terraform apply -auto-approve           # skip the yes prompt (CI only)

-out=tfplan is the most important pattern most engineers don’t use. Save a plan, review it, apply that exact plan. Without it, the apply re-plans, and the new plan can differ from what you reviewed (someone else’s changes might have landed in the meantime). The combination terraform plan -out=tfplan && terraform apply tfplan is the gold-standard CI workflow.

-detailed-exitcode is what scheduled drift detection runs on. Exit 2 means drift exists.

-target is a smell when used in normal operations. It’s there for emergency surgery, not regular use. If you find yourself reaching for -target daily, your state files are too coarse-grained — see Judgment Call 1.

State inspection and surgery

terraform show                          # human-readable view of state
terraform show -json                    # machine-readable JSON of state
terraform state list                    # list all resource addresses in state
terraform state show aws_instance.web   # detail one resource
terraform state pull > state.tfstate    # download remote state (for inspection)
terraform state push state.tfstate      # upload local state (DANGEROUS)
terraform state rm aws_instance.web     # remove from state without destroying
terraform state mv aws_instance.web aws_instance.app   # rename in state

terraform state list | grep something is one of the most common things you’ll do. Useful for “did I actually create that thing?”, “what’s in this state file?”, and “find every resource in this module.”

terraform state pull and push are the underlying mechanism for state surgery. pull is safe; push is how you corrupt remote state if you make a mistake. Almost never use push directly — let Terraform manage it.

terraform state rm and mv exist, but the modern (1.1+ / 1.7+) replacements are moved and removed blocks in code. Prefer those — they’re reviewable in PRs, idempotent across environments, and don’t require shell history archaeology to reconstruct.
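
For comparison, the declarative equivalent of the terraform state mv example above might look like this:

```hcl
# Equivalent of: terraform state mv aws_instance.web aws_instance.app
# The resource block itself must already be renamed to "app".
moved {
  from = aws_instance.web
  to   = aws_instance.app
}
```

The rename is carried out on the next plan and apply, in every environment that uses this code.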

Refresh and drift

terraform apply -refresh-only           # update state to match cloud, no other changes
terraform plan -refresh-only            # show what refresh would change

These replaced the old terraform refresh command. Use them when you suspect drift and want to see it without the risk of an apply doing anything else.

Imports and destruction

terraform import aws_instance.legacy i-0abc123...   # bring an existing resource under management
terraform plan -generate-config-out=imports.tf      # scaffold HCL for `import` blocks (1.5+)
terraform destroy                                   # destroy everything in state
terraform destroy -target=aws_instance.scratch      # destroy one specific thing
terraform apply -destroy                            # equivalent to `terraform destroy`

terraform import is the imperative form; the import block in HCL is the declarative form (1.5+) and the better choice in almost all cases. -generate-config-out produces a starting configuration; treat it like a beginner engineer’s first PR — it works, but read every line.
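
A sketch of the declarative form. The instance ID below is a made-up placeholder, not a real resource.

```hcl
# Terraform 1.5+: adopt an existing EC2 instance into state on the
# next apply.
import {
  to = aws_instance.legacy
  id = "i-0123456789abcdef0" # placeholder ID; substitute the real one
}
```

Run terraform plan -generate-config-out=imports.tf with this block in place to scaffold a matching resource "aws_instance" "legacy" block, then review and tidy the generated HCL before applying.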

Workspaces (use sparingly)

terraform workspace list                # show all workspaces
terraform workspace show                # show current workspace
terraform workspace new <name>          # create
terraform workspace select <name>       # switch

Useful for ephemeral / per-developer environments. Not for prod/staging/dev (Gotcha 7).

Validation and formatting

terraform fmt                           # auto-format .tf files
terraform fmt -recursive                # whole tree
terraform fmt -check                    # exit nonzero if any file would change (CI)
terraform validate                      # check syntax and types without contacting any backend
terraform graph                         # output dependency graph in DOT
terraform graph -type=plan | dot -Tpng > graph.png   # render visually

terraform fmt -check and terraform validate should be the cheapest steps in your CI. Run them on every commit. They don’t talk to the cloud and they catch the boring class of errors that should never reach human review.

Force-unlock (use very sparingly)

terraform force-unlock <LOCK_ID>        # ONLY when you're certain no apply is running

If you have to do this more than once a quarter, your CI is flaky in a way that’s worth fixing.


10. How It Breaks

Failure modes you will see, with how to debug them.

Failure 1: “Error acquiring the state lock”

Symptoms: every plan/apply hangs and eventually errors out. Lock info names someone (or some CI run).

Root cause: a previous apply died without releasing the lock. Connects to architecture step 11.

Diagnose: read the error — it tells you the lock ID, who took it, when. Verify nothing is actually running (check CI, check colleagues’ shells). For DynamoDB locks, you can read the lock entry directly: aws dynamodb get-item --table-name terraform-state-lock --key ....

Fix: terraform force-unlock <LOCK_ID>. Then investigate what killed the original run — usually a CI timeout, a dropped network connection, or a Ctrl-C. If your team hits this regularly, lengthen CI timeouts and stop force-killing applies.

Failure 2: state file corruption

Symptoms: terraform plan errors out with “state snapshot was created by Terraform v1.x.x, which is newer than current v1.y.y” or “JSON parse error” or weirder.

Root cause: usually a partially-written state push (network failure during apply), an explicitly broken terraform state push, or someone editing the JSON by hand.

Diagnose: terraform state pull > current.tfstate; cat current.tfstate | jq . will tell you if it’s valid JSON. Check S3 versioning history for previous versions.

Fix: roll back to the last known-good version from S3 versioning. If you don’t have versioning enabled… well, now you know why versioning was non-negotiable. Worst case: rebuild state via terraform import for every resource (hours to days of work).

This is why you treat state as “the most dangerous file in the system.” Versioning, encryption, restricted access. Always.
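A minimal sketch of what "versioning, encryption, restricted access" can look like for an S3 state bucket, assuming the AWS provider v4 or later. The bucket name is a placeholder, and in practice this usually lives in a small bootstrap configuration separate from the state it protects.

```hcl
# Sketch only: the bucket name is a placeholder, substitute your own.
resource "aws_s3_bucket" "tfstate" {
  bucket = "myorg-tfstate-prod"

  lifecycle {
    prevent_destroy = true # refuse any plan that would delete the state bucket
  }
}

# Keep every prior state snapshot so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest; state can contain secrets.
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Block every form of public access to the bucket.
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```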

Failure 3: dependency cycle

Symptoms: Error: Cycle: aws_security_group.web, aws_security_group_rule.allow_db, aws_security_group.db

Root cause: A references B references A. Connects to Idea 2.

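Diagnose: terraform graph -type=plan -draw-cycles | dot -Tpng > cycle.png. The cycle is highlighted. (Recent Terraform versions only honor -draw-cycles when a graph type is selected with -type.)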

Fix: structurally break the cycle. The classic case is two security groups with rules that reference each other. The fix is to define the security groups without their rules first, then define the cross-referencing rules in separate aws_security_group_rule (or aws_vpc_security_group_ingress_rule) resources. The rule resources can reference both SGs because they’re created after both SGs exist.
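The fix described above could be sketched roughly as follows. The tier names, var.vpc_id, and port 5432 are assumptions made for illustration.

```hcl
# Hypothetical two-tier setup; var.vpc_id, the names, and port 5432 are assumptions.
variable "vpc_id" {
  type = string
}

# Step 1: the groups themselves, with no inline ingress/egress blocks.
resource "aws_security_group" "web" {
  name_prefix = "web-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "db" {
  name_prefix = "db-"
  vpc_id      = var.vpc_id
}

# Step 2: rules as standalone resources. Each rule depends on both groups,
# but neither group depends on the other, so the graph has no cycle.
resource "aws_vpc_security_group_ingress_rule" "db_from_web" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.web.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

resource "aws_vpc_security_group_egress_rule" "web_to_db" {
  security_group_id            = aws_security_group.web.id
  referenced_security_group_id = aws_security_group.db.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```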

Failure 4: provider plugin crashed

Symptoms: “Error: The terraform-provider-aws_v5.x.x plugin crashed!” with a stack trace.

Root cause: a bug in the provider, almost always exposed by an unusual input or a corner-case API response.

Diagnose: check the provider’s GitHub issues for your error or stack trace. Often it’s a known bug with a workaround.

Fix: usually one of three things. (1) Upgrade or downgrade the provider — version = "~> 5.30" to version = "5.28.0". (2) Restructure the config to avoid the bug — sometimes a dynamic block triggers it where a static block doesn’t. (3) File a bug if it’s new. The major providers have responsive teams.
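Option (1) might look like the sketch below. The version number simply mirrors the one mentioned above and is not a recommendation of any particular release.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Exact pin to step around a crashing release; go back to a range
      # constraint such as "~> 5.30" once a fixed version ships.
      version = "5.28.0"
    }
  }
}
```

After changing the constraint, terraform init -upgrade re-resolves the provider and updates .terraform.lock.hcl, which should be committed along with the change.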

Failure 5: unexpected destroys in plan

Symptoms: Plan: 0 to add, 1 to change, 12 to destroy. Twelve was not in your spreadsheet.

Root cause: you renamed something. Or you removed an item from a count list. Or you switched count to for_each without a moved block. Or you accidentally deleted a chunk of code.

Diagnose: scroll through the plan and read every # X will be destroyed line. Are they things you actually removed from the configuration? If the names look correct but the resources should still exist, you’ve hit Idea 3: the addresses changed, and Terraform sees that as destroy/create.

Fix: stop. Don’t apply. Add moved blocks to update state addresses without destruction. Re-plan. The plan should now show 0 to destroy. Then apply.
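For illustration, moved blocks for the two most common causes (a rename, and a count to for_each migration) might look like this. All resource names and keys here are hypothetical.

```hcl
# Hypothetical rename: the resource label changed from "server" to "app".
moved {
  from = aws_instance.server
  to   = aws_instance.app
}

# Hypothetical count -> for_each migration: map each old index to its new key.
moved {
  from = aws_iam_user.team[0]
  to   = aws_iam_user.team["alice"]
}

moved {
  from = aws_iam_user.team[1]
  to   = aws_iam_user.team["bob"]
}
```

With these in place, the plan should report the addresses as moved rather than as a destroy paired with a create.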

The 5-second rule for plans: every line that starts with - (will be destroyed) needs to be intentional. If you can’t immediately explain why each - is there, do not type “yes.”

Failure 6: drift overwriting your manual fix

Symptoms: “I fixed this last week and now it’s broken again.” Production was patched manually, the next apply reverted it.

Root cause: Idea 1. Connects to Gotcha 2.

Diagnose: terraform plan -refresh-only will show you the drift between state and reality. The diff is the manual change.

Fix: choose one. Accept the manual change by updating the Terraform code to match and applying, or reject it by leaving the code unchanged (the next apply will revert it). Then close the loop on whatever process let the manual change happen in the first place.

The general debugging workflow

When something looks weird:

  1. terraform validate — does the code even parse? (Cheap; catches dumb stuff.)
  2. terraform fmt -check — is anything obviously wrong with whitespace? (Sometimes copy-paste creates parse errors.)
  3. terraform state list | grep <thing> — is the thing you expected in state?
  4. terraform state show <addr> — what does Terraform think the resource looks like?
  5. terraform plan -refresh-only — has the cloud diverged from state without us noticing?
  6. TF_LOG=DEBUG terraform plan 2> debug.log — what’s actually happening underneath? (Verbose, useful when nothing else is.)
  7. terraform graph -type=plan | dot -Tpng > g.png — what does Terraform think the dependency structure is?
  8. Check the provider’s GitHub issues for your symptom.

Steps 1–3 catch 90% of issues. Steps 4–6 catch most of the rest. Step 7 is for cycle/ordering weirdness. Step 8 is for “this looks like a bug, not a misconfig.”


11. The Downsides

After all the praise: here’s what you’re signing up for. None of this means Terraform is a bad choice. It means walk in with eyes open.

Downside 1: state is a critical, fragile, single-instance resource

Where it comes from: Idea 1. State has to exist for the system to work. It cannot be regenerated. It can be locked, but only one operation can run against it at a time.

Cost in practice: state corruption causes hours-to-days outages. State loss is a multi-week recovery exercise. State file growth (1,500+ resources in one file) makes plans take 10–15 minutes. Lock contention serializes your team’s deployments: Engineer A holds the lock for 8 minutes, everyone else waits. State problems are among the most commonly reported pain points once an organization grows past a few dozen engineers.

When it’s a dealbreaker: it isn’t, exactly — but if you’re at the scale where you need many engineers deploying many times per day, you have to invest in state design (split state, automated backup verification, drift detection) as if it were a production database. Because it is.

What people think mitigates it but doesn’t: workspaces. Each workspace gets its own state (and its own lock), but all workspaces still share the same configuration and backend, and each one remains a single critical section. Splitting into separate root modules with separate state files is the only real fix.

Downside 2: HCL is a configuration language pretending to be a programming language

Where it comes from: HCL was designed for static, declarative resource description. Over time, real users needed loops, conditionals, transforms — so HashiCorp added them, but as expressions rather than statements. The result is a language that can compute things but feels like writing a program inside a JSON file.

Cost in practice: complex expressions (nested for, merge, ternaries, type-juggling between maps and objects) become unreadable fast. There’s no debugger. There’s no print statement; the closest substitutes are evaluating expressions in terraform console or adding a temporary output and reading the plan, and both are clumsy. Type errors surface late and confusingly. Refactoring complex modules is genuinely hard. Tooling has improved (the official VS Code extension is decent now) but it’s still nowhere near what a Java or TypeScript developer takes for granted.

When it’s a dealbreaker: when your team is mostly application engineers who don’t accept the “infrastructure code is different” framing. They will reach for Pulumi or AWS CDK so they can write infrastructure in Python or TypeScript and use the same testing, debugging, and refactoring tools they use for app code. That’s a legitimate choice.

Downside 3: drift is structural and Terraform doesn’t track resources it didn’t create

Where it comes from: Idea 1. Terraform tracks what’s in state. It does not scan your cloud for resources it doesn’t know about.

Cost in practice: out-of-band changes (manual fixes, autoscalers, service-managed updates like RDS minor version patches, ECS task count changes) silently desync state from reality. A clean terraform plan does not mean your cloud is what you think it is — it means your cloud matches what’s in state. Terraform-managed resources can drift; non-Terraform-managed resources are completely invisible. AWS Config, custom drift detection (Spacelift, env0, scheduled plan -detailed-exitcode), or third-party tools are required to close this gap, and none of them are free.

When it’s a dealbreaker: in highly regulated environments where you need definitive answers to “is anything in our cloud not in compliance?” Terraform alone cannot answer that question — it answers “of the things I know about, here’s what I think.” For full coverage, layer AWS Config / Azure Policy / GCP Asset Inventory underneath.

Downside 4: secrets in state forever

Where it comes from: state contains every value Terraform has touched. Sensitive markers redact display, not storage.

Cost in practice: every state backend needs to be treated like a vault. Engineers who casually aws s3 cp state files to their laptops for debugging are exfiltrating secrets. Every terraform state pull to stdout potentially leaks. Backups of state in dumb places (a build artifact bucket without proper IAM) are leaks. OpenTofu’s native state encryption helps; with stock Terraform, you rely entirely on the backend’s encryption-at-rest.

When it’s a dealbreaker: rare — most teams accept the locked-down-state-file model. But if your security team requires that infrastructure tooling never touch raw secrets at all, you have to architect every secret as a runtime fetch by reference (Pattern C in Judgment Call 8), which is genuinely more complex.
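To make "sensitive markers redact display, not storage" concrete, here is a minimal sketch with a hypothetical database. Every name and size below is a placeholder.

```hcl
# Hypothetical database; names and sizes are placeholders.
variable "db_password" {
  type      = string
  sensitive = true # hides the value in plan/apply output only
}

resource "aws_db_instance" "app" {
  identifier          = "app-db"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = var.db_password # still written to state in plain text
  skip_final_snapshot = true
}
```

The plan shows the password as (sensitive value), but anyone who can read the state file can read the password.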

Downside 5: provider quality is uneven and provider versions are independent

Where it comes from: Idea 4. Providers are independently versioned, independently maintained, and the major providers are written by HashiCorp + cloud vendor teams while the long tail is community-maintained.

Cost in practice: AWS, Azure, GCP, Kubernetes are excellent. But you’ll inevitably want to manage something more obscure — an internal tool, a niche SaaS, a bleeding-edge cloud feature — and discover that the provider is six versions behind, or doesn’t expose half the API, or has bugs that have been open for two years. You’ll either work around the gap (manual changes outside Terraform — see Downside 3) or build a custom provider, which is a Go project in its own right.

When it’s a dealbreaker: when a critical part of your infrastructure depends on a tool whose Terraform provider is stale or absent. You’re then choosing between maintaining a fork of a provider, building your own, or accepting that this thing isn’t IaC-managed.

Downside 6: the BSL license complicates platform building

Where it comes from: HashiCorp’s 2023 license change from MPL to BSL.

Cost in practice: end-user organizations can use Terraform freely for internal infrastructure. Companies building products on top of Terraform face restrictions — explicitly, “competitive” use is forbidden. The exact boundaries are gray and depend on lawyers’ interpretations. This was the trigger for the OpenTofu fork. As of 2026, OpenTofu is mature enough that most platform vendors (Spacelift, env0, Scalr) have moved to OpenTofu specifically because of this. IBM’s acquisition of HashiCorp added another layer of vendor concentration concerns for some organizations.

When it’s a dealbreaker: if you’re a platform company, almost certainly. If you’re an end user, probably not — but the long-term direction of Terraform is now explicitly commercial-IBM, while OpenTofu’s is community-Linux-Foundation. The cultural difference matters over a five-year horizon.

Downside 7: refactoring large configurations is genuinely difficult

Where it comes from: Idea 3 — addresses are sticky. Combined with the inflexibility of HCL, splitting a 5,000-line root module into a clean module structure is a multi-week project, full of moved blocks, careful migrations, and per-environment apply windows.

Cost in practice: you’ll write your initial Terraform code in a month and live with its structural decisions for years. The longer it lives, the more painful refactoring becomes. Splitting a state file is even harder than splitting code — there’s no moved block that crosses state files; you need terraform state mv with -state-out, or removed+import block dances. Teams often live with structurally bad Terraform for years because the cost of fixing it exceeds the benefit.

When it’s a dealbreaker: it isn’t, but it’s a tax that compounds. Get the initial structure right (separate states per environment, modular code, clear ownership) because retrofitting these later is genuinely expensive.
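The removed + import route for moving a resource between state files can be sketched as below, assuming Terraform 1.7 or later for the removed block. The two halves belong to two different root modules and are shown together only for brevity; the bucket and its name are hypothetical.

```hcl
# --- In the OLD root module (state A) ---
# Stop managing the bucket here without destroying it (Terraform 1.7+).
# Delete the original resource "aws_s3_bucket" "logs" block in the same change.
removed {
  from = aws_s3_bucket.logs

  lifecycle {
    destroy = false # forget the resource, leave the real bucket alone
  }
}

# --- In the NEW root module (state B) ---
# Adopt the same real bucket under the new state (Terraform 1.5+).
import {
  to = aws_s3_bucket.logs
  id = "myorg-logs"
}

resource "aws_s3_bucket" "logs" {
  bucket = "myorg-logs"
}
```

Apply the old root module first, then the new one, and confirm that neither plan shows a destroy or a create for the bucket.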

Downside 8: plan/apply isn’t truly transactional

Where it comes from: cloud APIs aren’t transactional. Terraform walks the dependency graph and applies changes resource by resource (up to 10 at a time by default); if step 30 of 50 fails, the changes that already completed are real and persistent. There is no rollback.

Cost in practice: a half-applied plan leaves your infrastructure in a partial state. Sometimes that’s fine: the failed resource has clear semantics, you fix the cause and re-apply. Sometimes the partial state is itself broken: the load balancer was created but its target group failed, so traffic has nowhere to go. Recovery is manual: read the error, then decide whether to retry, fix forward, or roll back by changing config and re-applying.

When it’s a dealbreaker: in deploys with strict rollback semantics (“if anything fails, undo everything”). Terraform doesn’t give you that natively. You build it on top — feature flags, blue/green at the application layer, careful staging — but it’s not a property of the IaC tool.


12. The Taste Test

How an experienced engineer recognizes good vs bad Terraform at a glance.

Code review red flags (the bad)

Hardcoded values everywhere. AMI IDs as string literals. Region names. Account numbers. t3.medium repeated in 14 places. An experienced engineer would extract these to variables, use data sources to look them up dynamically, or — minimum — parameterize through locals.

Single root module called main with everything in it. The “Terralith.” All environments, all services, all networking, one state. Slow plans, dangerous applies, no isolation. An experienced engineer splits at minimum by environment, ideally by ownership.

Workspaces used to switch between dev/staging/prod. See Gotcha 7. An experienced engineer uses separate directories.

count for non-conditional iteration. count = length(var.users) to create N users. An experienced engineer uses for_each with a map keyed by username.

depends_on everywhere. When depends_on is needed for routine wiring, the engineer didn’t understand reference-based dependency. An experienced engineer uses depends_on only for genuinely invisible dependencies (rare).

Wide-open ignore_changes = all as a way to “make Terraform stop fighting me.” This is a flashing red light — drift detection is being disabled wholesale. An experienced engineer identifies the specific attributes that legitimately drift and ignores those, with comments explaining why.

Provider versions unpinned, lock file uncommitted. An experienced engineer commits .terraform.lock.hcl and treats provider upgrades as deliberate code changes.

State file in git. Any state file in git is a P1 fire. Secrets exposure plus loss-of-source-of-truth.

Sentinel statements like count = some_complicated_ternary that resolve to different values across plans. Conditional creation that flickers. An experienced engineer makes conditions explicit and stable.

Modules that wrap a single resource. module "my_bucket" { source = "./modules/bucket"; name = "x" } calling code that creates one S3 bucket adds nothing but indirection. An experienced engineer reserves modules for genuinely reusable, multi-resource patterns.

Code review green flags (the good)

Separate state per environment, per layer. envs/prod/network/, envs/prod/data/, envs/prod/app/. Each with its own backend block.

for_each with maps keyed by stable names. for_each = var.services, each.key = "api", each.key = "worker".
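A small sketch of that pattern, with hypothetical service names and a placeholder AMI input:

```hcl
# Hypothetical input: the map keys ("api", "worker") become the resource addresses.
variable "services" {
  type = map(object({
    instance_type = string
  }))
  default = {
    api    = { instance_type = "t3.medium" }
    worker = { instance_type = "t3.large" }
  }
}

variable "ami_id" {
  type        = string
  description = "AMI to launch (placeholder input for this sketch)"
}

resource "aws_instance" "service" {
  for_each      = var.services
  ami           = var.ami_id
  instance_type = each.value.instance_type

  tags = {
    Name = each.key # "api" or "worker"
  }
}
```

The instances are addressed as aws_instance.service["api"] and aws_instance.service["worker"], so removing one entry from the map touches only that instance. With count, removing an item from the middle of a list shifts every index after it.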

Variables with types, descriptions, and validation.

variable "instance_type" {
  description = "EC2 instance type for application servers"
  type        = string
  default     = "t3.medium"
  validation {
    condition     = can(regex("^t3\\.(small|medium|large)$", var.instance_type))
    error_message = "instance_type must be t3.small, t3.medium, or t3.large."
  }
}

moved blocks left in for refactoring history. An experienced engineer leaves them in for at least one release cycle so all environments have applied.

Pinned provider versions, committed lockfile, deliberate upgrades.

Targeted ignore_changes with comments.

lifecycle {
  ignore_changes = [
    # autoscaler manages this
    desired_capacity,
  ]
}

Clean module interfaces. Inputs declared with types. Outputs documented. README in the module describing what it creates and what to configure.

Plan output reviewed in PRs as a first-class artifact. An experienced team’s PR template includes “paste the terraform plan output here.”

A side-by-side example

The beginner version:

# main.tf
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "server" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

The experienced version:

# versions.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket         = "myorg-tfstate-prod"
    key            = "envs/prod/app/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# providers.tf
provider "aws" {
  region = var.region
  default_tags {
    tags = local.common_tags
  }
}

# variables.tf
variable "region" {
  type        = string
  description = "AWS region for this environment"
}

variable "environment" {
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod."
  }
}

variable "servers" {
  type = map(object({
    instance_type = string
    purpose       = string
  }))
}

# locals.tf
locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Repo        = "infra"
  }
}

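# data.tf
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

# Looks up the existing VPC referenced by aws_security_group.web.
# The Name tag convention is an assumption; match it to your own VPC.
data "aws_vpc" "main" {
  tags = {
    Name = "${var.environment}-main"
  }
}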

# main.tf
resource "aws_instance" "server" {
  for_each      = var.servers
  ami           = data.aws_ami.amazon_linux.id
  instance_type = each.value.instance_type
  tags = {
    Name    = "${var.environment}-${each.key}"
    Purpose = each.value.purpose
  }
  vpc_security_group_ids = [aws_security_group.web.id]
}

resource "aws_security_group" "web" {
  name_prefix = "${var.environment}-web-"
  description = "Allow HTTP/HTTPS to web tier"
  vpc_id      = data.aws_vpc.main.id
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_vpc_security_group_ingress_rule" "web_http" {
  for_each          = toset(["80", "443"])
  security_group_id = aws_security_group.web.id
  from_port         = each.key
  to_port           = each.key
  ip_protocol       = "tcp"
  cidr_ipv4         = "0.0.0.0/0"
  description       = "HTTP/HTTPS from anywhere"
}

The differences aren’t stylistic — they’re structural. State backend is real. Versions are pinned. Variables have types and validations. AMI is looked up dynamically (no surprise breakage when AWS retires the AMI). Resources use for_each with stable names. Tags are centralized. The security group is split from its rules (no cycle on future cross-references). create_before_destroy is set on the SG so it can be replaced safely. Every choice answers a question the beginner version doesn’t even know to ask.


13. Where to Go Deeper

A short, opinionated list. Skip the rest of the internet for a year and read these.

Terraform: Up & Running by Yevgeniy Brikman (Gruntwork). The book. Get the latest edition you can find; older editions predate modern features such as moved/import/removed blocks. Brikman’s Gruntwork blog series is the same material in long form, free; the book is more polished. Best for: comprehensive coverage with production focus.

The official Terraform documentation, specifically the Internals section. https://developer.hashicorp.com/terraform/internals covers the dependency graph, state, and the plan/apply lifecycle in HashiCorp’s own words. Short, authoritative, easy to miss. Best for: deepening Idea 2 once you’re comfortable with the basics.

The 2016 HashiCorp talk “Applying Graph Theory to Infrastructure as Code” (Paul Hinze). YouTube. Old, still relevant: the graph engine is the same. Best for: when Idea 2 hasn’t fully clicked.

The Terraform GitHub repo, specifically docs/architecture.md. https://github.com/hashicorp/terraform/blob/main/docs/architecture.md. Reading the source’s architecture document (and then the source itself, if you’re curious) is the cleanest way to internalize how Core, providers, and state interact. Best for: the engineer who needs to write a custom provider, or who suspects a Core bug.

Terragrunt by Gruntwork. A wrapper around Terraform that addresses many of the multi-environment / multi-state / DRY problems Terraform alone makes painful. You don’t have to use it — but reading what it solves teaches you what’s annoying about plain Terraform. Best for: after you’ve felt the pain of Terraform at scale.

OpenTofu documentation and changelog. https://opentofu.org/docs/. Especially the state encryption and provider for_each feature pages. Best for: understanding where the open-source side of the ecosystem is going in 2026 and beyond.

Spacelift, env0, or Scalr blog posts on production Terraform patterns. All three companies publish high-quality content from people running Terraform at scale across many customers. Their incentive is to attract customers, but the actual engineering content is typically sound. Best for: applied patterns and edge cases you won’t find in HashiCorp’s docs.

A real codebase. Go read the open-source Terraform code for a major project: Cloud Posse’s terraform-aws-eks-cluster, Gruntwork’s modules on GitHub, the community terraform-aws-modules collection. Reading other people’s mature Terraform (what they parameterize, how they structure modules, where they comment) is more valuable than any tutorial. Best for: developing taste.


14. The Final Verdict

Terraform is the ugly, indispensable plumbing of the modern cloud. It is not elegant. HCL is not a beautiful language. The state file model is fundamentally fragile. The error messages are sometimes baffling. Refactoring a large codebase will make you old. After a decade of polish, the rough edges that remain are structural — they exist because of design decisions made in 2014 that aren’t getting redone.

And yet: there is no better tool for what it does. Multi-cloud, multi-provider, declarative, with a real preview step before you press the button, with the largest provider ecosystem in software, with a workflow that actually fits Git and pull requests. Pulumi is more elegant if you want to write infrastructure in Python. CDK is friendlier if you’re 100% AWS. CloudFormation is more native to AWS. Newer infrastructure-from-code tools promise to eliminate the IaC layer entirely. None of them have anything close to Terraform’s reach, ecosystem, or community. When you need to stand up a real production environment that touches half a dozen vendors, Terraform is what you reach for, and you reach for it because every other competent engineer has reached for it before.

What Terraform got profoundly right, in two things: the plan as a first-class object, and the provider-plugin architecture. The plan turned cloud changes into reviewable diffs and made code review of infrastructure possible — that single design decision is responsible for more of Terraform’s success than HCL or HashiCorp’s marketing combined. The provider architecture meant the rest of the world could write integrations without HashiCorp’s permission, and the result is the only IaC tool with serious coverage of every cloud and most SaaS APIs. Both decisions look obvious in retrospect; neither was, in 2014.

What it gets wrong is the cost of state. The state file is a single, fragile, irreplaceable artifact, and every team above a certain size eventually feels this — through corruption incidents, through lock contention, through hours of state surgery to refactor a module that grew without a plan. The “infrastructure-as-code” model promised reproducibility, but state means a Terraform-managed cloud is not reproducible from code alone. You need code plus state, and state is precious in a way code isn’t. Tools downstream of Terraform — TACOS platforms, drift-detection layers, infrastructure-from-code alternatives — all exist to paper over this. None of them eliminate it.

Reach for Terraform when you have multi-vendor infrastructure (which is approximately every production environment built after 2018), when more than two engineers will touch it, when you need an audit trail of changes, and when the organization has any tolerance for a learning curve. Reach for something else when you’re 100% AWS and want a real programming language (CDK), when your team aggressively rejects any new DSL (Pulumi), when your problem genuinely is just “deploy a microservice on a managed runtime” and you’re better served by a higher-level abstraction (App Runner, Cloud Run, Vercel, Encore), or when you’re a regulated platform vendor where the BSL license is a nontrivial legal hazard (OpenTofu).

The two beliefs you should walk away with: First, the state file is the system. Every weird behavior, every gotcha, every architectural tradeoff in Terraform is a consequence of how state is designed. Once you genuinely internalize this — not as something you’ve read but as the first thing that pops to mind when something breaks — you stop being confused by Terraform. Second, the plan is the artifact, not the code. Code review is necessary but insufficient; the plan is what’s actually going to happen. A team that reads plans carefully has different incidents than a team that doesn’t. Put the plan output in your PRs. Make people read it. Treat unexpected destroys as production fires before they become production fires.

The hard-won line, paid for in incidents: Terraform doesn’t make mistakes. It does exactly what state and code together tell it to. The danger is in the gap between what you think you said and what you actually said, and the only thing standing in that gap is terraform plan. Read it.


The ideas are mine. The writing is AI assisted