Infrastructure as Code

IaC Foundations

A practical introduction to defining infrastructure as code, keeping state aligned, and reviewing changes before they reach production.

Infrastructure as Code (IaC) is the practice of defining and managing infrastructure through version-controlled, machine-readable configuration files rather than manual processes. It solves three persistent problems in infrastructure management: snowflake servers (systems that exist only in one engineer's memory and are irreplaceable), undocumented manual changes (configuration that drifted from anything written down), and unrepeatable environments (the staging and production environments that behave differently for reasons no one can fully explain). In DevSecOps, IaC is essential because it brings infrastructure changes into the same review, audit, and testing discipline applied to application code.

Learning objectives

What you should be able to do after reading.
  • Explain desired state, drift, plan, and apply in a delivery workflow.
  • Describe how modules and state support reuse and control.
  • Recognize the review steps that keep environments predictable.

At a glance

Fast mental model before you dive in.
Model
  • Desired state
  • Drift
  • Plan vs apply
Building blocks
  • Modules
  • State
  • Environments
Operating habits
  • Review before apply
  • Versioned changes
  • Clear ownership

Core idea

IaC tools like Terraform, Pulumi, and AWS CloudFormation use a declarative model. You describe the desired end state of the infrastructure, and the tool calculates the steps needed to reach that state from wherever the infrastructure currently is. This is different from an imperative or procedural approach (like a shell script) that describes each step to take. The declarative model means the same configuration can be applied repeatedly and produces the same result. This is idempotency, and it is the property that makes IaC reliable in automated pipelines.
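The contrast between a procedural step and a declarative apply can be sketched in a few lines of Python. This is only a toy model: the "firewall" is a plain list of ports, and the function names imperative_open_port and declarative_apply are hypothetical, not part of any IaC tool.

```python
from typing import List, Set

def imperative_open_port(rules: List[int], port: int) -> None:
    """Procedural step: always appends, so running it twice changes the result."""
    rules.append(port)

def declarative_apply(rules: List[int], desired: Set[int]) -> List[str]:
    """Converge the live rules onto the desired set and report the actions taken."""
    actions = []
    for port in sorted(desired - set(rules)):
        rules.append(port)
        actions.append(f"add {port}")
    for port in sorted(set(rules) - desired):
        while port in rules:
            rules.remove(port)
        actions.append(f"remove {port}")
    return actions

live = [22]                                      # current state of the toy firewall
first = declarative_apply(live, {443, 8443})     # first run makes changes
second = declarative_apply(live, {443, 8443})    # second run finds nothing to do
print(first)
print(second)
```

The second call returning an empty action list is the idempotency property described above: re-applying the same desired state is safe because the tool compares before it acts.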

The collaboration benefit of IaC is significant. When infrastructure is defined as code, it can be reviewed in a pull request like any other code change. A teammate can ask 'why is this security group open on port 22 to 0.0.0.0/0?' before the resource is created. This is far more effective than discovering the misconfiguration in a cloud security audit after it has been running for months. Code review for infrastructure is not overhead. It is one of the most cost-effective security controls available.

The mental shift IaC requires is to see manual changes to infrastructure (clicking in the cloud console, running ad hoc CLI commands) as exceptions that create technical debt rather than as normal operations. Every manual change is a change that is not in version control, not reviewed, not tested, and not automatically reproducible. Organizations that adopt IaC but continue to make manual changes as a shortcut accumulate the worst of both worlds: they carry the cost of maintaining an IaC codebase, yet that codebase is slowly drifting from reality.

IaC is particularly powerful in DevSecOps because it enables infrastructure security policy to be enforced automatically. Security rules like 'all storage must be encrypted at rest,' 'no internet-facing resources without explicit justification,' and 'all IAM policies must follow least privilege' can be expressed as IaC scanning policies that run in the pipeline before any apply happens. This converts security guardrails from aspirational documentation into enforced code.

Workflow

  • Define infrastructure in code and keep it in version control alongside the applications that depend on it.
  • Always run a plan to preview changes and review the output before applying. Understand what will be created, modified, or destroyed.
  • Use review gates for high-impact changes: any apply that destroys resources, modifies access controls, or touches production should require explicit approval.

Baseline

  • Treat every manual console change as an incident requiring a follow-up IaC commit to bring the code back in sync with reality.
  • Keep environment-specific values (account IDs, resource names, size configurations) in separate variable files. The core infrastructure logic should be reusable across environments.
  • Store state in a remote, locked backend and treat it as a sensitive operational artifact that requires the same access controls as production credentials.

Signals to watch for

Patterns worth investigating further.
  • The deployed environment no longer matches the code.
  • People keep making urgent changes directly in the cloud console.
  • The same configuration behaves differently across environments without a clear reason.

DEEP DIVE

Desired state

Desired state is the target description of what the infrastructure should look like: which resources should exist, what their configuration should be, and how they should relate to each other. IaC is strongest when the desired state is completely specified in code, so that reading the repository gives a complete picture of what should be running, not a partial one supplemented by tribal knowledge, wiki pages, and 'ask Alex.'

The declarative model means the IaC tool is responsible for figuring out how to get from the current state to the desired state. If you add a new resource definition, Terraform creates it. If you remove a resource definition, Terraform destroys it. If you change a property, Terraform updates the resource in place or, when the property cannot be changed on an existing resource, destroys the old resource and creates a new one. The tool handles the transition logic; the operator handles the intent.
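The transition logic can be illustrated with a small diff function. This is a simplified sketch, not how Terraform is implemented: resources are plain dictionaries, and the IMMUTABLE_ATTRS set is an assumption standing in for provider knowledge about which attributes force replacement.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]

# Assumption for illustration: changing any of these attributes forces replacement.
IMMUTABLE_ATTRS = {"availability_zone"}

def plan(current: Dict[str, Resource], desired: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, resource_name) steps that move current toward desired."""
    steps = []
    for name in sorted(desired.keys() - current.keys()):
        steps.append(("create", name))
    for name in sorted(current.keys() - desired.keys()):
        steps.append(("destroy", name))
    for name in sorted(current.keys() & desired.keys()):
        old, new = current[name], desired[name]
        changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
        if not changed:
            continue  # already matches the desired state
        action = "replace" if changed & IMMUTABLE_ATTRS else "update"
        steps.append((action, name))
    return steps

current = {
    "web": {"size": "small", "availability_zone": "a"},
    "db": {"size": "large", "availability_zone": "a"},
    "old_queue": {"size": "small", "availability_zone": "a"},
}
desired = {
    "web": {"size": "medium", "availability_zone": "a"},
    "db": {"size": "large", "availability_zone": "b"},
    "cache": {"size": "small", "availability_zone": "a"},
}
for step in plan(current, desired):
    print(step)
```

The operator only edits the desired dictionary; whether a change becomes an update or a replacement is decided by the diff, which is why the resulting plan deserves review before it is applied.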

The difference between Terraform (desired state) and Ansible (procedural, though Ansible can be made idempotent) reflects different philosophies. Terraform explicitly models the gap between current and desired state using a state file, then plans and applies a minimal change set. Ansible runs tasks in order and relies on task idempotency (idempotent modules, the 'creates' argument on command tasks, and similar guards) to prevent duplicate operations. Both can achieve infrastructure-as-code discipline, but Terraform's model makes it more transparent about what will change before it changes.

Immutable infrastructure is a design philosophy that extends desired state to its logical conclusion. Instead of updating a running resource (modifying a server's configuration in place), you replace it with a new one built from the updated definition. This eliminates an entire class of configuration drift problems, because a resource that is never mutated cannot accumulate ad hoc changes. Container-based architectures make immutable infrastructure practical for application workloads. For persistent databases and stateful services, mutable in-place updates are often still necessary.

Drift

Drift is the divergence between what the IaC code says should exist and what actually exists in the cloud environment. It can be caused by manual changes (an engineer edits a resource in the cloud console to fix an urgent problem), external automation (another system modifies a resource managed by IaC), provider-side changes (the cloud provider updates a default setting), or infrastructure that was created outside the IaC workflow entirely. Small amounts of drift are normal, but unchecked drift accumulates until the IaC code no longer describes reality.

Drift is a security risk because it means there is infrastructure configuration that is not in version control, not reviewed, not subject to IaC security scanning, and potentially not known about by the people responsible for the system. A security group rule added manually to unblock a specific engineer's work may introduce exposure that was never intentionally designed, documented, or audited. Drift detection (running a plan against all environments on a regular schedule and alerting when it reports unexpected changes) is a detective control for this class of problem.
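A scheduled drift check can be sketched around Terraform's -detailed-exitcode flag, which makes terraform plan exit with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The function names and the envs/prod path below are hypothetical, and the runner is injectable so the logic can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable

def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"
    if code == 2:
        # A non-empty plan: drift, or code changes that were never applied.
        return "changes_pending"
    return "error"

def check_environment(workdir: str, run: Callable = subprocess.run) -> str:
    """Run a read-only plan in workdir and classify the result."""
    cmd = ["terraform", "plan", "-detailed-exitcode", "-input=false"]
    result = run(cmd, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit(result.returncode)

# Demonstration with a stand-in runner instead of a real Terraform binary.
def fake_run(cmd, **kwargs):
    return subprocess.CompletedProcess(cmd, 2, stdout="", stderr="")

print(check_environment("envs/prod", run=fake_run))
```

A status of changes_pending on a branch where nothing is waiting to be applied is the signal worth alerting on; the plan text itself then shows which resources diverged.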

When drift is detected, the response requires judgment. Sometimes the live state is correct and the code needs to be updated (someone made a legitimate manual change under time pressure and needs to follow up with a code commit). Sometimes the code is correct and the live state needs to be reverted (the manual change was unauthorized or mistaken). Sometimes neither is clearly correct and the team needs to make a deliberate decision about which state to converge toward. The important thing is that the decision is made explicitly rather than the divergence continuing indefinitely.

The organizational cause of drift is usually that the IaC workflow feels slower or more cumbersome than a direct console change. Engineers who are blocked by a slow pipeline, a complex module, or a configuration they do not understand will reach for the console. Reducing drift requires both technical improvements (faster pipelines, clearer documentation, better module interfaces) and cultural ones (treating manual changes as exceptional rather than normal, and following up with code commits).

Plan vs apply

The plan step generates a preview of the changes that would be made if the configuration were applied to the current infrastructure state. Terraform's plan output shows every resource that would be created (+), modified (~), or destroyed (-), along with the specific property changes for each modified resource. This output is the primary mechanism for human review before any infrastructure changes are made. It answers the question 'what is about to happen?' before it happens.

Reviewing a plan output is a skill that develops with experience. Reviewers should look for: unexpected resource replacements (a modification that requires destroying and recreating a resource, indicated by the -/+ symbol in the plan), large numbers of affected resources (suggesting the change is broader than intended), changes to security-sensitive resources (IAM policies, security groups, encryption settings, public access flags), and changes to production state that should have been tested in staging first.
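Some of these checks can be pre-computed for the reviewer from the machine-readable plan. The sketch below assumes the JSON produced by terraform show -json on a saved plan, where each entry in resource_changes carries an address, a type, and change.actions. The SENSITIVE_TYPES set and the max_changes threshold are illustrative assumptions, not recommended values.

```python
import json
from typing import List

# Assumption: resource types this team treats as security-sensitive.
SENSITIVE_TYPES = {"aws_iam_policy", "aws_security_group", "aws_s3_bucket_public_access_block"}

def review_flags(plan_json: str, max_changes: int = 20) -> List[str]:
    """Return human-readable flags for a reviewer, derived from plan JSON."""
    changes = [
        rc for rc in json.loads(plan_json).get("resource_changes", [])
        if rc["change"]["actions"] not in (["no-op"], ["read"])
    ]
    flags = []
    for rc in changes:
        actions = rc["change"]["actions"]
        if "delete" in actions and "create" in actions:
            flags.append(f"replacement: {rc['address']}")
        elif actions == ["delete"]:
            flags.append(f"destroy: {rc['address']}")
        if rc["type"] in SENSITIVE_TYPES:
            flags.append(f"security-sensitive: {rc['address']}")
    if len(changes) > max_changes:
        flags.append(f"broad change: {len(changes)} resources affected")
    return flags

sample = json.dumps({"resource_changes": [
    {"address": "aws_db_instance.main", "type": "aws_db_instance",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.web", "type": "aws_security_group",
     "change": {"actions": ["update"]}},
    {"address": "aws_instance.app", "type": "aws_instance",
     "change": {"actions": ["no-op"]}},
]})
for flag in review_flags(sample):
    print(flag)
```

Flags like these do not replace reading the plan. They only make it harder to miss a database replacement buried in a long diff.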

In a typical automated CI/CD pipeline, the plan runs automatically on every pull request and its output is posted as a PR comment. This allows the reviewer to see exactly what infrastructure changes the PR includes, not just the diff in the IaC code. The apply step then runs only after the PR is approved and merged to the protected branch. This two-step model enforces the review-before-apply principle even in a fully automated workflow.

A subtle pitfall in plan-based workflows is that the plan is a snapshot of the expected change at a specific moment in time. If another change is applied between the time the plan was generated and the time apply runs, the actual change may differ from what was reviewed. Two safeguards address this. Applying the saved plan file that was reviewed, rather than re-planning at apply time, means Terraform executes exactly the reviewed change set and rejects the plan if the state has changed since it was created. Remote execution tools such as Terraform Cloud and Atlantis add locking around the plan-and-apply cycle, so that concurrent runs against the same state cannot interleave and produce unexpected results.
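The idea behind rejecting a stale plan can be sketched as a version check. Terraform state carries a lineage identifier and a serial number that increases with each change; the SavedPlan class and apply_saved_plan function below are hypothetical simplifications of that idea, not Terraform's actual implementation.

```python
from dataclasses import dataclass

class StalePlanError(Exception):
    """Raised when the state changed after the plan was created."""

@dataclass(frozen=True)
class SavedPlan:
    lineage: str   # identifies which state the plan was computed against
    serial: int    # version of that state at plan time
    summary: str   # what the reviewer approved

def apply_saved_plan(plan: SavedPlan, state: dict) -> dict:
    """Apply only if the state is still the exact version that was reviewed."""
    if (state["lineage"], state["serial"]) != (plan.lineage, plan.serial):
        raise StalePlanError(
            f"plan was made at serial {plan.serial}, state is now at {state['serial']}"
        )
    # The real change would happen here; the sketch only bumps the serial.
    return {**state, "serial": state["serial"] + 1}

state = {"lineage": "3f1c", "serial": 7}
reviewed = SavedPlan(lineage="3f1c", serial=7, summary="resize web tier")
state = apply_saved_plan(reviewed, state)   # succeeds, state moves to serial 8

try:
    apply_saved_plan(reviewed, state)       # same plan again: now stale
except StalePlanError as err:
    print("rejected:", err)
```

The reviewer's approval is tied to one specific version of the world. Once that version is gone, the safe response is to re-plan and re-review rather than apply anyway.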

Modules

A module is a reusable unit of IaC configuration. Modules encapsulate a set of resources and their relationships behind a defined interface. Callers specify inputs (variables) and receive outputs without needing to know the implementation details. A 'vpc' module might accept the CIDR range, the number of availability zones, and whether to enable private subnets, and handle all the subnets, routing tables, internet gateways, and NAT gateways internally.

Module versioning is a critical operational practice. Modules published to the Terraform Registry or to an internal artifact store should be versioned with semantic versioning, and callers should pin to a specific version rather than using a floating reference. An unpinned module reference means that the next time Terraform initializes, it might pull a newer version of the module that has different behavior. This is the infrastructure-as-code equivalent of unpinned package versions in application code.
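A pipeline check for floating references can be quite small. The sketch below assumes the module version constraints have already been extracted from the configuration into a dictionary. The module names are made up, and treating only an exact x.y.z (optionally prefixed with =) as pinned is a deliberately strict assumption.

```python
import re
from typing import Dict, List, Optional

# An exact version such as "1.4.2" or "= 1.4.2"; ranges like "~> 1.4" do not match.
EXACT_VERSION = re.compile(r"^\s*=?\s*\d+\.\d+\.\d+\s*$")

def is_pinned(constraint: Optional[str]) -> bool:
    """True only when the constraint names one specific version."""
    return bool(constraint) and EXACT_VERSION.match(constraint) is not None

def unpinned_modules(constraints: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of modules whose version could float."""
    return sorted(name for name, c in constraints.items() if not is_pinned(c))

# Hypothetical module calls and their version arguments (None = no version given).
modules = {
    "vpc": "3.14.0",
    "database": "~> 2.1",
    "logging": None,
    "queue": "= 1.0.3",
}
print(unpinned_modules(modules))
```

Some teams accept pessimistic constraints such as ~> 2.1 as a compromise; in that case the pattern would need to be relaxed accordingly.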

The interface design principle for modules is: expose only the inputs the caller must control, and hide the implementation details that should stay internal. A module that requires the caller to provide dozens of low-level parameters is not providing meaningful abstraction. It just reorganizes where the complexity lives. A well-designed module has a small, stable interface that allows callers to focus on what they want rather than how to configure it.

Testing modules requires tools beyond Terraform itself. Terratest (Go-based) allows writing integration tests that apply a module to a real cloud environment and verify the resulting infrastructure using assertions. Checkov and OPA policies can be used for unit-style policy tests that check a module's code or plan output without deploying anything. The investment in module testing pays off quickly in organizations with many teams using the same modules, because a bug in a widely used module affects every team that uses it.

State

The Terraform state file is the record that links the resources defined in code to the real resources in the cloud. It stores the resource IDs, configuration attributes, dependencies, and metadata that Terraform needs to correctly calculate what changes are needed on the next apply. Without the state file, Terraform cannot know which cloud resources it manages. Running plan or apply against an environment without state would attempt to create everything from scratch.

State files contain sensitive information. Resource IDs, ARNs, IP addresses, and sometimes actual sensitive values (like database passwords passed as variables) appear in state. Access to the state file gives significant information about the infrastructure. For this reason, state must be stored in a secure remote backend (an S3 bucket with encryption, versioning, and restricted access; Terraform Cloud; or a similar managed backend), not in a local file on a developer's machine or in version control.

State locking prevents concurrent Terraform operations from corrupting the state. If two people or two CI jobs run terraform apply simultaneously against the same state, they can produce conflicting changes and leave the state in an inconsistent condition. Remote backends that support locking (S3+DynamoDB, Terraform Cloud) acquire a lock before any operation that modifies state and release it after completion. A stale lock from a crashed or interrupted operation can block future runs and may require manual clearing.
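The acquire-and-release behavior can be modeled in a few lines. This is an in-memory sketch with hypothetical names (LockingBackend, StateLockedError); a real backend persists the lock remotely so that separate machines see it.

```python
import uuid
from typing import Optional, Tuple

class StateLockedError(Exception):
    """Raised when another operation already holds the state lock."""

class LockingBackend:
    def __init__(self) -> None:
        self._lock: Optional[Tuple[str, str]] = None  # (lock_id, holder)

    def acquire(self, holder: str) -> str:
        """Take the lock or fail immediately if someone else holds it."""
        if self._lock is not None:
            raise StateLockedError(f"state is locked by {self._lock[1]}")
        lock_id = uuid.uuid4().hex
        self._lock = (lock_id, holder)
        return lock_id

    def release(self, lock_id: str) -> None:
        """Release the lock; the caller must present the matching lock ID."""
        if self._lock is None or self._lock[0] != lock_id:
            raise ValueError("lock id does not match the current lock")
        self._lock = None

backend = LockingBackend()
ci_lock = backend.acquire("ci-job-1042")
try:
    backend.acquire("alice-laptop")     # second writer is refused
except StateLockedError as err:
    print(err)
backend.release(ci_lock)                # normal completion frees the state
```

If the first holder crashes before calling release, the lock stays in place. That is the stale-lock case, and clearing it requires an operator to supply the lock ID deliberately, which is the shape of Terraform's force-unlock command.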

Manual state manipulation (terraform state mv, terraform state rm, terraform import) is sometimes necessary but always risky. These commands directly modify the state file in ways that cannot be undone with a simple apply, and mistakes can leave the state out of sync with reality in ways that are difficult to diagnose. Any manual state operation should be preceded by a state backup (terraform state pull > backup.tfstate), performed carefully, and followed by a plan to verify the state is correct before any further applies.

Environments

IaC environments represent distinct deployments of the same infrastructure configuration with different variable values, access controls, and risk tolerances. The two main patterns for managing multiple environments in Terraform are workspace-based (using terraform workspace to maintain separate state files for each environment within the same configuration) and directory-based (separate directories for dev, staging, and prod, each with their own backend configuration and variable files). The directory-based approach is generally preferred for production use because it makes environments more explicit and makes it easier to apply different access controls per environment.

Variable files (terraform.tfvars, or environment-specific files like prod.tfvars) provide the environment-specific values without changing the core infrastructure logic. A production environment might use a larger instance type, more availability zones, stricter network policies, and different CIDR ranges than development, but the same module configuration. Keeping the architecture consistent while varying only the values that need to differ is the ideal that makes environments comparable and easier to reason about.
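The same-logic, different-values idea can be sketched with plain dictionaries standing in for variable files. The keys, values, and the settings_for helper are hypothetical; the point is that an environment may override a value but may not introduce a setting the shared configuration does not know about.

```python
from typing import Any, Dict

# Shared defaults: every environment has exactly these settings.
BASE: Dict[str, Any] = {
    "instance_type": "t3.small",
    "az_count": 2,
    "cidr": "10.0.0.0/16",
}

# Per-environment overrides, playing the role of dev.tfvars and prod.tfvars.
ENV_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "dev": {},
    "prod": {"instance_type": "m5.large", "az_count": 3, "cidr": "10.1.0.0/16"},
}

def settings_for(env: str) -> Dict[str, Any]:
    """Merge overrides onto the defaults, rejecting unknown keys."""
    overrides = ENV_OVERRIDES[env]
    unknown = set(overrides) - set(BASE)
    if unknown:
        raise KeyError(f"{env} overrides unknown settings: {sorted(unknown)}")
    return {**BASE, **overrides}

for env in ("dev", "prod"):
    print(env, settings_for(env))
```

Because both environments resolve to the same set of keys, a difference in behavior between them can be traced to a short list of differing values rather than to a different architecture.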

Blast radius isolation between environments requires separate state, separate cloud accounts or projects (where possible), and separate access controls. If development and production share the same cloud account, a badly scoped IAM policy in the development environment's CI could potentially affect production resources. Separate accounts per environment (the AWS Organizations or GCP Folder model) provide a much harder boundary: an IAM policy misconfigured inside the development account cannot reach production unless cross-account access has been explicitly granted.

Access control for IaC environments should reflect the different risk levels. Development environments might allow developers to run terraform apply directly. Staging environments should be applied only through CI with human approval for significant changes. Production environments should be applied only through CI with required approvals and a protected branch gate, and no human should have direct apply permissions: all production changes flow through the reviewed, audited pipeline.
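These tiers can be written down as an explicit rule so that the pipeline, not memory, decides whether an apply may proceed. The function below is a hypothetical sketch; the approval count for production is an assumed value, and real enforcement would live in the CI system and cloud IAM rather than in a script.

```python
# Assumption for illustration: production needs two approvals.
PROD_REQUIRED_APPROVALS = 2

def may_apply(env: str, *, via_ci: bool, approvals: int = 0,
              from_protected_branch: bool = False, significant: bool = False) -> bool:
    """Decide whether an apply is allowed under the tiered rules above."""
    if env == "dev":
        return True  # developers may apply directly
    if env == "staging":
        # CI only; significant changes also need a human approval.
        return via_ci and (approvals >= 1 or not significant)
    if env == "prod":
        # CI only, from the protected branch, with the required approvals.
        return via_ci and from_protected_branch and approvals >= PROD_REQUIRED_APPROVALS
    raise ValueError(f"unknown environment: {env}")

print(may_apply("dev", via_ci=False))
print(may_apply("staging", via_ci=True, significant=True))
print(may_apply("prod", via_ci=True, approvals=2, from_protected_branch=True))
```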

Change review

Reviewing an IaC change requires understanding both what the code change does and what the resulting plan output means. The code diff shows intent; the plan output shows consequence. A reviewer who reads only the code diff and misses that a property change will cause Terraform to destroy and recreate a production database has not actually reviewed the change. The plan output is a mandatory input to a meaningful IaC review.

Automated policy enforcement with tools like Sentinel (Terraform Cloud's policy framework), OPA (Open Policy Agent), or Conftest brings consistency to change review. Instead of relying on human reviewers to remember every security requirement, policies encode those requirements as machine-checkable rules that run on every plan output. A policy that says 'no S3 bucket may have ACL set to public-read-write' will catch that misconfiguration on every PR regardless of reviewer experience or attention.

The distinction between reviewing syntax (linting) and reviewing intent is important for IaC. Linting and SAST for IaC catch formatting issues, deprecated syntax, and known misconfiguration patterns. Intent review requires asking: Is this the right change? Is the scope bounded? Could this have unintended effects on other resources? Is this the least-privileged implementation? Are there simpler alternatives? These questions require human judgment and domain knowledge that automated tools cannot fully replace.

Common dangerous patterns to look for in IaC reviews include: IAM policies with Action '*' or Resource '*' (wildcard permissions), security groups with ingress 0.0.0.0/0 on sensitive ports, storage resources with public access or public ACLs, encryption settings set to false, resource deletions without corresponding data migration plans, and changes to shared modules or base configurations that affect many environments simultaneously. Any of these in a PR should trigger a higher bar of review and explicit approval from a security-aware team member.
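Several of these patterns are mechanical enough to check automatically, which is what policy tools do at scale. The sketch below uses a simplified, hypothetical resource shape (plain dictionaries with statements, ingress, acl, and encrypted keys) rather than any real provider schema, and the SENSITIVE_PORTS set is an assumption.

```python
from typing import Any, Dict, List

# Assumption: ports that should never be reachable from the whole internet.
SENSITIVE_PORTS = {22, 3389, 5432}
PUBLIC_ACLS = {"public-read", "public-read-write"}

def find_risks(resource: Dict[str, Any]) -> List[str]:
    """Return a finding for each dangerous pattern present in the resource."""
    risks = []
    for stmt in resource.get("statements", []):
        if "*" in stmt.get("actions", []) or "*" in stmt.get("resources", []):
            risks.append("wildcard IAM permission")
    for rule in resource.get("ingress", []):
        if "0.0.0.0/0" in rule.get("cidr_blocks", []) and rule.get("port") in SENSITIVE_PORTS:
            risks.append(f"port {rule['port']} open to 0.0.0.0/0")
    if resource.get("acl") in PUBLIC_ACLS:
        risks.append(f"public ACL: {resource['acl']}")
    if resource.get("encrypted") is False:
        risks.append("encryption disabled")
    return risks

bucket = {"acl": "public-read-write", "encrypted": False}
firewall = {"ingress": [{"port": 22, "cidr_blocks": ["0.0.0.0/0"]},
                        {"port": 443, "cidr_blocks": ["0.0.0.0/0"]}]}
print(find_risks(bucket))
print(find_risks(firewall))
```

Checks like this catch the mechanical cases. Missing data migration plans and the reach of a shared-module change are not visible in a single resource, so those remain questions for the human reviewer.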