Docker Deep Intuition — Deep Tech Intuition

1. One-Sentence Essence

Docker is a packaging format for processes — it doesn’t virtualize hardware, it isolates a single process’s view of the operating system using kernel features that have existed in Linux for over a decade.

Read that twice. Almost every misconception about Docker dissolves once you internalize this. A container is not a small VM. It is not a sandbox in the security sense. It is not a runtime. It is a regular Linux process — the same kind that’s been running on Linux since forever — that has been started with a few extra system calls so that it sees a different filesystem, a different process tree, and a different network than the processes around it. That’s it. The “container” is not a box around the process; it’s a particular configuration of the process. When you stop a container, you don’t shut down a machine — you kill a process.

Hold this in your head as you read everything below.

2. The Problem It Solved

In 2012, deploying server software was painful in a way that’s hard to remember now. You’d write code on your laptop running Ubuntu 12.04. You’d push it to a staging server running Ubuntu 10.04. The Python version was different. A C library was older. An environment variable was missing. Six configuration files lived in /etc that nobody had documented. The dependency graph existed entirely in tribal knowledge. “Works on my machine” was a meme because it was true every single day.

The standard answers were all unsatisfying. Virtual machines (VMware, KVM) solved the consistency problem — package the whole OS, ship the OS — but at brutal cost: each VM was multiple gigabytes, took 30+ seconds to boot, ate hundreds of megabytes of RAM just to host an idle Linux kernel, and required a hypervisor. You could not realistically run twenty small services on one box. Configuration management tools (Chef, Puppet, Ansible) tried to converge a real machine to a known state, but convergence is not the same as reproducibility — you’d still hit “it converged on this machine but not that one” bugs constantly. Static linking helped for a single binary but didn’t help with the surrounding mess of config files, runtime dependencies, and shell environment.

Solomon Hykes and the team at dotCloud — a struggling Platform-as-a-Service company — had been running customer code in Linux containers (LXC) since around 2008 because they needed dense multi-tenant hosting and VMs were too expensive. LXC worked, but it was unstable user-space tooling on top of relatively new kernel features (namespaces, cgroups). They built internal tools to make LXC tolerable, layered a packaging format on top, and added a primitive image-distribution mechanism. In March 2013, after a chaotic PyCon lightning talk, they open-sourced that internal tool as Docker.

The genius wasn’t the technology. Linux namespaces had existed since 2008. Cgroups had existed since 2007. LXC had been around for years. The genius was packaging. Docker took those obscure kernel features and wrapped them in three things developers already understood: a Dockerfile (a recipe), an image (a tarball), and a registry (Git for those tarballs). It made docker run nginx work. The actual primitives were old; the developer experience was new. That is the whole story of why Docker won.

3. The Concepts You Need

Docker introduces a vocabulary you must own before any of the deeper sections will land. Group them into four clusters.

Kernel primitives (what Docker is built from)

Namespace — A Linux kernel feature that gives a process an isolated view of some global resource. There are several kinds: PID (own process tree, your process is PID 1), mount (own filesystem view), network (own network interfaces, routing tables, iptables rules), UTS (own hostname), IPC (own shared memory), user (own UID/GID mapping), and cgroup. When Docker “creates a container,” it’s calling clone() or unshare() with these namespace flags set.
cgroup (control group) — A separate kernel feature that limits and accounts for resource usage by a group of processes: CPU shares, memory ceiling, block I/O, etc. Namespaces give you isolation; cgroups give you resource limits. Both are needed to make a container safe.
Union filesystem (OverlayFS) — A filesystem that presents a stack of read-only directories with a single read-write directory on top, all merged into one apparent directory tree. Reads check top-down; writes go to the top layer; deletes leave a “whiteout” marker. This is how Docker images become layers.
Copy-on-Write (CoW) — Don’t copy a file until something tries to write to it. OverlayFS does this: when a container modifies a file from a read-only lower layer, the kernel copies the file up to the writable layer and modifies it there.
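
To make the namespace idea concrete, here is a minimal sketch that lists the namespaces the current process belongs to by reading the symlinks under /proc/self/ns. It assumes a Linux host (elsewhere it simply returns an empty dict), and the helper name current_namespaces is made up for illustration.

```python
import os

NS_DIR = "/proc/self/ns"  # Linux-only; each entry is a symlink like "pid:[4026531836]"


def current_namespaces(ns_dir: str = NS_DIR) -> dict[str, str]:
    """Map namespace kind (pid, mnt, net, ...) to its kernel identifier."""
    if not os.path.isdir(ns_dir):
        return {}  # not on Linux, or /proc is not mounted
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable; skip it
    return result
```

Run it once on the host and once inside a container: differing identifiers for pid, mnt, and net are the visible sign that the containerized process was placed in its own namespaces.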

Image and packaging concepts

Image — A read-only stack of filesystem layers plus some metadata (entrypoint, env vars, exposed ports, default user). It’s literally a directory of tarballs and a JSON manifest. Images are immutable; you don’t modify them, you build new ones.
Layer — A single set of filesystem changes (files added, modified, deleted). Each instruction in a Dockerfile (roughly) creates one layer. Layers are content-addressed by SHA256 and shared across images, so if ten images use the same Ubuntu base, you store it once.
Container — A running (or stopped) instance of an image. It’s an image + a thin writable layer on top + a configuration (env vars, port mappings, mounts, cgroup limits) + a process. When you docker run, Docker pulls the image, creates a writable layer, sets up namespaces and cgroups, and execs your process inside them.
Dockerfile — A text file describing how to build an image. Each FROM, RUN, COPY, ENV, etc. is an instruction that becomes a layer (or modifies metadata).
Registry — A server that stores and distributes images. Docker Hub is the public default. Most teams run their own (ECR, GCR, Artifactory, Harbor).
Tag — A human-readable name for an image version: nginx:1.27, myapp:latest. Tags are mutable pointers to image digests. The actual immutable identifier is the sha256:... digest.
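
The relationship between content-addressed layers and mutable tags can be sketched with a toy in-memory store. This is an illustration only: LayerStore and its methods are hypothetical names, and real registries address manifests and compressed blobs rather than raw byte strings.

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: blobs are keyed by the SHA-256 of their bytes."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.tags: dict[str, str] = {}  # mutable name -> immutable digest

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def tag(self, name: str, digest: str) -> None:
        self.tags[name] = digest  # re-tagging silently moves the pointer


store = LayerStore()
base = store.put(b"ubuntu base layer")
same = store.put(b"ubuntu base layer")   # a second image using the same base
v1 = store.put(b"app build 1")
v2 = store.put(b"app build 2")

store.tag("myapp:latest", v1)
store.tag("myapp:latest", v2)            # the tag now points somewhere else
```

The two put calls for the base layer return the same digest and occupy one slot, while myapp:latest ends up pointing at a different digest than it did a line earlier. That is the whole difference between a digest and a tag.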

Runtime and orchestration concepts

Docker Engine / dockerd — The long-running daemon that does the actual work. Your docker CLI is a thin client that talks to it over a Unix socket (/var/run/docker.sock) or TCP.
containerd — A lower-level daemon that Docker uses internally to actually manage the container lifecycle. Docker Engine increasingly delegates to containerd. Kubernetes also uses containerd directly, bypassing Docker entirely.
runc — The lowest level still: an OCI-compliant tool that takes a config file and a filesystem and actually makes the namespace/cgroup syscalls. containerd calls runc.
OCI (Open Container Initiative) — The standard that defines what a container image and a container runtime are, independently of Docker. This is why Podman, Kubernetes, and other tools can use Docker images.
Bind mount — A directory on the host filesystem mounted directly into a container. Changes are visible bidirectionally and instantly.
Volume — A managed piece of storage that Docker creates and tracks (lives in /var/lib/docker/volumes/), mounted into a container. Survives container deletion. Preferred for production data.
Network (Docker network) — A virtual network that containers attach to. Default modes are bridge (NAT’d virtual network), host (share host’s network stack — no isolation), and none (no networking at all).

Operational concepts

Docker Compose — A tool for defining and running multi-container apps on a single host using a YAML file. Not an orchestrator in any serious sense — it does not span machines.
Docker Desktop — The macOS/Windows GUI app that runs a hidden Linux VM (containers need a Linux kernel) and proxies the docker CLI to it. On Linux, no VM is needed.
PID 1 — The first process in a process namespace. Special: the kernel won’t deliver default signals to it unless it has explicitly registered handlers. This is the source of one of Docker’s most common production gotchas. (See Section 7.)

These terms will come up over and over. The Mental Model in Section 5 is built directly on top of them.

4. The Distilled Introduction

This section replaces the 10 hours of YouTube tutorials. Read it in order. By the end, you will be able to install Docker, build images, run containers, network them together, persist data, and deploy a multi-container app — and you’ll understand why you’re typing what you’re typing.

Setup

On Linux, install via your package manager (apt install docker.io on Debian/Ubuntu, or follow Docker’s official repo for the latest version). Add your user to the docker group so you don’t need sudo:

sudo usermod -aG docker $USER
# log out and back in
docker run hello-world

That docker group, by the way, is effectively root. Anyone in it can mount the host filesystem into a container as root. Treat membership as you would sudo access. This isn’t a side-note — it’s a fundamental fact about Docker’s architecture (we’ll come back to it in Downsides).

On macOS or Windows, install Docker Desktop. Behind the scenes it spins up a small Linux VM (because containers need a Linux kernel; macOS and Windows don’t have one). Your docker commands talk to that hidden VM. This matters: bind-mount performance is dramatically slower on macOS/Windows than on Linux because every file access crosses a VM boundary. If you’re on a Mac and your app feels slow, this is probably why.

Your first container

docker run --rm -it ubuntu:24.04 bash

Walk through what happened: Docker checked your local cache for an image tagged ubuntu:24.04, didn’t find it, pulled it from Docker Hub (a few tens of MB of compressed layers), set up a new mount namespace with the Ubuntu rootfs, a new PID namespace (your shell is PID 1 inside), a new network namespace (with a virtual interface bridged to the host), and exec’d bash inside. The -it flags allocated a TTY and kept stdin open. The --rm flag says “delete this container when the process exits.”

When you type exit, the bash process dies, the container goes away, and the writable layer is discarded. The image is still cached. Run it again — it starts in milliseconds because nothing has to be downloaded.

Building your own image

Make a directory, drop in a Dockerfile:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]

docker build -t myapp:0.1 .

What happens: Docker reads the Dockerfile, executes each instruction in a temporary container, and snapshots the result as a layer. FROM python:3.12-slim pulls the base image. WORKDIR /app sets the working directory. COPY requirements.txt . copies one file into the image. RUN pip install... runs pip inside a temporary container, captures the resulting filesystem changes, and stores them as a layer. COPY . . copies your source. CMD sets the default command — it’s metadata, not a layer.

Notice the order. requirements.txt is copied before the rest of the source. This is deliberate. Docker caches each layer by hashing its inputs. As long as requirements.txt hasn’t changed, the layer that runs pip install is reused — even if you changed every other file. This is the single most important Dockerfile discipline: put slow-changing things first, fast-changing things last. We’ll return to this in Section 12.
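
The caching rule can be modeled in a few lines. The sketch below is a simplification, not Docker's actual cache implementation: it assumes each step's key is a hash of the previous key, the instruction text, and the bytes of any copied files. The function names and the flask pin are illustrative.

```python
import hashlib


def cache_keys(steps: list[tuple[str, bytes]]) -> list[str]:
    """Return one cache key per step; each key depends on all earlier steps.

    Each step is (instruction_text, input_bytes), where input_bytes stands in
    for the content of any files the instruction copies.
    """
    keys = []
    parent = ""
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        h.update(inputs)
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def build(source: bytes, requirements: bytes) -> list[str]:
    """Keys for a trimmed version of the Dockerfile above."""
    return cache_keys([
        ("FROM python:3.12-slim", b""),
        ("COPY requirements.txt .", requirements),
        ("RUN pip install --no-cache-dir -r requirements.txt", b""),
        ("COPY . .", source),
    ])


before = build(source=b"print('v1')", requirements=b"flask==3.0.0")
after = build(source=b"print('v2')", requirements=b"flask==3.0.0")
reused = [a == b for a, b in zip(before, after)]  # which steps hit the cache
```

Changing only the source leaves the first three keys identical, so the pip install step is reused. Changing requirements.txt instead would alter the second key and, through the chain, every key after it.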

Running and managing containers

docker run -d --name web -p 8080:80 nginx:1.27

-d runs detached (background). --name gives it a stable name instead of a random one. -p 8080:80 publishes container port 80 to host port 8080. Now curl localhost:8080 hits nginx.

Useful day-one commands:

docker ps                         # running containers
docker ps -a                      # all containers, including stopped
docker logs web                   # stdout/stderr from the container
docker logs -f web                # follow logs
docker exec -it web bash          # shell inside a running container
docker stop web                   # send SIGTERM, then SIGKILL after 10s
docker rm web                     # delete a stopped container
docker rm -f web                  # stop and delete in one go
docker images                     # local images
docker rmi myapp:0.1              # delete an image
docker system df                  # how much disk Docker is using
docker system prune               # delete stopped containers, unused networks, dangling images
docker system prune -a --volumes  # nuclear option: clean everything not currently in use

docker exec is your bread and butter for debugging. When something is wrong, exec into the running container and look around. It’s just a process: ps, ls, env, and cat on files under /etc all work, provided the image ships those tools.

Volumes and bind mounts

Containers are ephemeral. The writable layer dies with them. For anything that needs to persist — a database, uploaded files, logs — you need either a volume or a bind mount.

# Volume: managed by Docker, lives in /var/lib/docker/volumes/
docker run -d --name pg \
  -v pgdata:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=secret \
  postgres:16

# Bind mount: a host directory mapped into the container
docker run --rm -v "$PWD":/app -w /app node:20 npm test

The first form persists Postgres data even if you docker rm the container. Recreate it with the same volume mount and your data is still there. The second form is the typical dev pattern: your source code on the host is visible inside the container, so edits on your laptop are immediately reflected — no rebuild needed.

Two rules:

Volumes for production data. Databases, uploads, anything you care about. They’re managed, portable across container recreations, and don’t tie you to a specific host path.
Bind mounts for development and config. Source code, config files, sockets like /var/run/docker.sock. Don’t bind-mount production data — you couple yourself to host paths and filesystem types.

Networking

By default, containers attach to a virtual bridge network called bridge. They get an IP like 172.17.0.2, can reach the internet via NAT, and can reach each other by IP. They cannot reach each other by name on the default bridge — DNS is only enabled on user-defined networks. This is one of the most common stumbling blocks for new users.

# Create a user-defined network
docker network create app-net

# Run containers on it
docker run -d --name db --network app-net postgres:16
docker run -d --name api --network app-net -e DB_HOST=db myapp:0.1

Now api can connect to Postgres at hostname db — Docker’s embedded DNS resolves the container name. This is the right pattern. Always use a user-defined network for multi-container apps. The default bridge is a legacy holdover.

Other network modes:

--network host — Container shares the host’s network namespace. No isolation, no port mapping needed. Fastest, but you lose all the safety. Useful for monitoring agents, rare otherwise.
--network none — No networking at all. Useful for pure batch jobs that don’t need the network.
overlay — A driver for cross-host networking, used by Docker Swarm. Outside the scope of single-host Docker.

Multi-container apps with Docker Compose

Once you have more than one container, hand-typing docker run commands gets old. Docker Compose lets you describe the whole app in a compose.yaml:

services:
  web:
    build: .
    ports:
      - "8080:80"
    depends_on:
      - db
    environment:
      DB_HOST: db
  db:
    image: postgres:16
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_PASSWORD: secret
volumes:
  pgdata:

Then docker compose up -d brings everything up; docker compose down tears it down; docker compose logs -f tails everything; docker compose exec db psql -U postgres opens a psql prompt inside the db container. Compose automatically creates a user-defined network for the project, so web can reach db by name. This is the right way to develop and run multi-container apps on a single host.

depends_on is weaker than people expect: it controls startup order, not readiness. Compose will start db before web, but db might still be initializing when web tries to connect. Either add a healthcheck and depends_on: db: condition: service_healthy, or — better — make your app retry connections at startup. Real services need to handle this anyway.
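
A minimal sketch of the "retry at startup" approach, assuming the dependency is reachable over TCP. The function name and its defaults are illustrative. Note that a successful TCP connect only shows the port is open; a database may still need a moment before it accepts queries, so a real client should retry its first query as well.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  initial_delay: float = 0.1, max_delay: float = 2.0) -> bool:
    """Retry a TCP connect with exponential backoff until it succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            time.sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

In the Compose file above, the web service could call wait_for_port("db", 5432) before opening its connection pool, and exit with an error if it returns False.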

Multi-stage builds: the production-ready Dockerfile pattern

A naive Dockerfile ships compilers, build tools, dev dependencies, and test data to production. This is wasteful (multi-GB images) and dangerous (every tool is an attack surface). Multi-stage builds fix this:

# --- builder stage ---
FROM golang:1.25 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# --- runtime stage ---
FROM gcr.io/distroless/static-debian12
COPY --from=builder /app /app
USER nonroot
EXPOSE 8080
ENTRYPOINT ["/app"]

The final image contains only the static binary and a minimal base — no Go toolchain, no shell, no package manager. From hundreds of MB to ~15 MB. The attack surface collapses to almost nothing. For any compiled language and most interpreted ones too, multi-stage is the default. If your production image contains a compiler, you’ve done it wrong.

Pushing and pulling

docker tag myapp:0.1 ghcr.io/myorg/myapp:0.1
docker login ghcr.io
docker push ghcr.io/myorg/myapp:0.1

On another machine:

docker pull ghcr.io/myorg/myapp:0.1
docker run -d ghcr.io/myorg/myapp:0.1

This is the whole point. Build once, run anywhere with a Docker daemon. The image is the artifact; the registry is the distribution channel.

What you can now do

You can build images, run containers, persist data, network them, compose multi-container apps, and push them to a registry. You know enough Docker to ship a real application. The next sections explain why what you’ve just learned actually works the way it does — and how to think about it when things get weird.

5. The Mental Model

Four core ideas. Internalize these and Docker stops being magic.

Core Idea 1: A container is a process, not a machine.

This is the single most important thing. The kernel features Docker uses — namespaces and cgroups — operate on processes. When you run a container, you are not booting a tiny VM. You are starting a regular Linux process with some special flags that change what it can see and what resources it can use. The “container” is the namespace+cgroup configuration around that process; it is not a thing that exists when no process is running.

This predicts a lot:

A “stopped container” is not a paused machine. It’s a saved configuration plus a writable layer on disk. Restarting it just starts a new process with the same setup.
There’s no “boot time.” A container starts in milliseconds because it’s just fork+exec with extra flags. There’s no kernel to boot, no init system to run.
Your container has no kernel. It uses the host’s kernel. This is why containers are tiny and fast — and why container security relies entirely on the host kernel being secure.
“One process per container” is not a stylistic preference; it’s the design’s natural shape. If your process exits, the container exits. Run multiple processes via something like supervisord and you’ve defeated the design.
Logs are stdout/stderr of that one process. That’s the whole logging contract. If your app writes to a file, Docker doesn’t capture it. Write to stdout.
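
For the logging contract in particular, a small sketch of an app that logs to stdout rather than to a file. The logger name "app" and the format string are arbitrary choices for the example.

```python
import logging
import sys


def configure_logging(stream=sys.stdout, level: int = logging.INFO) -> logging.Logger:
    """Send all log records to a stream (stdout by default) instead of a file."""
    logger = logging.getLogger("app")  # "app" is an arbitrary example name
    logger.setLevel(level)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

With this in place, docker logs shows everything the app reports, and no log file accumulates in the container's writable layer.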

Core Idea 2: An image is a stack of immutable layers; a container is that stack plus one writable layer on top.

Images don’t change. When you “modify” an image, you build a new one. Each Dockerfile instruction creates a new layer that records the difference from the previous one. Layers are content-addressed (SHA256) so identical layers are stored once globally, regardless of how many images use them.

When you start a container, the kernel mounts those read-only layers as a stack via OverlayFS, then puts a thin writable layer on top. Reads go top-down through the stack until a file is found; writes go to the top layer; deletes leave a “whiteout” marker that hides the underlying file.
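
The lookup rules are easy to model. The following is a toy in-memory model of those rules, not OverlayFS itself: layers are plain dicts, the class and method names are invented for illustration, and copy-up is reduced to a whole-file write into the upper layer.

```python
WHITEOUT = object()  # marker meaning "this path was deleted in this layer"


class OverlayView:
    """Toy model of an overlay mount: read-only lower layers plus one upper layer."""

    def __init__(self, lower_layers: list[dict[str, bytes]]) -> None:
        self.lowers = lower_layers      # ordered bottom to top, never mutated
        self.upper: dict = {}           # the container's writable layer

    def read(self, path: str) -> bytes:
        # Search top-down: upper first, then lowers from newest to oldest.
        for layer in [self.upper, *reversed(self.lowers)]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    break               # hidden by a deletion above
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path: str, data: bytes) -> None:
        self.upper[path] = data         # writes only ever touch the upper layer

    def delete(self, path: str) -> None:
        self.read(path)                 # raises if the path is not visible
        self.upper[path] = WHITEOUT     # hide it; lower layers are untouched


image = [{"/etc/os-release": b"ubuntu"}, {"/app/key": b"s3cret"}]
c1, c2 = OverlayView(image), OverlayView(image)  # two containers, one image
c1.delete("/app/key")
c1.write("/etc/os-release", b"patched")
```

After these calls, c1 no longer sees /app/key and reads its own patched os-release, while c2 and the image layers are exactly as they were.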

This predicts:

Layer caching is just hashing. If the inputs to a layer are unchanged, Docker reuses the cached output. This is why Dockerfile order matters so much: change a layer and every layer built after it (every later instruction in the Dockerfile) must rebuild.
Deleting a file in a later layer doesn’t remove it from the image. It’s still there in the lower layer; the new layer just hides it. So RUN rm secret.key doesn’t actually delete secret.key from the image — it’s still in the previous layer, accessible to anyone who pulls the image. This is a constant source of leaked secrets.
Containers from the same image share storage. Run 100 containers of nginx; the nginx layers are stored once. Each container’s writable layer is independent.
The writable layer is slow. Writing to a file in a lower layer means the kernel must “copy up” the entire file to the writable layer first, then modify it. For databases or anything write-heavy, never use the writable layer — mount a volume.
Images are tarballs, not VMs. You can docker save an image to a tarball, scp it, and docker load it elsewhere. There’s nothing magical about the registry; it’s just the convenient distribution mechanism.

Core Idea 3: Isolation is selective, not total.

A container is isolated from the host along certain axes (filesystem view, process view, network view, hostname) and not along others. Time is shared. The kernel is shared. Block devices are (usually) shared. Hardware is shared. Memory is shared except for the cgroup limit you set.

This predicts:

A kernel exploit in one container is a kernel exploit in all containers and the host. The kernel is the trust boundary, and there’s only one of it.
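Containers cannot run a different kernel than the host’s. A container can ship another distribution’s userland (Alpine on an Ubuntu host, say), but every system call lands in the host kernel. “Linux containers on Windows” means a hidden Linux VM is running on Windows; the containers run inside that VM’s Linux kernel. Docker Desktop hides this from you, but it’s there.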
A container without resource limits can OOM the host. The default cgroup config gives a container access to all available memory and CPU. A runaway container can take down everything else, including the host. Always set --memory and --cpus in production.
File ownership matters across the boundary. If a container runs as root (UID 0) and bind-mounts a host directory, files it creates are owned by root on the host too. This is a constant source of “why can’t I delete these files my container made” bugs.
Containers are not a security boundary you should rely on for adversarial code. They are an isolation mechanism for cooperating code. For untrusted workloads, use VMs or specialized runtimes (gVisor, Kata Containers, Firecracker).
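
A process can check which memory ceiling its cgroup imposes. The sketch below assumes a cgroup v2 host, where a container typically sees its own limit at /sys/fs/cgroup/memory.max; on cgroup v1 or other layouts the file is elsewhere and this helper just returns None. The function names are illustrative.

```python
from pathlib import Path
from typing import Optional

CGROUP_V2_MEMORY_MAX = Path("/sys/fs/cgroup/memory.max")  # cgroup v2 layout assumed


def parse_memory_max(text: str) -> Optional[int]:
    """Parse the contents of memory.max: "max" means unlimited, else a byte count."""
    value = text.strip()
    return None if value == "max" else int(value)


def container_memory_limit(path: Path = CGROUP_V2_MEMORY_MAX) -> Optional[int]:
    """Return the memory ceiling in bytes, or None if unlimited or undetectable."""
    try:
        return parse_memory_max(path.read_text())
    except (OSError, ValueError):
        return None
```

Inside a container started with --memory 512m on such a host you would expect 536870912; without the flag the file reads max, which is the "unlimited by default" behavior described above.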

Core Idea 4: The Docker daemon owns everything; the CLI is a thin client.

When you type docker run, your CLI doesn’t do the work. It sends a JSON request to dockerd over a Unix socket. The daemon is the long-running process that pulls images, sets up namespaces and cgroups, manages the container lifecycle, captures logs, etc. The daemon runs as root.

This predicts:

Docker socket access = root on the host. Anyone who can talk to /var/run/docker.sock can mount / from the host into a container as root. Bind-mounting the socket into a container (often loosely called “docker-in-docker”, though the container is really driving the host’s daemon) is therefore equivalent to giving that container root on the host. Treat socket access like sudo.
The daemon is a single point of failure. If dockerd crashes or hangs, every container on that host is affected. Tools that depend on the API (your CI, your monitoring) break too. This is one of the fundamental design weaknesses Podman tries to address by going daemonless.
Docker doesn’t run on your laptop. On macOS and Windows, the daemon runs in a hidden Linux VM. Your CLI talks to it over a network. Bind-mounted files cross the VM boundary, which is why file watching and I/O performance are so much worse on Mac/Windows than on Linux.
You can drive Docker from anywhere. Set DOCKER_HOST=ssh://user@server and your local CLI controls a remote daemon. This is also why exposing the API over TCP without TLS is catastrophic — anyone on the network gets root.

These four ideas are the spine. Every gotcha, every judgment call, every failure mode in the rest of this document is a direct consequence of one of them.

6. The Architecture in Plain English

When you type docker run -p 8080:80 nginx, here’s what happens, end to end.

Step 1 — CLI to daemon. Your docker CLI parses the command, builds a JSON payload describing what you want, and POSTs it to the Docker daemon over /var/run/docker.sock. The CLI doesn’t know how to make a container; the daemon does.
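
To see that the CLI really is just an HTTP client, here is a hedged sketch that speaks to the daemon directly over the Unix socket using only the standard library. It calls the Engine API's GET /_ping endpoint; the class and function names are made up for the example, and running it against the real socket requires the same permissions the docker CLI needs.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0) -> None:
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def ping_daemon(socket_path: str = "/var/run/docker.sock") -> tuple[int, str]:
    """Call GET /_ping and return (HTTP status, response body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()
```

Everything docker run does starts the same way, only with POST requests carrying JSON bodies instead of a bare ping.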

Step 2 — Image resolution. The daemon checks its local image cache for nginx:latest. Not there? It contacts Docker Hub (registry-1.docker.io), authenticates if needed, and downloads the image manifest — a JSON document listing the layers by SHA256. For each layer it doesn’t already have, it downloads the layer (a gzipped tarball), verifies the digest, and extracts it into /var/lib/docker/overlay2/<layer-id>/. Layers it already has are reused.
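
How a short name like nginx expands into a full reference can be sketched as follows. This is a simplified parser written for illustration (digest references with @sha256 are not handled), and docker.io is the canonical registry name that the client maps to the actual Docker Hub endpoint.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Expand a short image name roughly the way docker pull does (no digests)."""
    first, slash, rest = ref.partition("/")
    # The first component is a registry host only if it looks like one.
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref
    name, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:      # no tag present
        name, tag = remainder, "latest"
    if registry == "docker.io" and "/" not in name:
        name = "library/" + name     # official images live under library/
    return ImageRef(registry, name, tag)
```

So parse_image_ref("nginx") yields the registry docker.io, the repository library/nginx, and the tag latest, which is why the daemon looks for nginx:latest in this step.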

Step 3 — Filesystem assembly. The daemon hands off to containerd, which hands off to the OverlayFS storage driver. OverlayFS mounts the layer directories as a stack — lowerdir=layer1:layer2:... — with a fresh empty upperdir (the container’s writable layer) and a workdir (kernel scratch space). The result is mounted at a merged directory: a single unified filesystem view. Inside, you see what looks like a complete Linux rootfs.

Step 4 — Network setup. The daemon creates a network namespace for the container. It creates a veth pair — two virtual ethernet interfaces, one in the host network namespace and one moved into the container’s namespace. The host-side end is attached to the docker0 bridge. The container-side end gets an IP from Docker’s IPAM (something like 172.17.0.2). For the -p 8080:80 flag, the daemon adds an iptables DNAT rule on the host so that traffic to host port 8080 is rewritten to 172.17.0.2:80.

Step 5 — Cgroup and namespace setup. containerd writes an OCI runtime config — a JSON file describing all the namespaces, mounts, env vars, capabilities, cgroup limits, and the command to run. It calls runc create, which is a small Go binary that does the actual kernel calls: it clone()s a process with the right namespace flags, sets up cgroups by writing to /sys/fs/cgroup/..., drops capabilities, sets the user, chdirs into the container’s rootfs, and calls pivot_root to make the container’s filesystem the new /.
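
For a feel of what that runtime config contains, the sketch below builds a pared-down version as a Python dict. The field names follow the OCI runtime specification, but this is deliberately incomplete (no mounts, no capabilities), so treat it as a reading aid rather than something runc would accept as-is. The helper name, hostname, and default limit are illustrative.

```python
import json


def minimal_oci_config(args: list[str], rootfs: str = "rootfs",
                       memory_limit_bytes: int = 256 * 1024 * 1024) -> dict:
    """Build a pared-down OCI runtime config as a Python dict."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": args,
            "cwd": "/",
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "user": {"uid": 0, "gid": 0},
        },
        "root": {"path": rootfs, "readonly": False},
        "hostname": "demo",
        "linux": {
            # One entry per namespace the runtime should create for the process.
            "namespaces": [{"type": t} for t in ("pid", "mount", "network", "uts", "ipc")],
            # cgroup limits; here only a memory ceiling in bytes.
            "resources": {"memory": {"limit": memory_limit_bytes}},
        },
    }


config = minimal_oci_config(["nginx", "-g", "daemon off;"])
config_json = json.dumps(config, indent=2)
```

Reading the dict top to bottom maps directly onto the step above: the command to exec, the rootfs to pivot into, the namespaces to create, and the cgroup limits to write.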

Step 6 — Process exec. runc execs the container’s entrypoint (nginx -g "daemon off;" for an nginx image). From the kernel’s view, this is just another process. From the process’s view, it is PID 1, with an nginx-shaped filesystem, an IP address assigned from outside, and a /proc that lists only its own process tree. It’s a normal process that thinks it’s the only thing alive.

Step 7 — Logging and lifecycle. The daemon captures the process’s stdout and stderr through pipes inherited at exec time. It writes them to a JSON-formatted log file in /var/lib/docker/containers/<id>/<id>-json.log. When you run docker logs, the daemon reads from that file. If the container exits, the daemon notices via the death of the process, marks the container as stopped, and (if --rm was set) cleans up the writable layer and removes the network configuration.
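
The default json-file log format is one JSON object per line with log, stream, and time keys, so reading it back is straightforward. The parser below is a minimal sketch, and the two sample lines are made-up data for illustration.

```python
import json
from typing import Iterator, NamedTuple


class LogEntry(NamedTuple):
    time: str
    stream: str   # "stdout" or "stderr"
    message: str


def parse_json_log(lines) -> Iterator[LogEntry]:
    """Yield entries from lines in the json-file driver's format, skipping blanks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield LogEntry(record["time"], record["stream"], record["log"])


# Made-up sample lines in the same shape as a container's <id>-json.log file.
sample = [
    '{"log":"starting\\n","stream":"stdout","time":"2024-01-01T00:00:00.000000000Z"}',
    '{"log":"oops\\n","stream":"stderr","time":"2024-01-01T00:00:01.000000000Z"}',
]
entries = list(parse_json_log(sample))
```

This is essentially what docker logs does for you, with the stream field letting it separate stdout from stderr.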

Step 8 — You stop it. docker stop <container> (using the generated name or ID shown by docker ps, since this run did not pass --name) makes the daemon send SIGTERM to PID 1 inside the container. It waits 10 seconds. If the process hasn’t exited, it sends SIGKILL. The kernel tears down the namespaces (they’re refcounted; when the last process in a namespace dies, the namespace is destroyed). The veth pair is removed; the iptables rule is cleaned up; the OverlayFS mount is unmounted; the writable layer remains on disk (or is deleted if --rm was set).

Where state lives, summarized:

Images: /var/lib/docker/overlay2/<layer-id>/diff/ (one directory per layer)
Container writable layers: same place, treated as the topmost upperdir
Volumes: /var/lib/docker/volumes/<name>/_data/
Container metadata, logs: /var/lib/docker/containers/<id>/
Network state: kernel iptables rules + Linux bridge interfaces
Daemon state: in-memory (rebuilt on startup from disk)

If you rm -rf /var/lib/docker, you’ve nuked everything Docker knows about. The daemon will be very confused. Containers running at the time will become orphans.

7. The Things That Bite You

These are the gotchas that consume the first six months of using Docker. Each is a direct consequence of the mental model from Section 5.

7.1 The PID 1 problem (graceful shutdown doesn’t work)

What you’d expect: docker stop cleanly shuts down your app.

What happens: Your app sits there ignoring SIGTERM. Ten seconds later, Docker SIGKILLs it. In-flight requests are dropped, files aren’t flushed, deployments are slow.

Why: The Linux kernel treats PID 1 specially. The “init” process is supposed to be sacred — if PID 1 dies, the whole system goes down — so the kernel refuses to deliver default-action signals to PID 1 unless the process has explicitly registered a handler. Docker runs your app as PID 1. If your app or the language runtime hasn’t registered a SIGTERM handler, the kernel silently drops the signal. SIGKILL still works (the kernel handles that itself), which is why docker stop eventually succeeds — but only after the 10-second timeout.

This bites you a second time if your Dockerfile uses shell form CMD (CMD python app.py), which puts /bin/sh -c as PID 1 instead of your app. The shell doesn’t forward signals to its children, so your app never even sees SIGTERM, even if it would have handled it.

How to handle it: Always use exec form: CMD ["python", "app.py"]. Make your app handle SIGTERM. For multi-process containers or when you can’t change the app, use tini as PID 1: ENTRYPOINT ["/usr/bin/tini", "--"] (or docker run --init). Tini is a 10KB init that proxies signals and reaps zombies. Predicted by Mental Model 1: a container is a process, and PID 1 is special.
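
On the application side, "handle SIGTERM" can be as small as the following sketch. It assumes a simple polling main loop; serve_forever and its body are hypothetical placeholders for whatever your app actually does.

```python
import signal
import threading

shutdown_requested = threading.Event()


def _handle_sigterm(signum, frame) -> None:
    # Keep the handler tiny: just record the request and let the main loop react.
    shutdown_requested.set()


def install_handlers() -> None:
    """Register handlers so the process reacts to SIGTERM even when it is PID 1."""
    signal.signal(signal.SIGTERM, _handle_sigterm)
    signal.signal(signal.SIGINT, _handle_sigterm)


def serve_forever(poll_seconds: float = 0.5) -> None:
    """Hypothetical main loop: do work until a shutdown signal arrives."""
    install_handlers()
    while not shutdown_requested.wait(poll_seconds):
        pass  # handle requests, process jobs, etc.
    # Flush buffers, close connections, and finish in-flight work here.
```

Because a handler is now registered, the kernel delivers SIGTERM to the process even as PID 1, and docker stop returns as soon as the loop exits instead of waiting for the timeout.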

7.2 latest is not a version

What you’d expect: nginx:latest pinning to “the current latest version” makes builds reproducible.

What happens: Your build worked yesterday and broke today because latest now points to a different image. Or worse: your production runs a different latest than your CI ran.

Why: A tag is just a mutable pointer to an image digest. latest is convention, not magic — it’s whatever the publisher last pushed there. There is no rule that latest must be stable, recent, or even the newest version.

How to handle it: Pin to a specific version (nginx:1.27.4) or, for true immutability, a digest (nginx@sha256:abc123...). In production, always pin. :latest is fine for tinkering and ephemeral local use; it has no business in a Dockerfile or compose.yaml that anyone else will run.

7.3 Secrets in layers persist forever

What you’d expect: Writing a secret in one instruction (COPY secret.key /tmp/key, say) and deleting it in a later one (RUN rm /tmp/key) removes the secret from the image.

What happens: Anyone who pulls the image can docker save it, unpack the layer tarballs, and read the layer that wrote the secret to disk. The deletion only added a whiteout in a later layer; the secret file is still in the earlier layer, fully readable. Passing the secret as a build argument is no better: ARG values show up in docker history.

Why: Mental Model 2. Layers are immutable. Deletion is a layer-level operation that hides, doesn’t erase.

How to handle it: Never RUN-write secrets. Use BuildKit’s --mount=type=secret so the secret is mounted only during the RUN and never enters any layer. For runtime secrets, use environment variables, secret managers, or mounted files — not bake-time injection. Treat any image whose Dockerfile touches a secret as compromised; rotate the secret and rebuild.
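
To see why a later deletion does not help, the sketch below builds two tiny in-memory tar archives standing in for image layers and then searches them the way someone inspecting a saved image could. The file names and the secret value are made up; the .wh. prefix is the whiteout naming convention used in image layer tarballs.

```python
import io
import tarfile


def make_layer(files: dict[str, bytes]) -> bytes:
    """Pack files into an in-memory tar archive, standing in for one image layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def find_in_layers(layers: list[bytes], name: str) -> list[bytes]:
    """Return the content of every copy of `name` across all layers."""
    found = []
    for blob in layers:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
            for member in tar.getmembers():
                if member.name == name and member.isfile():
                    found.append(tar.extractfile(member).read())
    return found


layers = [
    make_layer({"tmp/key": b"hunter2"}),   # an earlier instruction wrote the secret
    make_layer({"tmp/.wh.key": b""}),      # a later instruction "deleted" it (whiteout)
]
leaked = find_in_layers(layers, "tmp/key")
```

The merged filesystem would show no tmp/key, yet the search still finds its content in the first layer. That is the leak.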

7.4 The default bridge has no DNS

What you’d expect: Containers on the default network can talk to each other by name.

What happens: curl http://api fails with “could not resolve host.” But it works on a network you created yourself.

Why: Docker’s embedded DNS server is only attached to user-defined networks, not the default bridge. The default bridge predates the embedded DNS server and was kept this way for backward compatibility.

How to handle it: Always create a user-defined network. Compose does this automatically. The default bridge is a legacy you should ignore.

7.5 `COPY . .` invalidates everything below it

What you’d expect: You changed one line of code; only the affected layers should rebuild.

What happens: Every change to any source file forces pip install (or npm install, or whatever) to re-run, even though dependencies didn’t change.

Why: Layer caching hashes the layer’s inputs. COPY . . includes every file in the build context. Change one file, the hash changes, the cache is invalidated, and every layer below also rebuilds (because they depend on this one).

How to handle it: Copy dependency manifests separately and install dependencies before copying source. The dependency layer stays cached as long as the manifest doesn’t change.

COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

Always order your Dockerfile from least-frequently-changing to most-frequently-changing. Use .dockerignore to exclude node_modules, .git, build outputs, and large unrelated files from the build context.
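
The chaining effect can be sketched with a toy cache key. This is a simplified model, not BuildKit’s real algorithm: layer_key and build_keys are hypothetical helpers, and cksum stands in for the content hash. Each key covers the parent key, the instruction text, and the contents of any copied files.

```shell
work=$(mktemp -d)
printf '{"name":"demo"}\n' > "$work/package.json"
printf 'console.log(1)\n'  > "$work/app.js"

# layer_key PARENT_KEY INSTRUCTION [FILE...]
# Toy cache key: checksum of the parent key, the instruction text, and the
# contents of any files the instruction copies in.
layer_key() {
  parent=$1
  instruction=$2
  shift 2
  {
    printf '%s\n%s\n' "$parent" "$instruction"
    if [ "$#" -gt 0 ]; then cat "$@"; fi
  } | cksum | cut -d' ' -f1
}

# Keys for the dependency-first Dockerfile: manifests, install, then source.
build_keys() {
  manifests=$(layer_key base "COPY package.json ./" "$work/package.json")
  install=$(layer_key "$manifests" "RUN npm ci")
  src=$(layer_key "$install" "COPY . ." "$work/package.json" "$work/app.js")
  echo "$install $src"
}

before=$(build_keys)
printf 'console.log(2)\n' > "$work/app.js"   # edit one source file
after=$(build_keys)

echo "npm ci key:   ${before% *} -> ${after% *}"
echo "COPY . . key: ${before#* } -> ${after#* }"
```

After the edit, the npm ci key is unchanged (a cache hit) while the COPY . . key differs, so only that layer and everything after it would rebuild.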

7.6 Bind-mounted files are owned by the container’s user

What you’d expect: Files my container creates in a bind-mount belong to me on the host.

What happens: The files are owned by root (or some weird UID like 999). You can’t delete them without sudo.

Why: Mental Model 3. Isolation is selective; UIDs are not isolated by default. The container ran as root (UID 0), so files it wrote are owned by UID 0 on the host’s view of the same filesystem.

How to handle it: Run the container as your user: docker run --user "$(id -u):$(id -g)" .... Or build images with a non-root USER directive matching the dev user. On Linux this works cleanly; on macOS/Windows, Docker Desktop does some translation that mostly works but occasionally surprises you. User namespace remapping (the daemon’s userns-remap setting) is a more robust solution but is not enabled by default.

7.7 Containers run unlimited by default

What you’d expect: A misbehaving container can’t bring down the host.

What happens: A memory leak in one container exhausts the host’s RAM. The kernel’s OOM killer fires, and it’s just as likely to kill an unrelated critical process — or your monitoring agent, or the SSH daemon — as it is to kill the offending container.

Why: Without --memory and --cpus, the cgroup is unconstrained. The container can use whatever the host has.

How to handle it: Set memory and CPU limits on every production container: docker run --memory=512m --cpus=0.5 .... In Compose, use mem_limit (Compose v2) or deploy.resources.limits. In Kubernetes, nothing is limited by default either; set resources.limits on each container, or have a namespace LimitRange supply defaults. The difference between “container went OOM and was restarted” (acceptable) and “host went down” (not acceptable) is just a flag.

7.8 Docker socket access = root

What you’d expect: Mounting /var/run/docker.sock into a container is a clean way to let it manage other containers.

What happens: A compromise of that container is a full root compromise of the host. The container can docker run -v /:/host -it ubuntu and have full read/write of the host filesystem as root.

Why: The daemon runs as root. Anyone with API access can ask it to run a privileged container with / mounted. There’s no “limited” mode for the API.

How to handle it: Don’t mount the docker socket into containers unless you fully trust the workload. For “needs to manage containers” use cases, prefer rootless Docker, Podman (daemonless), or carefully scoped tooling. CI runners that mount the socket should treat each job as essentially a root user on the runner host.

7.9 Storage fills up silently

What you’d expect: Docker manages its own disk usage.

What happens: Three months in, your disk is full. df says /var/lib/docker is 200GB.

Why: Stopped containers, dangling images (built then re-tagged), unused volumes, and especially unbounded log files in /var/lib/docker/containers/.../*-json.log accumulate forever. Docker doesn’t garbage-collect on its own.

How to handle it: Run docker system prune periodically. Configure log rotation in /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

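These log-opts defaults apply only to containers created after the daemon is restarted with the new config; existing containers keep their old log settings until they are recreated. In production, ship logs off the host and use a logging driver that doesn’t write locally. Monitor /var/lib/docker size like you would /.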
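
With rotation configured, the worst-case log footprint becomes simple arithmetic: max-size times max-file times the number of containers. A minimal sketch follows; log_budget_mb is a hypothetical helper and the container count of 40 is an assumed example figure.

```shell
# Worst-case json-file log usage in MB: max-size (MB) x max-file x containers.
log_budget_mb() {
  echo $(( $1 * $2 * $3 ))
}

# The daemon.json values shown earlier (10m, 3 files) on a host with
# 40 containers (assumed figure).
log_budget_mb 10 3 40
```

This prints 1200, so logs on that host are capped at roughly 1.2 GB instead of growing without bound.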

7.10 Bind mounts on Mac/Windows are slow

What you’d expect: Bind-mounted source code on macOS performs like it does on Linux.

What happens: File watchers fire 200ms late, builds take 3x longer, occasional weirdness with symlinks and case sensitivity.

Why: Mental Model 4. On Mac/Windows, Docker runs in a hidden Linux VM. Bind mounts cross the host-VM boundary via filesystem sharing protocols (gRPC FUSE, virtiofs, SMB depending on backend). Every file read/write is synchronized across that boundary.

How to handle it: Use named volumes for node_modules, vendor/, build caches: anything that doesn’t need to be edited on the host. Enable Docker Desktop’s VirtioFS backend (faster than the older alternatives). Don’t count on the :cached or :delegated mount consistency flags; recent Docker Desktop releases accept them but ignore them. For heavy workloads, run Docker on a real Linux machine (even a small cloud VM) and point your CLI at it via SSH.

8. The Judgment Calls

These are the decisions experienced engineers actually agonize over. Each one has real tradeoffs, and the “right” answer depends on context that a tutorial can’t give you.

8.1 Alpine vs Debian-slim base

Situation: Picking a base image. Alpine is ~5 MB; debian:bookworm-slim is ~80 MB; debian:bookworm is ~120 MB.

Option A: Alpine. Smallest image, fast pulls, cleaner CVE scans.

Option B: Debian-slim. Slightly larger, far better library compatibility.

What experienced engineers actually do: Default to a Debian slim image (debian:bookworm-slim, or the slim variant of your language’s official image: python:3.12-slim, node:20-slim). Alpine uses musl libc instead of glibc, which periodically breaks things in subtle ways: DNS resolution in Go programs, locale handling, native Python packages that ship glibc binaries. The size savings rarely matter as much as the debugging time saved by not chasing musl-specific bugs. If your image is going to be 200 MB anyway because of your dependencies, saving 70 MB on the base isn’t worth it. Reach for Alpine when image size genuinely dominates (high-density deployments, very minimal services in pure Go) and when you’ve validated that your stack works on musl.

8.2 Volume vs bind mount

Situation: Your container needs persistent state.

Option A: Volume (-v dataname:/path). Docker manages the storage location.

Option B: Bind mount (-v /host/path:/path). You specify the host path.

Choose volume when: Production data, databases, anything you want Docker to manage and back up consistently. Cross-platform (Linux/Mac/Windows) deployments. You don’t care where it lives on disk.

Choose bind mount when: Development source code (live reloading). Sharing config files from the host. Reading host data the container needs. Scenarios where another (non-Docker) tool needs to also touch the data.

The signal: Will another non-Docker process on the host read or write this data? If yes, bind mount. If no, volume. Production databases use volumes. Development directories use bind mounts. Almost no production workload should bind-mount data.

8.3 Single-stage vs multi-stage build

Situation: Building an image for an app that has a build step (compilation, bundling, transpilation).

Option A: Single stage. One FROM. Builder tools end up in the final image.

Option B: Multi-stage. Separate FROM for build and runtime; COPY --from the artifact across.

What experienced engineers actually do: Multi-stage every time, except for trivial scripts. The size difference is large (often 5-10x), the security difference is larger (no compiler in production = much smaller attack surface), and the cost is just a few extra Dockerfile lines. The only place where single-stage is defensible is for development containers where you actually want the toolchain present. For production: multi-stage, distroless or alpine final stage, non-root USER.

8.4 ENTRYPOINT vs CMD

Situation: Defining what the container runs.

Option A: CMD only. Easy to override at run time.

Option B: ENTRYPOINT + CMD. ENTRYPOINT is the program; CMD provides default arguments.

What experienced engineers actually do: Use ENTRYPOINT for the binary, CMD for default arguments. This makes the image behave like a binary you can pass arguments to: docker run myimage --help works naturally. Use plain CMD only when you genuinely want the user to be able to replace the entire command (e.g., a base development image where people will run shells, scripts, etc.). Always use exec form (JSON array) for both — never the shell form, because it inserts /bin/sh -c and breaks signal handling.
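
The interaction can be sketched as a tiny function. effective_command is a hypothetical helper, not a Docker command, and it models exec form only: arguments passed to docker run replace CMD, and ENTRYPOINT, if set, is always prepended.

```shell
# effective_command ENTRYPOINT CMD RUN_ARGS
# Mimics how the final command line is assembled in exec form.
effective_command() {
  entrypoint=$1
  cmd=$2
  run_args=$3
  args=${run_args:-$cmd}        # run-time arguments replace CMD entirely
  if [ -n "$entrypoint" ] && [ -n "$args" ]; then
    echo "$entrypoint $args"
  else
    echo "$entrypoint$args"
  fi
}

effective_command "myapp" "--port 8080" ""          # docker run myimage
effective_command "myapp" "--port 8080" "--help"    # docker run myimage --help
effective_command ""      "myapp --port 8080" "sh"  # CMD-only image: docker run myimage sh
```

The three calls print "myapp --port 8080", "myapp --help", and "sh". The last case shows why a CMD-only image suits development bases: whatever the user types replaces the whole command.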

8.5 Compose vs Kubernetes (or Swarm, or nothing)

Situation: You’ve outgrown a single docker run. What now?

Option A: Docker Compose. Single host, simple multi-container.

Option B: Kubernetes. Multi-host orchestration with scaling, healing, networking.

Option C: Cloud-managed (ECS, Cloud Run, Fargate). Containers without orchestrating yourself.

What experienced engineers actually do: Stay on Compose for as long as honestly possible. Compose handles 80% of real-world workloads — small SaaS products, internal tools, dev environments, single-machine deployments — with 5% of the operational complexity of Kubernetes. The cost of Kubernetes (ongoing cluster operation, certificate management, network plugins, security policies, RBAC, the YAML sprawl) is real and should not be paid until the requirements demand it: multi-machine, autoscaling, zero-downtime deploys, multi-tenancy. If you don’t have those needs, Kubernetes is over-engineering. If you have them and you’re a small team, use a managed Kubernetes (GKE, EKS) or skip it for a managed container platform (Cloud Run, Fargate). Self-hosting Kubernetes is a serious operational commitment that should be undertaken with eyes open.

The signal to leave Compose: You’re hand-running multiple Compose hosts. You need to deploy without downtime. You need autoscaling. You’re the only person who understands the deployment scripts.

8.6 Run as root vs non-root

Situation: Setting the user inside the container.

Option A: Default (root). Easiest: no permission issues.

Option B: USER directive with a non-root user. Slightly harder to set up.

What experienced engineers actually do: Always non-root in production. The cost is a one-time Dockerfile pattern: a RUN useradd -r app line followed by a separate USER app instruction (USER is a Dockerfile instruction, not a shell command, so it cannot be chained onto RUN with &&). The benefit is that a compromised app no longer runs as UID 0 inside the namespace. Combined with dropping capabilities (--cap-drop=ALL plus targeted adds), this dramatically reduces what an attacker can do post-exploitation. The only legitimate reasons to run as root are genuine privileged operations (binding to ports below 1024, mounting filesystems, raw network access), and most of those have non-root workarounds (use port 8080 and let the host map it; use capabilities instead of full root).

8.7 Building on the same machine you deploy to vs a CI registry workflow

Situation: How do builds get to production?

Option A: Build on production hosts (legacy pattern: git pull && docker build).

Option B: Build in CI, push to registry, pull on production.

What experienced engineers actually do: Always build in CI and pull on production. Building on production hosts couples the build environment to the runtime environment, makes builds non-reproducible (network access, time of apt update, host-specific quirks), and means a build failure can take down production capacity. Build artifacts (images) should be pinned to digests and treated as immutable. Push to a registry. Pull by digest. The build machine and the runtime machine should be separable concerns.

8.8 `:latest` vs immutable tags vs digests

Situation: How to reference images.

Option A: myapp:latest. Always points to the most recent push.

Option B: myapp:1.2.3 or myapp:git-sha-abc. Immutable by convention: one tag per release, never re-pushed.

Option C: myapp@sha256:.... Truly content-addressed, cannot move.

What experienced engineers actually do: Tags by git SHA or semver in CI/CD pipelines. Digests in production manifests for true immutability. Never :latest for anything that survives a single shell session. Tags are pointers; digests are addresses. If you need to know “what is actually running right now,” only the digest answers honestly.

8.9 Stateful workloads in containers (databases, etc.)

Situation: Should the production Postgres run in a container?

Option A: Yes, in a container with a volume.

Option B: No, run it on the host or use a managed service (RDS, Cloud SQL).

What experienced engineers actually do: For production: managed service almost always. The operational pain of running a stateful database in containers (backups, replication, failover, version upgrades, disk growth, performance tuning) is enormous, and a managed service does it better than you will. For development and CI: containers are perfect — ephemeral, easy to spin up, version-pinned. The right answer changes by environment, not by ideology. Containers are great for stateless services and great for ephemeral state; for production durable state, they are a tool, not a solution.

8.10 BuildKit vs classic builder

Situation: Configuring Docker’s build system.

Option A: Classic (legacy).

Option B: BuildKit (default in modern Docker).

What experienced engineers actually do: BuildKit. Always. Faster, better caching (including remote cache), parallel stage execution, secret mounts, SSH forwarding, better build output. There is no reason to use the classic builder in 2026. BuildKit has been the default builder since Docker Engine 23.0; on older engines, set DOCKER_BUILDKIT=1 or use docker buildx.

8.11 Container image scanning: do it, ignore it, or block on it?

Situation: Your scanner reports 200 CVEs in your base image.

Option A: Block deploys on any high-severity CVE.

Option B: Track them, fix them when convenient.

Option C: Ignore the report; it’s noise.

What experienced engineers actually do: Track them, prioritize by exploitability, fix on a cadence. The scanner-output-as-a-blocker approach sounds nice but breaks down: many CVEs are in libraries the running code never calls, and blocking deploys on these creates pressure to silence the scanner. The right model is: minimize the base (distroless, scratch where possible) so the scanner has less to find; rebuild base images on a schedule (weekly) to pick up upstream patches; treat scanning as one signal among many, not a gate. CVEs in compiled-out code aren’t real risk; CVEs in actively exploited paths are. A staff engineer doesn’t worship the scanner.

8.12 Docker Desktop vs Lima/Colima/OrbStack on Mac

Situation: You’re on a Mac and need a Docker daemon.

Option A: Docker Desktop. Easy, official, GUI. License fee for larger orgs.

Option B: Lima/Colima/OrbStack/Rancher Desktop. Alternatives running their own VM; most are open source, while OrbStack is a commercial product.

What experienced engineers actually do: Increasingly the alternatives. OrbStack in particular has earned a reputation for being faster and lighter than Docker Desktop. Colima is the rugged-CLI option. The Docker Desktop license model (paid for companies above 250 employees or $10M revenue) makes the alternatives meaningful for organizations beyond a few people. The CLI is identical; the daemon is identical. The choice is mostly about VM/UI quality and licensing.

9. The Commands/APIs That Actually Matter

The 80% you’ll use 80% of the time, organized by task. Not a cheatsheet — a curated set with the patterns experienced users reach for.

Building images

# Build an image with a tag, from current directory
docker build -t myapp:0.1 .

# Build a specific stage of a multi-stage Dockerfile
docker build --target builder -t myapp:builder .

# Build with build args (useful for version pins)
docker build --build-arg APP_VERSION=1.2.3 -t myapp:1.2.3 .

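# Use BuildKit features (cache mounts, secret mounts).
# BuildKit is the default since Docker Engine 23.0; the variable only matters on older engines.
DOCKER_BUILDKIT=1 docker build .

# With buildx, build for multiple architectures (amd64 + arm64).
# --push needs a registry-qualified tag; ghcr.io/me/myapp is the same example namespace used in the cache commands below.
docker buildx build --platform linux/amd64,linux/arm64 -t ghcr.io/me/myapp:0.1 --push .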

# Use a remote build cache to speed up CI
docker buildx build \
  --cache-from=type=registry,ref=ghcr.io/me/myapp:cache \
  --cache-to=type=registry,ref=ghcr.io/me/myapp:cache,mode=max \
  -t myapp:0.1 .

The cache-from/cache-to pair is what makes CI builds fast. Without it, every CI run starts with a cold cache and rebuilds everything.

Running containers

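# Run with all the production-hygiene flags you actually want:
#   --restart unless-stopped   restart on crash or daemon/host restart, unless you stopped it yourself
#   --memory / --cpus          cgroup limits
#   --user 1000:1000           not root
#   --read-only + --tmpfs      rootfs is read-only; tmpfs supplies the writable paths
# (Comments live up here because a comment after a trailing backslash breaks the line continuation.)
docker run -d \
  --name myapp \
  --restart unless-stopped \
  --memory=512m --cpus=0.5 \
  --user 1000:1000 \
  --read-only \
  --tmpfs /tmp:size=64m \
  -p 8080:80 \
  -e DB_HOST=db \
  -v myapp-data:/var/lib/myapp \
  --network app-net \
  myapp:1.2.3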

# One-off interactive shell
docker run --rm -it --entrypoint bash myapp:1.2.3

# Exec into a running container
docker exec -it myapp bash

# Override the entrypoint without an interactive shell
docker run --rm --entrypoint /bin/echo myapp:1.2.3 "hello"

--restart unless-stopped is the right default for daemon-style services: restart on crash, restart when the host reboots, but don’t restart if you explicitly stopped it. --read-only plus --tmpfs is a strong defense-in-depth measure that costs nothing for stateless services.

Inspecting and debugging

# What's running, with formatting
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Logs (always use --tail to avoid drowning)
docker logs --tail=100 -f myapp

# Full inspect — exhaustive JSON of container state
docker inspect myapp

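# Just one field via Go template. The top-level .NetworkSettings.IPAddress is only
# populated on the default bridge, so iterate over the attached networks instead.
docker inspect myapp --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}'
docker inspect myapp --format '{{json .Mounts}}' | jq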

# Live resource usage
docker stats myapp

# What changed in the writable layer?
docker diff myapp

# Image layer history (useful for spotting bloat or accidental secrets)
docker history myapp:1.2.3
docker history --no-trunc myapp:1.2.3 | grep -i secret  # paranoid check

docker inspect with --format is your friend when scripting. Don’t grep the JSON; use templates.

Networking

docker network create app-net
docker network create --driver bridge --subnet 10.10.0.0/16 mynet  # custom subnet
docker network ls
docker network inspect app-net
docker network connect app-net myapp     # add to a second network
docker network disconnect bridge myapp   # remove from one

Volumes

docker volume create pgdata
docker volume ls
docker volume inspect pgdata
docker volume rm pgdata

# Backup a volume (mount it into a one-shot container that tars it)
docker run --rm -v pgdata:/data -v "$PWD":/backup alpine \
  tar czf /backup/pgdata.tar.gz -C /data .

# Restore
docker run --rm -v pgdata:/data -v "$PWD":/backup alpine \
  tar xzf /backup/pgdata.tar.gz -C /data

The “use a throwaway container with two mounts to copy data” pattern is the universal escape hatch when you need to do something Docker doesn’t have a direct command for.

Cleanup

docker system df                              # see what's using space
docker system prune                           # remove stopped containers, dangling images, unused networks
docker system prune -a                        # also remove unused images (not just dangling)
docker system prune -a --volumes              # nuclear; also unused anonymous volumes (named ones too on engines before 23.0)
docker container prune
docker image prune
docker volume prune
docker builder prune                          # buildkit cache (often huge)

docker system df shows the breakdown — images, containers, volumes, build cache. The build cache in particular grows aggressively and is rarely visible to users; check it.

Compose

docker compose up -d                          # bring up in background
docker compose down                           # tear down (keeps volumes by default)
docker compose down -v                        # also delete volumes (destructive)
docker compose logs -f service_name
docker compose exec service_name bash
docker compose ps
docker compose pull                           # pull latest images for all services
docker compose build --no-cache               # rebuild without cache
docker compose restart service_name
docker compose run --rm service_name command  # one-off task with same config

docker compose run --rm is how you run database migrations, one-off scripts, etc., using the same image and environment as your real services.

Registry operations

docker login ghcr.io
docker tag myapp:0.1 ghcr.io/myorg/myapp:0.1
docker push ghcr.io/myorg/myapp:0.1
docker pull ghcr.io/myorg/myapp:0.1

# Pull by digest for true immutability
docker pull ghcr.io/myorg/myapp@sha256:abc123...

# Inspect a remote image without pulling it
docker buildx imagetools inspect ghcr.io/myorg/myapp:0.1

10. How It Breaks

The failure modes you’ll actually hit, and how to think about debugging them. Each one connects back to the architecture of Section 6 — when things break, your map of where state lives tells you where to look.

Failure: Container won’t start

Symptoms: docker run returns immediately. docker ps -a shows status Exited (n) where n is some non-zero code.

How to diagnose: docker logs <container>. Most of the time the answer is right there: missing env var, can’t bind to port, can’t find a file. If logs are empty, the process crashed before producing output — usually a missing binary or library. Try docker run --rm -it --entrypoint sh <image> to poke around.

Common causes: Missing required env var. Port already bound on the host. Volume mount path doesn’t exist or has wrong permissions. Architecture mismatch (the image was built for arm64 but the host is amd64, which is common when images are built on M-series Macs and deployed to x86 servers).
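
For the architecture case, a quick check is to translate the host’s uname -m into Docker’s platform naming and compare it with the image. docker_platform below is a hypothetical helper that covers only the two architectures discussed here.

```shell
# Map a machine name from `uname -m` to Docker's platform string.
# Hypothetical helper; only amd64 and arm64 are handled.
docker_platform() {
  case $1 in
    x86_64|amd64)  echo linux/amd64 ;;
    aarch64|arm64) echo linux/arm64 ;;
    *)             echo unknown ;;
  esac
}

# What this host runs natively:
docker_platform "$(uname -m)"

# What the image was built for (needs Docker, so shown as a comment):
#   docker image inspect --format '{{.Os}}/{{.Architecture}}' <image>
```

If the two strings differ, the container either fails with an exec format error or runs under emulation. Rebuild with --platform set to the host’s value.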

Failure: Container starts but my app isn’t reachable

Symptoms: docker ps shows it running. curl localhost:8080 hangs or refuses.

How to diagnose: First, docker logs to confirm the app actually started and is listening. Then check what address it bound to inside the container — apps that bind to 127.0.0.1 are unreachable from outside the container. They must bind to 0.0.0.0. Then check docker port <container> to see what’s published. Finally, check from inside the container that it’s listening: docker exec <container> netstat -tlnp (or ss -tlnp).

Common causes: App bound to 127.0.0.1 instead of 0.0.0.0. EXPOSE in Dockerfile but no -p flag at runtime — EXPOSE is documentation; it doesn’t publish ports. Container on a network that’s not the default bridge. Firewall on the host blocking the port.

Failure: Two containers can’t talk to each other

Symptoms: Container api can’t reach container db by name.

How to diagnose: Check whether they’re on the same user-defined network. docker inspect <container> --format '{{json .NetworkSettings.Networks}}' shows what networks the container is on. If they’re on the default bridge, name resolution doesn’t work — only IP. Move them both to a user-defined network.

Common causes: Default bridge network, no DNS. Different networks. Wrong service name (the DNS name is the container name or the service name in Compose, not the hostname inside the container).

Failure: `docker stop` is slow

Symptoms: Container takes 10 seconds to stop.

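How to diagnose: This is the PID 1 signal problem. docker stop sends SIGTERM to PID 1 in the container, waits 10 seconds by default, then sends SIGKILL. If your app never receives or never handles the SIGTERM, you wait out the full grace period. Confirm by checking what’s actually PID 1 inside: docker exec <container> ps -ef. If it’s /bin/sh -c your-command, you’re using shell-form CMD, and the shell is not forwarding the signal to your app. If it’s your app and it’s still ignoring SIGTERM, your app doesn’t have a handler, and the kernel does not apply the default terminate action to PID 1.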

Fix: Use exec-form CMD/ENTRYPOINT. Make your app handle SIGTERM. Use tini if needed.
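
Both halves of the fix can be demonstrated outside Docker with ordinary child shells. This is a sketch of the mechanism rather than a container test: the first part shows that exec keeps the same PID (so an entrypoint script ending in exec hands PID 1 to the app), and the second shows a process with a SIGTERM handler exiting cleanly.

```shell
# 1. exec replaces the shell instead of forking, so the app inherits the
#    wrapper's PID. Inside a container that PID would be 1.
pids=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
wrapper_pid=$(printf '%s\n' "$pids" | sed -n 1p)
app_pid=$(printf '%s\n' "$pids" | sed -n 2p)
echo "wrapper pid: $wrapper_pid, app pid after exec: $app_pid"

# 2. A process that installs a SIGTERM handler can shut down cleanly.
#    The child shell traps TERM, signals itself, and exits 0 from the handler,
#    so the sleep and the final echo never run.
result=$(sh -c 'trap "echo clean shutdown; exit 0" TERM; kill -TERM $$; sleep 5; echo "never reached"')
echo "$result"
```

In a wrapper entrypoint script the same idea is a final line of exec "$@", which turns the shell into the app instead of leaving the shell sitting in front of it.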

Failure: Image build is slow / cache isn’t working

Symptoms: Builds take forever even when very little changed.

How to diagnose: Watch the build output. Each step should say CACHED if its inputs were unchanged. The first step that says it’s running again is your cache-bust. Look at what feeds into that step.

Common causes: COPY . . near the top — every file change invalidates everything. apt-get update and apt-get install in separate RUN commands — the cached update is from days ago and install fails on stale package indexes. Build args that change every build (like a build timestamp) burning the cache. Lack of a .dockerignore, so node_modules or .git size and contents change between builds.

Fix: Reorder Dockerfile, slow-changing things first. Combine update and install in one RUN. Add .dockerignore.

Failure: Out of disk space

Symptoms: no space left on device from Docker. df -h shows /var/lib/docker is huge.

How to diagnose: docker system df shows the breakdown. Usually the culprit is one of: unbounded container logs, leftover stopped containers, dangling images, build cache, unused volumes.

Fix: Configure log rotation in daemon.json (Section 7.9). Run docker system prune -a --volumes (carefully — this deletes anything not currently in use). For build cache specifically, docker buildx prune. Set up a cron job for periodic cleanup on long-lived hosts.

Failure: OOMKilled

Symptoms: docker inspect shows "OOMKilled": true. Container just disappeared.

How to diagnose: This means the kernel killed the process for exceeding its memory cgroup limit. Your --memory was too low, or your app has a memory leak, or the workload is genuinely too heavy for the limit. Look at docker stats while it’s running to see the curve.

Fix: Raise the limit, or fix the leak, or downsize the workload. Don’t run unbounded — at least the OOMKill is contained to one container instead of taking out the host.
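
An OOM-killed container usually shows up in docker ps -a as Exited (137). That number is just the shell convention of 128 plus the signal number, and SIGKILL is signal 9. The sketch below reproduces it with an ordinary child process, no Docker required.

```shell
# A process killed by SIGKILL (signal 9) is reported with status 128 + 9.
status=0
sh -c 'kill -KILL $$' || status=$?
echo "exit status after SIGKILL: $status"
```

Exit code 137 alone does not prove an OOM kill, because docker stop also ends in SIGKILL when the grace period expires. The OOMKilled field in docker inspect is what distinguishes the two.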

Failure: Docker Desktop / daemon is unresponsive

Symptoms: docker ps hangs. Compose hangs.

How to diagnose: The daemon is wedged. Check its logs (journalctl -u docker on Linux, ~/Library/Containers/com.docker.docker/Data/log/ on macOS). Common causes: bad networking driver state, disk full, internal deadlock from too many containers.

Fix: Restart the daemon (sudo systemctl restart docker or restart Docker Desktop). On Mac specifically, the underlying VM occasionally needs a “Reset to factory defaults” — annoying but fast. This is a real failure mode of Mental Model 4: the daemon is a single point of failure.

General debugging workflow

When something is wrong and you don’t know what:

docker ps -a — is it actually running? what’s its status?
docker logs --tail=200 <container> — what did it say?
docker inspect <container> — what’s its actual configuration? (env vars, mounts, network)
docker exec -it <container> sh — get inside and look around
docker stats <container> — is it CPU/memory bound?
docker events (in another terminal, before reproducing) — get a stream of daemon-level events

If you can’t exec because the container won’t stay up, run a copy with the entrypoint overridden: docker run --rm -it --entrypoint sh <image>. Now you can poke around the image as if you were root inside it.

11. The Downsides / Disadvantages

This is the section the evangelists skip. Docker has real, structural costs that don’t go away with experience or better config. Walk in with eyes open.

11.1 Containers are not a real security boundary.

The shared kernel means a kernel exploit in one container is a kernel exploit on the host and every other container. Docker provides isolation, not security. Real isolation for adversarial workloads requires a real boundary: VMs, gVisor, Kata Containers, Firecracker. The phrase “Docker is secure by default” is repeated constantly and is misleading: Docker is isolated by default, which is not the same thing.

Where it comes from: Mental Model 3. Selective isolation, single shared kernel. This is the entire point of containers — they’re cheap because they share a kernel — but it means the security boundary is the kernel, not the container.

What it costs you: Multi-tenant SaaS, customer-supplied code, plugin ecosystems, anything that runs untrusted code in your containers must use a stronger sandbox. The “we use containers, so we’re fine” reasoning has produced multiple high-profile breaches.

When it’s a dealbreaker: Hosting customer code, untrusted user uploads being processed by your service, multi-tenant compute platforms. Use gVisor, Firecracker, or actual VMs.

When you can live with it: Your own first-party services in your own infrastructure, where everyone running in containers is trusted code. Which is most workloads.

What people think mitigates it but doesn’t: Running as non-root inside the container. Dropping capabilities. Seccomp profiles. These all reduce the attack surface, but none of them addresses the fundamental fact that one kernel CVE is one kernel CVE. They’re defense in depth, not a real boundary.

11.2 The Docker daemon is a privileged single point of failure.

dockerd runs as root. It owns every container on the host. Anyone with API access has root. If it crashes or hangs, your containers may keep running, but you can’t manage them — no logs, no restarts, no deploys, no scaling.

Where it comes from: Mental Model 4. The daemon-centric architecture is the original Docker design choice. It made the developer experience great in 2013 but has aged poorly.

What it costs you: Operational risk concentrated in one process. Security model where “access to docker.sock = root” pervades CI, dev tools, and orchestration code. Slower than necessary in some operations because everything funnels through the daemon’s API.

When it’s a dealbreaker: Highly regulated environments, multi-tenant CI runners, security-paranoid orgs. They reach for Podman (daemonless, can run rootless natively), runc directly, or Kubernetes running containerd or CRI-O (where the runtime is closer to the kernel and the Docker daemon is bypassed).

What people think mitigates it but doesn’t: Rootless Docker — works, but is a second-class experience with subtle differences in networking, storage, and capabilities. Useful but not the default and not seamless.

11.3 Image size and pull time at scale.

The pretty “containers are tiny” story breaks down once your real images include a Python runtime, a JVM, ML models, and platform libraries. A “small” production image is often 500 MB to several GB. Pulling that image to a new host costs minutes. Across thousands of nodes, your network bill is real.

Where it comes from: Mental Model 2. Images are stacks of filesystem layers that together carry a complete userland. There is no “shared library across images” beyond the layer-deduplication trick, and that helps less than you’d hope, because most teams have heterogeneous base images.

What it costs you: Slow autoscaling (new node = slow first request because image is pulling). Slow CI (pull + build + push, repeat). Real bandwidth costs at scale. Cold-start latency for serverless containers.

When it’s a dealbreaker: Edge deployments with thin uplinks. Functions-as-a-service where cold-start matters. High-density hosts with thousands of containers from many different bases.

Can be improved by: Multi-stage builds, distroless bases, regional registries with image caching, P2P image distribution (Dragonfly, Kraken). These help; they don’t fundamentally change the size of the artifact.

11.4 Storage drivers are a ghost in the machine.

OverlayFS works, mostly. But “mostly” is doing real work in that sentence. There are corners: copy-up performance penalty on first-write to a large file in a lower layer; subtle edge cases with rename(2) across layers (EXDEV); inconsistent behavior across filesystems (xfs with ftype=0 doesn’t work; ZFS needs different drivers); broken inotify semantics in some configurations; ten years of accumulated bug reports.

Where it comes from: OverlayFS is a kernel feature with constraints that don’t match a normal filesystem.

What it costs you: Weird performance bugs that are hard to diagnose because they’re a layer underneath your application. “Why is this database slow in a container” is often “you’re writing to the writable layer, copy-up is killing you.”

When it’s a dealbreaker: Heavy-write workloads on the writable layer. Use a volume.

11.5 Non-determinism in builds.

apt-get install python3 today is not the same as apt-get install python3 next month. Your Dockerfile that worked perfectly six months ago can fail to build today, or build successfully but produce a different image. Docker doesn’t lock package versions across the OS layer; it’s not Nix, it’s not Bazel.

Where it comes from: Dockerfiles are imperative scripts that go fetch arbitrary stuff from the internet. The build cache keys a RUN step on the instruction text and its parent layer, not on what the command downloads, so the inputs aren’t pinned.

What it costs you: Reproducibility theater. Your CI proves the build works today. The same Dockerfile, no changes, may fail to build six months from now, or quietly produce a different image. You’ll discover this when you need to rebuild.

When it’s a dealbreaker: Long-lived services where you need to be able to reproduce a 2-year-old build for an audit, regulatory inquiry, or security investigation. You either lock down every package version (painful) or accept that “rebuild the old version” doesn’t reliably work.
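One low-cost way to see how exposed a Dockerfile is to drift is to scan it for inputs that are allowed to float. The sketch below is a hypothetical helper (the name unpinned_inputs is not part of any Docker tooling) and it is deliberately simplified: it only looks at FROM lines and single-line apt-get install commands, it does not follow backslash continuations, and it treats a version tag as “pinned enough” even though tags are mutable and only a digest is truly fixed.

```python
import re

FROM_RE = re.compile(r"^FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?", re.IGNORECASE)
SHELL_SEPARATORS = {"&&", "||", ";", "\\"}

def unpinned_inputs(dockerfile_text):
    """Return (line_number, reason) pairs for inputs that can drift between builds."""
    findings = []
    stage_names = set()  # aliases from earlier "FROM ... AS name" lines
    for number, raw in enumerate(dockerfile_text.splitlines(), start=1):
        line = raw.strip()
        match = FROM_RE.match(line)
        if match:
            image, alias = match.group(1), match.group(2)
            # Drop any registry host, which may itself contain ':' (host:port).
            name = image.rsplit("/", 1)[-1]
            if image.lower() == "scratch" or image.lower() in stage_names:
                pass  # not pulled from a registry
            elif "@sha256:" in image:
                pass  # pinned by digest
            elif ":" not in name or name.endswith(":latest"):
                findings.append((number, f"base image '{image}' uses a floating tag"))
            if alias:
                stage_names.add(alias.lower())
            continue
        install = re.search(r"\bapt-get\s+install\b(.*)", line)
        if install:
            floating = []
            for word in install.group(1).split():
                if word in SHELL_SEPARATORS:
                    break  # the next shell command starts here
                if not word.startswith("-") and "=" not in word:
                    floating.append(word)  # package named without "=version"
            if floating:
                findings.append((number, "apt packages without versions: " + ", ".join(floating)))
    return findings
```

Run against the first three lines of a typical beginner Dockerfile (FROM ubuntu, an apt-get update, an apt-get install of unversioned packages), it flags the base image and the install line. An empty result does not prove the build is reproducible; it only means the obvious floats are gone.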

11.6 Operational complexity at scale.

Docker is simple for one container. Twenty containers need Compose. Twenty hosts need an orchestrator. A hundred services need a registry strategy, image scanning, signed images, network policies, secrets management, logging pipelines, monitoring, autoscaling, multi-AZ deployment. Every step adds tooling and cognitive load. The “containers will simplify deployment” promise is true for a small system; it inverts at scale.

Where it comes from: The Docker abstraction is just the running container. Everything around it — discovery, scheduling, secrets, networking across hosts, durable storage — is your problem. The ecosystem has grown to fill the gaps, but the gap-filling tools (Kubernetes alone is enormous) come with their own learning curves.

What it costs you: Real headcount. A serious Kubernetes deployment has a platform team. You can’t run a non-trivial container-orchestrated system as a side project.

When it’s a dealbreaker: Small teams who don’t need the scale. The honest answer for “should we use Kubernetes” for a 5-person startup with 50 customers is “no.” Single-host Compose or a managed PaaS will get you years of runway with a fraction of the ops burden.

11.7 Docker Desktop license model.

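Docker Desktop requires a paid subscription for commercial use in organizations with more than 250 employees or more than $10 million in annual revenue. The Docker Engine and CLI remain free and open source; Desktop is the part that costs. This is a real cost line for medium and large companies, not a rounding error.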

Where it comes from: Docker, Inc.’s commercial pivot in 2021, after years of struggling to monetize the open-source product.

What it costs you: Per-developer license fees, procurement overhead, sometimes friction with security teams who don’t want to install Desktop’s components. Or: the cost of migrating to an alternative (Colima, OrbStack, Rancher Desktop, Podman Desktop). These work fine, but the migration is real engineering time.

When it’s a dealbreaker: Mid-size organizations doing a budget review. Many have already migrated.

11.8 The Windows containers fork is a separate world.

Windows containers exist, but they’re a parallel universe. Different image format (mostly), different base images, different driver, different size profile (multi-gigabyte base images for Windows Server Core), different licensing. “Container” without qualifier means “Linux container,” and most ecosystem tooling assumes Linux.

Where it comes from: Windows doesn’t have the same kernel primitives Linux does; Microsoft built a parallel implementation.

What it costs you: If you have legacy Windows software you want to “containerize,” the experience is rougher and the community much smaller. If you’re targeting Linux containers, ignore this — the Windows fork doesn’t affect you.

11.9 Logs are structurally inadequate for production.

Docker’s logging contract is “everything goes to stdout/stderr; we capture it to a file.” Fine for development. Inadequate for production: logs are unstructured, the default json-file driver does not rotate them so they can fill the disk, enabling rotation means old lines are simply dropped, there is no retention policy, and they are hard to query across containers. Every serious deployment needs a logging pipeline (Fluentd/Vector → Elasticsearch/Loki/Datadog) on top.

Where it comes from: Mental Model 1. A container is a process; a process’s logs are stdout/stderr. There’s nothing more.

What it costs you: A real logging pipeline is a separate system to build, run, and maintain. Without one, you find out about production issues by ssh-ing to hosts and running docker logs.
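Since stdout is the whole contract, the one thing the application itself can do is make each line machine-parseable before a pipeline picks it up. Below is a minimal sketch using only the Python standard library. The field names (ts, level, logger, msg) and the build_logger helper are illustrative assumptions, not a schema that Docker or any particular log shipper requires.

```python
import json
import logging
import sys

class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON line instead of
            # spilling it across several unstructured lines.
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)

def build_logger(name="myapp", stream=sys.stdout):
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as build_logger().info("order %s shipped", 42) then emits one JSON object per event, which a collector can index without regex parsing. This does not replace the pipeline; it only makes what the pipeline receives queryable.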

11.10 The ecosystem’s “official image” problem.

Pulling mysql:latest or redis:latest feels safe: they’re “official” images! But “official” on Docker Hub means a curated set of repositories that Docker reviews and publishes together with upstream maintainers, not “audited by the project’s security team” and not “hardened for your production.” Many official images run as root, ship with old packages, and have CVE counts that are alarming if you actually look. The default culture around images is “it works,” not “it’s secure.”

Where it comes from: Docker Hub is a public registry with a low quality bar, and the community filled it with images optimized for getting people to “hello world” fast.

What it costs you: Production deployments built on official images often inherit known vulnerabilities silently. Hardened-image options (Chainguard’s Wolfi-based images, Red Hat UBI) exist precisely to fill this gap. For real production, treat official images as starting points to build your own from, not finished products.

12. The Taste Test

Docker code review red flags vs green flags. What separates someone who’s been doing this for two years from someone who watched a tutorial last week.

Bad Dockerfile (beginner)

FROM ubuntu
RUN apt-get update
RUN apt-get install -y python3 python3-pip nodejs npm git curl wget vim build-essential
WORKDIR /app
COPY . /app
RUN pip3 install -r requirements.txt
RUN npm install
EXPOSE 3000
CMD python3 app.py

Red flags everywhere: untagged base (Mental Model 2: :latest will move under you), apt-get update and apt-get install in separate RUNs (the cached update layer goes stale, so a later install runs against an old package index), kitchen-sink package list including vim and git in production, COPY . /app before installing dependencies (every code change invalidates the install layer), no non-root user, shell-form CMD (the app runs under /bin/sh -c, which typically does not forward signals to it), no multi-stage so build tools end up in production, and EXPOSE 3000 with nothing confirming the app actually listens on 0.0.0.0:3000 (EXPOSE documents a port; it does not publish or bind anything).
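The cache-invalidation flag is the easiest one to reason about with a toy model. The sketch below assumes a simplified rule: a step’s cache key is a hash of its parent’s key, its instruction text, and the contents of any files it copies. That is an approximation for illustration, not BuildKit’s actual algorithm, and the file names and contents are made up.

```python
import hashlib

def cache_keys(steps, files):
    """Compute one chained key per step.

    steps: list of (instruction_text, [paths the step copies in])
    files: {path: content} for the build context
    """
    keys = []
    parent = ""
    for instruction, copied in steps:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        for path in sorted(copied):
            digest.update(path.encode())
            digest.update(files[path].encode())
        parent = digest.hexdigest()
        keys.append(parent)
    return keys

def rebuilt_steps(steps, before, after):
    """Instructions whose key changes when the context goes from `before` to `after`."""
    old, new = cache_keys(steps, before), cache_keys(steps, after)
    return [instr for (instr, _), o, n in zip(steps, old, new) if o != n]

context_v1 = {"app.py": "print('v1')", "requirements.txt": "flask==3.0.0"}
context_v2 = {"app.py": "print('v2')", "requirements.txt": "flask==3.0.0"}  # code-only change

copy_everything_first = [
    ("COPY . /app", ["app.py", "requirements.txt"]),
    ("RUN pip3 install -r requirements.txt", []),
]
copy_dependencies_first = [
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY . .", ["app.py", "requirements.txt"]),
]
```

Under this model, a one-line change to app.py rebuilds both steps of copy_everything_first, including the dependency install, but only the final COPY . . of copy_dependencies_first.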

Good Dockerfile (experienced)

# syntax=docker/dockerfile:1.7

# --- builder ---
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --user --no-warn-script-location -r requirements.txt

# --- runtime ---
FROM python:3.12-slim AS runtime
RUN useradd -r -u 1000 -m app
WORKDIR /app
COPY --from=builder /root/.local /home/app/.local
COPY --chown=app:app . .
USER app
ENV PATH=/home/app/.local/bin:$PATH \
    PYTHONUNBUFFERED=1
EXPOSE 8080
ENTRYPOINT ["python", "-m", "myapp"]

Pinned base, multi-stage, BuildKit cache mount for pip (massive build speedup without bloating the image), non-root user, exec-form ENTRYPOINT, PYTHONUNBUFFERED=1 so logs flush immediately (one of those tiny details that takes a 3am debugging session to learn).

Bad Compose file (beginner)

version: '3'
services:
  web:
    build: .
    ports:
      - "80:80"
    volumes:
      - ./data:/data
  db:
    image: mysql
    environment:
      MYSQL_ROOT_PASSWORD: password

Unpinned mysql image (:latest implied), root password in plaintext in a committed file, host bind mount for production data, no network defined, no healthchecks, no resource limits, top-level version key (obsolete in the current Compose specification and ignored with a warning).

Good Compose file (experienced)

services:
  web:
    image: ghcr.io/myorg/web:1.4.2
    restart: unless-stopped
    deploy:
      resources:
        limits: { memory: 512M, cpus: '0.5' }
    ports:
      - "127.0.0.1:8080:8080"   # bound only to localhost; reverse proxy handles public exposure
    environment:
      DB_HOST: db
      DB_USER_FILE: /run/secrets/db_user
      DB_PASSWORD_FILE: /run/secrets/db_password
    secrets: [db_user, db_password]
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16.3-alpine
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_USER_FILE: /run/secrets/db_user
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets: [db_user, db_password]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $$(cat $$POSTGRES_USER_FILE)"]
      interval: 5s
      timeout: 3s
      retries: 5

volumes:
  pgdata:

secrets:
  db_user:
    file: ./secrets/db_user.txt
  db_password:
    file: ./secrets/db_password.txt

Pinned versions, bound to localhost not 0.0.0.0 (let a reverse proxy handle TLS), resource limits, healthchecks gating dependent service startup, secrets via files not env vars (env vars leak through docker inspect and process listings), volume for the database, no top-level version (modern compose schema).
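The file-based secrets in this Compose file only help if the application reads them. The postgres image handles its own *_FILE variables, but the web service has to implement the convention itself. A minimal sketch of how that might look follows; the helper name read_setting and the fallback to a plain variable are assumptions about the hypothetical app, not something Compose provides.

```python
import os
from pathlib import Path

def read_setting(name, environ=os.environ):
    """Return setting NAME, preferring the file named by NAME_FILE.

    With the Compose file above, DB_PASSWORD_FILE points at
    /run/secrets/db_password, so the secret never appears in the
    container's environment.
    """
    file_path = environ.get(f"{name}_FILE")
    if file_path:
        # Secret files usually end with a newline; strip it.
        return Path(file_path).read_text(encoding="utf-8").strip()
    value = environ.get(name)
    if value is None:
        raise KeyError(f"neither {name}_FILE nor {name} is set")
    return value
```

The app would then call read_setting("DB_USER") and read_setting("DB_PASSWORD") at startup, and read_setting("DB_HOST") falls through to the ordinary environment variable.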

Other taste markers

Image size. A Go service should be ~15 MB. A Python service should be ~80–150 MB. A Java service ~150–300 MB. If your simple service image is 1+ GB, something is wrong — almost certainly missing multi-stage or a fat base.
Image scan output. An experienced engineer’s image has dozens of CVEs at most, mostly low-severity. A beginner’s has hundreds, including “ancient ssl library nobody patched in 4 years.”
docker history. Should show clean, semantic layers. Should not contain RUN echo $SECRET > /tmp/... followed by RUN rm /tmp/... — the secret is still there.
PID 1. docker exec myapp cat /proc/1/cmdline should show your app, not /bin/sh -c. If PID 1 is the shell, signals are broken.
Time to stop. time docker stop myapp should return in a second or two. If it takes the full 10-second default timeout, the app never handled SIGTERM and Docker fell back to SIGKILL (the container then exits with code 137).
Image labels. Production images should have OCI labels: source git commit, build time, version. docker inspect should tell you exactly what’s running.
Repo discipline. A .dockerignore exists and excludes .git, node_modules, __pycache__, *.log, .env*. The Dockerfile lives at the repo root, with the application source organized so the build context is small.
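The PID 1 and stop-time markers both come down to whether the application handles SIGTERM. Here is a minimal sketch of what that could look like inside the hypothetical myapp module. The loop body is a placeholder, and the function names are illustrative rather than taken from any framework.

```python
import signal
import threading

# Set when the container runtime asks the process to stop.
stop_requested = threading.Event()

def handle_stop(signum, frame):
    """Signal handler: record the request and let the main loop exit cleanly."""
    stop_requested.set()

def install_handlers():
    """Register for SIGTERM (sent by docker stop) and SIGINT (Ctrl-C)."""
    signal.signal(signal.SIGTERM, handle_stop)
    signal.signal(signal.SIGINT, handle_stop)

def serve(poll_interval=0.5, max_iterations=None):
    """Placeholder main loop; returns how many iterations ran before stopping."""
    iterations = 0
    while not stop_requested.is_set():
        # ... handle one unit of work here ...
        iterations += 1
        if max_iterations is not None and iterations >= max_iterations:
            break
        # Sleep, but wake immediately if a stop signal arrives.
        stop_requested.wait(poll_interval)
    # ... flush buffers, close connections ...
    return iterations

def main():
    install_handlers()
    serve()
```

Calling main() from the module’s entry point, combined with the exec-form ENTRYPOINT so that Python itself is PID 1, is what lets docker stop return quickly instead of waiting out the timeout.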

13. Where to Go Deeper

Curated, opinionated. Not a list of “10 awesome Docker resources” — these are the ones worth your time.

The Docker docs, specifically the Engine architecture overview and the storage drivers section. Most of the docs are reference; the architecture sections are genuinely worth reading. Read these once in full when you start.
“Build Your Own Docker” tutorials. The hands-on exercise of building a tiny container runtime using unshare, chroot, OverlayFS, and cgroups — directly with shell commands — collapses the magic faster than any reading. Liz Rice has done this as a live-coded talk multiple times; Akash Rajpurohit has written it up at akashrajpurohit.com. Spend an afternoon on it. You will never again think a container is a VM.
Julia Evans’ container blog posts (jvns.ca). Especially “How containers work: overlayfs.” Julia’s voice is the gold standard for explaining systems concepts to people who don’t yet know them. Read everything she’s written about containers.
Jérôme Petazzoni’s “From dotCloud to Docker” (jpetazzo.github.io). One of the people who built the early infrastructure walks through the design decisions that became Docker. Historical context that explains why Docker is the way it is.
The OCI specs (github.com/opencontainers). Image-spec and runtime-spec. These are short, precise, and demystifying. Once you’ve read them, “Docker images” stop feeling proprietary — they’re just specified data structures.
Liz Rice, Container Security (O’Reilly, 2nd ed). The serious treatment of container security. Covers what the Docker docs gloss over: capabilities, seccomp, AppArmor, the realities of multi-tenant isolation. If you’re putting containers near production data, read this.
Adrian Mouat, Using Docker (O’Reilly). Slightly older but still the best general book on Docker for working engineers. Covers production patterns, networking, security in depth.
The BuildKit documentation (docs.docker.com/build/buildkit/). Less famous than the basic Docker docs but where the action is. Cache mounts, secret mounts, multi-platform builds, remote caching — the modern build pipeline lives here.
A “containers from scratch” hands-on project. Pick one: build your own minimal container in Python or Go using direct syscalls. Liz Rice’s “Containers from Scratch” Go demo is canonical. The understanding you get from this doesn’t come from any reading.

14. The Final Verdict

Docker, after twelve years and a billion docker run commands, is two things at once: a packaging format that genuinely changed how software is shipped, and a daemon-centric runtime architecture that has aged poorly. You should know both. The packaging format is the part that won; it has been standardized as OCI and adopted by every cloud, every CI system, every PaaS. The runtime architecture (root daemon, single point of failure, “docker.sock equals root”) is the part the industry is quietly leaving behind in favor of running containerd and runc directly (the same components Docker Engine itself is built on), Podman, and Kubernetes’ own runtime layer. When people say “Docker is everywhere,” they really mean OCI is everywhere; Docker the company and Docker the daemon are increasingly just one option among several for the same standard.

What Docker got profoundly right: it made developer experience a feature. The single command docker run nginx is a piece of design artistry — every choice in it is correct. It picked the right primitives (already-existing kernel features, not new ones) and wrapped them in conventions everyone could grasp (a Dockerfile is a recipe; an image is a tarball; a registry is git). And the layer model, with content-addressed storage and copy-on-write, is genuinely elegant — when you understand it, you stop being confused about a dozen otherwise-mysterious behaviors. These three design moves are why containers became table stakes within four years of Docker’s launch.

What Docker got wrong, or what it costs you: it sold a security story it couldn’t back up. Containers are isolation, not security, and the industry spent five years rediscovering this the hard way through breaches and CVE crises. The daemon-centric architecture is a 2013 design decision the codebase can’t escape from, which is why every serious production system either runs containers without Docker (Kubernetes with containerd) or works around the daemon’s limitations (rootless, image scanning, restrictive socket policies). And the at-scale operational story — pull times, image sizes, log volumes, stateful data — is much harder than the marketing implies. If your image is a gigabyte and you’re spinning up containers across a thousand nodes, you have a real distribution problem that “containers are lightweight” doesn’t solve.

Who should reach for Docker, and who shouldn’t. Reach for it when you’re shipping software you wrote yourself, you control the runtime environment, you want a reproducible artifact and a fast development inner loop, and your scale is anywhere between “one server” and “a few hundred.” Reach for it for CI builds, dev environments, and the great unsexy middle of corporate software where containers replace ten pages of “how to set up the dev environment” with docker compose up. Don’t reach for it as a security boundary for adversarial code — use VMs or specialized sandboxes. Don’t reach for it for stateful production databases when a managed service exists. Don’t pay the Kubernetes cost just because you’re paying the Docker cost; single-host Compose is a perfectly reasonable production target and most teams will never outgrow it. And don’t confuse familiarity with fitness — “we use Docker for everything” is usually a sign that the team stopped asking which tool fits which problem.

What you should now believe. Believe that containers are processes, not machines, and that everything weird about containers follows from that. Believe that images are content-addressed stacks, and that this explains both their elegance and their persistence-of-secrets failure mode. Don’t believe that containers are a secure boundary for untrusted code; they are not. Don’t believe latest means anything; pin everything. Don’t believe Docker is the only container runtime; it is one of several, and the rest of the industry has quietly moved on. When you hear “we containerized our app,” what they probably mean is “we wrote a Dockerfile that builds an image that runs in production.” That’s enough; that’s the whole game; the rest is operational discipline.

The hard-won line: Docker is a packaging format that won, wrapped in a runtime that the rest of the world has moved past. Use the format with confidence; use the runtime with eyes open; understand which is which.

The ideas are mine. The writing is AI assisted
