Kubernetes Deep Intuition

An experienced engineer's guide to Kubernetes

1. One-Sentence Essence

Kubernetes is a control loop over a desired-state database — you tell it what should be true about the system, and a horde of small controllers fight reality until it matches.

Not “container orchestrator.” Not “cluster scheduler.” Those are surface descriptions. The deep truth is: Kubernetes is a giant while(true) that watches a JSON document called desired state and an observed reality called current state, and continuously closes the gap between them. Containers are an implementation detail. Pods, Deployments, Services — implementation details. The loop is the thing.

The rest of this document is just unpacking that sentence.


2. The Problem It Solved

In 2014, you ran services like this: you SSH’d into a machine, ran your binary under systemd or supervisord, watched it for a while, and prayed. If the machine died, you got paged. If you needed three copies, you SSH’d into three machines. If you needed to deploy a new version, you wrote an Ansible playbook that pulled new code, restarted the process, and hoped you got the rolling pause right. Service discovery was an HAProxy config you regenerated nightly, or Consul if you were fancy. Storage was whatever EBS volume you remembered to detach.

This worked. It worked at Google scale, even, with enough engineers. But three things became unbearable at the same time. First, containers happened. Docker (2013) made it trivially easy to package an app with its dependencies. Suddenly you had hundreds of containers per company, not three or four big VMs. Managing them by hand stopped scaling. Second, microservices happened. Companies stopped shipping monoliths and started shipping fleets of small services that all talked to each other. The number of moving parts exploded by an order of magnitude. Third, hardware got cheap and elastic. Cloud meant your fleet could double in 5 minutes for Black Friday and shrink back overnight. The static “this service runs on these three boxes” model broke.

Google had been doing this internally with a system called Borg since 2003. Borg’s core idea was radical: you don’t tell Borg how to run your job. You tell it what should be true — “I want 100 instances of this binary, with these resource requirements, prefer not to put more than 5 on a single rack” — and Borg figures out where to put them, restarts them when they die, evacuates them when machines go down for maintenance, and tells you when it can’t satisfy your request. The operator becomes a declarative document, not a runbook.

Kubernetes (2014) was Borg’s third reincarnation, this time open-sourced and rebuilt around containers. The pitch was: give us your YAML, and we’ll keep your services running. It worked. The 2017 K8s adoption wave was driven not by the brilliance of the design (which has plenty of warts), but by the fact that nothing else in the world let you describe “I want N copies of this service running across these machines, exposed via a stable IP, auto-replaced when they die” in 30 lines of declarative config. Then the cloud providers wrapped it as managed services (EKS, GKE, AKS), and Kubernetes became the default substrate for non-trivial backend systems. That’s where we are now.


3. The Concepts You Need

You cannot reason about Kubernetes without these. Skim if any are familiar; come back if a later section uses one and you’ve forgotten.

Cluster topology

  • Cluster — one Kubernetes “instance.” Has a control plane and one or more worker nodes. From the outside, a cluster looks like a single API endpoint that takes YAML.
  • Node — a machine (VM or physical) that runs containers. Worker nodes run your workloads; control-plane nodes run Kubernetes itself. In managed services like EKS/GKE, you only see worker nodes; the control plane is hidden.
  • Control plane — the brain. The set of components that store cluster state and decide what should run where. Specifically: the API server, etcd, the scheduler, and the controller manager. These typically live on dedicated nodes (or are managed by your cloud provider).
  • kubelet — the agent on each worker node. It receives “you should be running these pods” instructions from the control plane and makes them happen by talking to the local container runtime.
  • Container runtime — the thing on each node that actually runs containers. containerd and CRI-O are the modern options; built-in support for Docker-the-runtime (dockershim) was deprecated in Kubernetes 1.20 (2020) and removed in 1.24 (2022). Don’t confuse Docker-the-image-format (still ubiquitous) with Docker-the-runtime (gone from K8s).

The control plane components

  • API server (kube-apiserver) — the only component anything talks to. Every kubectl command, every controller, every kubelet — they all hit the API server. It validates requests, persists them to etcd, and serves a watch stream that lets clients subscribe to changes. Conceptually: a REST API in front of etcd, with auth and validation.
  • etcd — a distributed key-value store. The single source of truth for the entire cluster. Lose etcd, lose the cluster. Everything else is stateless and rebuildable; etcd is sacred.
  • Scheduler (kube-scheduler) — watches for unscheduled pods and decides which node each one should run on. Considers resource requests, taints, affinity rules, etc. Doesn’t actually start the pod; it just writes the assignment back through the API server, which persists it to etcd.
  • Controller manager (kube-controller-manager) — a single binary running dozens of small controllers. The Deployment controller, the ReplicaSet controller, the Node controller, the Endpoints controller, etc. Each one is a separate reconciliation loop watching its own slice of the world.
  • Cloud controller manager — controllers specific to your cloud provider (provisioning load balancers, managing node lifecycle, etc.). Separated out so the core stays cloud-agnostic.

Workload primitives

  • Pod — the smallest schedulable unit. Not a container — a group of one or more containers that share a network namespace (same IP, same localhost) and storage volumes. 95% of pods have exactly one container. You almost never create pods directly.
  • ReplicaSet — keeps N copies of a pod running. If one dies, it creates a new one. Crude on its own; you don’t usually create these directly either.
  • Deployment — the workload object you actually use for stateless services. A Deployment manages a ReplicaSet, which manages pods. The Deployment layer adds rolling updates, rollback, and revision history.
  • StatefulSet — like a Deployment, but each pod gets a stable identity (my-app-0, my-app-1, …) and stable persistent storage. For databases, queues, anything where the pods are not interchangeable.
  • DaemonSet — runs exactly one copy of a pod on every (or every selected) node. For node-level things: log collectors, monitoring agents, CNI plugins.
  • Job — runs a pod to completion and stops. CronJob schedules Jobs.

Networking and service discovery

  • Pod IP — every pod gets its own routable IP. Pods can reach each other directly without NAT. (How this works depends on the CNI plugin — see below.)
  • Service — a stable virtual IP that load-balances to a set of pods matching a label selector. Pods come and go; the Service IP stays. The fundamental abstraction for “I want to talk to my backend without caring which pod handles it.”
  • Service types: ClusterIP (internal only — the default), NodePort (also exposed on every node’s IP at a high port), LoadBalancer (provisions a cloud load balancer), ExternalName (a DNS CNAME, no pods).
  • Ingress — HTTP-layer routing rules: “send api.example.com/users to this Service, /orders to that one.” Implemented by an Ingress controller (nginx, Traefik, an AWS ALB) — Kubernetes itself doesn’t do HTTP routing, it just stores the rules.
  • CNI (Container Network Interface) — the plugin contract for pod networking. Kubernetes doesn’t implement networking; it shells out to a CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI). The plugin is responsible for assigning pod IPs and making pods able to reach each other.
  • CoreDNS — runs as a Deployment in the cluster and serves DNS for Service names. my-svc.my-ns.svc.cluster.local resolves because CoreDNS watches Services and serves records for them.
  • kube-proxy — runs on every node, programs iptables (or IPVS) rules so that traffic to a Service IP gets DNAT’d to one of the backing pod IPs. The mechanism behind ClusterIP load balancing. Some CNI plugins (Cilium, for example) can replace kube-proxy with an eBPF implementation.

Configuration and secrets

  • ConfigMap — a key-value store for non-secret configuration. Mount as a file or expose as env vars.
  • Secret — like a ConfigMap, but for secrets. Base64-encoded by default — not encrypted. (Yes, this is misleading; we’ll come back to it.)
  • Namespace — a logical partition inside the cluster. Names within a namespace must be unique; across namespaces they can collide. Used for soft multi-tenancy: dev/staging/prod, or per-team boundaries.

Storage

  • Volume — a directory mounted into a pod. Many types: emptyDir (ephemeral, lives with the pod), hostPath (a directory on the node — usually a bad idea), and persistent volumes.
  • PersistentVolume (PV) — a piece of cluster-wide storage, often backed by an EBS volume, a GCE PD, an NFS export, or a Ceph RBD.
  • PersistentVolumeClaim (PVC) — a pod’s request for storage. The PVC is bound to a PV that satisfies it. Pods don’t reference PVs directly; they reference PVCs.
  • StorageClass — a template that defines how PVs get provisioned dynamically. “When a PVC asks for gp3-ssd, provision an AWS gp3 volume in this AZ.”
  • CSI (Container Storage Interface) — the plugin contract for storage drivers, parallel to CNI for networking.

Identity and access

  • ServiceAccount — a pod’s identity within the cluster. Different from a user account.
  • Role / ClusterRole — a set of permissions (“can read pods in namespace X”).
  • RoleBinding / ClusterRoleBinding — assigns a Role to a user, group, or ServiceAccount.

The thing that ties it all together

  • Controller — a control loop that watches some objects in the API server and tries to make reality match them. The Deployment controller watches Deployments and creates ReplicaSets. The ReplicaSet controller watches ReplicaSets and creates Pods. There are dozens of these. They are simple in isolation, complex in aggregate.
  • Reconciliation loop — the controller’s main loop: read desired state, read actual state, do whatever it takes to close the gap, repeat. Idempotent. Always running.
  • CRD (Custom Resource Definition) — a way to teach the API server about new object types. Combined with a custom controller, this is the Operator pattern: package operational knowledge as code that runs the reconciliation loop for your domain.
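
The reconciliation loop described above can be sketched in a few lines of Python. This is a toy model, not Kubernetes code: the reconcile function, the pod names, and the in-memory list standing in for observed state are all hypothetical. The point is only the shape: compare desired with actual, close the gap, and be safe to run again.

```python
import itertools

_ids = itertools.count(1)  # hypothetical source of unique pod names


def reconcile(desired_replicas: int, pods: list) -> list:
    """One pass of a toy reconciliation loop.

    Compares desired state (a replica count) with observed state (a list of
    pod names) and returns the corrected list. Running it again on its own
    output changes nothing, which is what "idempotent" means here.
    """
    pods = list(pods)  # do not mutate the caller's view of reality
    while len(pods) < desired_replicas:   # too few: create
        pods.append(f"pod-{next(_ids)}")
    while len(pods) > desired_replicas:   # too many: delete the newest
        pods.pop()
    return pods


if __name__ == "__main__":
    state = reconcile(3, [])      # initial rollout
    state = state[1:]             # a pod dies, or someone deletes it
    state = reconcile(3, state)   # the same code path heals the gap
    print(state)
```

Note that there is no separate branch for "first deploy" versus "recovery." Both are the same comparison, which is the property the rest of this document keeps returning to.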

If you understood all of these, you already think more clearly about Kubernetes than 80% of people who use it daily. Skip ahead to Section 5 if you want; Section 4 consolidates these concepts into a working walkthrough.


4. The Distilled Introduction

This is the section that replaces the 10-hour tutorial. We’ll walk from “I have a Docker image” to “I have a production-grade workload” in one go. No screenshots, no npm install ceremony, no recap segments.

The setup

You need three things to start: a cluster, kubectl, and a container image. For learning, run kind (brew install kind) or minikube to get a single-node cluster on your laptop in 30 seconds. For real work, use a managed cluster (EKS, GKE, AKS). Never self-host the control plane unless you have a compliance reason. Self-managed clusters are a tax that doesn’t pay back unless you genuinely need the control.

kubectl is the CLI. Configure it with kubectl config use-context <cluster>. Confirm it works:

kubectl cluster-info
kubectl get nodes

kubectl is just a REST client for the API server. Every command becomes an HTTP request. Anything kubectl does, you could do with curl.
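
To make "every command becomes an HTTP request" concrete, here is a small sketch of how an object's apiVersion and kind map onto a URL path. The resource_path helper is hypothetical and written for illustration; the path layout itself (/api/v1 for the core group, /apis/<group>/<version> for everything else) is the API server's documented convention.

```python
from typing import Optional


def resource_path(api_version: str, resource: str,
                  namespace: Optional[str] = None,
                  name: Optional[str] = None) -> str:
    """Build the REST path for a Kubernetes resource (illustrative helper).

    api_version is what you write in YAML: "v1" for the core group,
    "<group>/<version>" (for example "apps/v1") for named groups.
    resource is the lowercase plural, e.g. "pods" or "deployments".
    """
    # The core group lives under /api, every named group under /apis.
    prefix = "/api/" if "/" not in api_version else "/apis/"
    parts = [prefix + api_version]
    if namespace is not None:          # cluster-scoped resources omit this
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name is not None:               # omit the name to address the collection
        parts.append(name)
    return "/".join(parts)


if __name__ == "__main__":
    print(resource_path("v1", "nodes"))
    print(resource_path("apps/v1", "deployments", "default", "web"))
```

With these paths, kubectl get nodes is a GET on the first path printed, and reading the web Deployment is a GET on the second. Running kubectl with a high verbosity flag (for example -v=8) prints the actual requests it sends, if you want to compare.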

Your first deployment

A Deployment is the right primitive for any stateless service. Here’s a minimal one, running three replicas of nginx:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - containerPort: 80

A few things to internalize about this YAML, because they recur everywhere:

  • apiVersion + kind identify the object type. apps/v1 is a group/version; the API server has many groups.
  • metadata.name uniquely identifies the object within its namespace.
  • spec is desired state — what you want to be true. The controller will read this and act.
  • status (not shown — it’s filled in by the controller) is observed state. You read status, you write spec.
  • selector.matchLabels is how the Deployment finds “its” pods. The pod template has matching labels. This labels-and-selectors pattern is everywhere. Services find pods this way too. Get used to it.
  • replicas: 3 says you want three. The Deployment controller will create a ReplicaSet, which will create three Pods. The scheduler will assign each Pod to a Node. Each kubelet will pull the image and run it. All from these 19 lines of YAML.

Apply it:

kubectl apply -f web.yaml
kubectl get pods
kubectl get deployments
kubectl get replicasets

You’ll see one Deployment, one ReplicaSet, three Pods. The ReplicaSet’s name is web-<hash>; the pods are web-<hash>-<random>. The hash is a fingerprint of the pod template — when you change the spec, you get a new ReplicaSet with a new hash, which is how rolling updates work (more on this below).
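
The naming scheme can be sketched as follows. This is an illustration of the idea only: Kubernetes uses its own hashing and encoding for the pod-template-hash, and SHA-256 over JSON is swapped in here as an assumption. The make_deployment, template_hash, and replicaset_name helpers are hypothetical.

```python
import hashlib
import json


def make_deployment(image: str, replicas: int) -> dict:
    """A pared-down Deployment object, mirroring the YAML above."""
    return {
        "metadata": {"name": "web"},
        "spec": {
            "replicas": replicas,
            "template": {
                "metadata": {"labels": {"app": "web"}},
                "spec": {"containers": [{"name": "nginx", "image": image}]},
            },
        },
    }


def template_hash(pod_template: dict) -> str:
    """Fingerprint of the pod template only (stand-in for the real hash)."""
    canonical = json.dumps(pod_template, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]


def replicaset_name(deployment: dict) -> str:
    """ReplicaSet name: <deployment name>-<hash of the pod template>."""
    name = deployment["metadata"]["name"]
    return f"{name}-{template_hash(deployment['spec']['template'])}"
```

Because only spec.template is hashed, changing replicas leaves the name unchanged (scaling reuses the existing ReplicaSet), while changing the image produces a new name and therefore a new ReplicaSet.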

Exposing it

Pods have IPs, but those IPs change when pods die. To talk to your three nginx pods reliably, you create a Service:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80

Apply it. Now web.default.svc.cluster.local resolves to a stable virtual IP, and traffic to that IP gets load-balanced across whichever pods currently match app: web. From any other pod in the same namespace, curl http://web works (DNS search domains supply the namespace suffix); from a pod in another namespace, use http://web.default.
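
How a short name becomes a full one is worth seeing once. The sketch below models the standard resolver search-list behaviour with the search domains and ndots:5 setting that Kubernetes typically writes into a pod's /etc/resolv.conf. The function names are hypothetical, cluster.local is the common default cluster domain, and real pods may carry extra search domains inherited from the node.

```python
def search_domains(namespace: str, cluster_domain: str = "cluster.local") -> list:
    """Search list for a pod in the given namespace (typical defaults)."""
    return [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]


def candidate_names(name: str, namespace: str, ndots: int = 5) -> list:
    """Names the resolver tries, in order, for a lookup from a pod."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search-list expansion
    expanded = [f"{name}.{domain}" for domain in search_domains(namespace)]
    if name.count(".") >= ndots:
        return [name] + expanded   # "enough" dots: try the name as-is first
    return expanded + [name]       # otherwise walk the search list first


if __name__ == "__main__":
    for candidate in candidate_names("web", "default"):
        print(candidate)
```

From a pod in the default namespace, the first candidate for web is web.default.svc.cluster.local, which is why the short name works. From a pod in another namespace the first candidate names a Service that does not exist there, so you address the Service as web.default and let the second search domain complete it.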

This is ClusterIP, the default and the right choice 80% of the time. To expose this to the outside world:

  • type: NodePort — same Service, also reachable on every node’s IP at a high port (30000–32767). Crude but works. Mostly for dev clusters and on-prem.
  • type: LoadBalancer — same Service, plus your cloud provider provisions an external load balancer. Easy and expensive (one ELB per Service).
  • An Ingress — better for HTTP. You run one Ingress controller (which itself is a LoadBalancer Service), then create Ingress objects with host/path routing rules pointing to ClusterIP Services. One LB, many routes.

For HTTP services in production, use an Ingress (or its successor, the Gateway API). Don’t expose every Service as a LoadBalancer.

Rolling updates

This is one of the things Kubernetes genuinely makes effortless. Change image: nginx:1.27 to image: nginx:1.28 in the YAML and kubectl apply again. What happens:

  1. The Deployment controller notices the pod template changed. It computes a new hash. It creates a new ReplicaSet with the new hash, scaled to 1.
  2. Once that pod is Ready, the controller scales the new ReplicaSet up by one and the old one down by one.
  3. Repeat until the new ReplicaSet has three replicas and the old one has zero.

The defaults are sensible: maxSurge: 25% (you can have up to 25% extra pods during the roll, rounded up) and maxUnavailable: 25% (you can have up to 25% missing, rounded down). For a Deployment of three pods, that resolves to a surge of 1 and an unavailability budget of 0, so at any given moment you have between 3 and 4 pods and at least 3 of them are ready. Zero downtime, if your readiness probe is correct.
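
The percentage arithmetic is easy to get wrong for small replica counts, so here is a sketch of it. The helper names are hypothetical; the rounding directions (maxSurge rounds up, maxUnavailable rounds down) follow the Deployment documentation.

```python
def resolve_bounds(replicas: int, max_surge_pct: int = 25,
                   max_unavailable_pct: int = 25) -> tuple:
    """Turn percentage settings into absolute pod counts.

    Integer arithmetic avoids floating-point surprises:
    maxSurge rounds up, maxUnavailable rounds down.
    """
    surge = -(-replicas * max_surge_pct // 100)           # ceiling division
    unavailable = replicas * max_unavailable_pct // 100   # floor division
    return surge, unavailable


def rollout_window(replicas: int) -> tuple:
    """(most pods that may exist, fewest pods that must be available)."""
    surge, unavailable = resolve_bounds(replicas)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, resolve_bounds(n), rollout_window(n))
```

For three replicas the bounds resolve to a surge of 1 and an unavailability of 0; for ten replicas they resolve to 3 and 2. Small Deployments therefore roll one pod at a time, and larger ones roll in wider batches.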

Rollback is kubectl rollout undo deployment/web. Pause a rolling update mid-flight with kubectl rollout pause deployment/web (and continue it with kubectl rollout resume deployment/web). Watch progress with kubectl rollout status deployment/web. These commands are worth remembering: they’re the Kubernetes equivalent of git revert.

Configuration and secrets

You don’t bake config into images. You inject it, in one of two ways: as environment variables or as mounted files. Both start from a ConfigMap:

# ConfigMap — non-secret
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config
data:
  log_level: info
  feature_x: "true"

In the pod spec, reference it as an environment variable:

env:
- name: LOG_LEVEL
  valueFrom:
    configMapKeyRef:
      name: web-config
      key: log_level

Or mount it as a file at /etc/config/log_level. Apps that read config from disk (Postgres, nginx, Java apps with application.properties) usually want the file form.

Secrets are almost the same object with one critical difference in how they are perceived: at rest in etcd, Secrets are only base64-encoded by default, and that is not encryption. Anyone with read access to etcd or to the Secret object via the API can read them. To get actual encryption at rest, you must configure an EncryptionConfiguration for the API server (passed with the --encryption-provider-config flag). For real secret management, integrate with an external system (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager) using the External Secrets Operator or similar. Treat in-cluster Secrets as “slightly safer than ConfigMaps,” nothing more.
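
If the difference between encoding and encryption feels abstract, the following sketch shows how little stands between a Secret's stored value and its plaintext. The function names and the example password are hypothetical; the only thing used is Python's standard base64 module.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """What ends up in the data field of a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """What anyone who can read the Secret can do, with no key required."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    stored = encode_secret_value("hunter2")   # example value, not a real secret
    print(stored)
    print(decode_secret_value(stored))
```

There is no key anywhere in that round trip, which is the whole point: base64 exists so that arbitrary bytes survive being written into YAML or JSON, not to keep them confidential.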

Health checks (probes)

This is the single most under-configured thing in real-world Kubernetes. Without probes, Kubernetes only knows whether your process is alive — not whether your application is healthy. You almost always want both:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

  • Liveness — “is this container fundamentally broken?” If it fails, kubelet restarts the container. Should check internal invariants (deadlock detection, etc.). Should not check downstream dependencies — if your DB is down, restarting your pod won’t help, it’ll just thrash.
  • Readiness — “should this pod receive traffic?” If it fails, the Service excludes the pod from its endpoints. Should check downstream dependencies and warmup state.
  • Startup probe (newer) — “is this thing done starting up yet?” Use for slow-starting JVM-style apps; liveness and readiness checks are held off until the startup probe succeeds.

Get these right. Most production K8s pain that looks like Kubernetes’ fault is actually misconfigured probes.
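
On the application side, the split between the two probes might look like the sketch below. Everything here is an assumption made for illustration: the AppState fields, the probe_status function, and the choice of status codes are hypothetical, and the /healthz and /ready paths simply mirror the probe config above. Kubernetes treats any HTTP status from 200 to 399 as success.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    deadlocked: bool = False          # internal invariant: liveness cares
    warmed_up: bool = False           # caches loaded: readiness cares
    database_reachable: bool = False  # downstream dependency: readiness only


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status code for a probe request."""
    if path == "/healthz":
        # Liveness looks only at the process itself, never at dependencies.
        return 500 if state.deadlocked else 200
    if path == "/ready":
        # Readiness answers "should I get traffic right now?"
        ready = state.warmed_up and state.database_reachable
        return 200 if ready else 503
    return 404


STATE = AppState()  # shared state the real application would update


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocking server; call this from the application's entry point."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

With this split, a database outage makes the pod unready (it drops out of the Service's endpoints) but does not fail liveness, so the kubelet does not restart a container that a restart cannot fix.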

Resource requests and limits

Every container should declare what it needs:

resources:
  requests:
    cpu: 100m       # 0.1 CPU cores
    memory: 256Mi
  limits:
    memory: 512Mi

  • Requests — the floor. The scheduler uses requests to decide where to place a pod (“does this node have 100m of CPU and 256Mi of memory free?”). Once placed, the pod is guaranteed to receive at least its requests.
  • Limits — the ceiling. CPU over-limit causes throttling. Memory over-limit causes OOMKill (the process is killed; the container restarts; if it keeps happening, you get CrashLoopBackOff).

The standard advice for years was “always set both.” The modern, hard-won advice is more nuanced: always set memory request = memory limit, but do not set CPU limits for latency-sensitive services. This is a controversial position, and we’ll dig into it in the Judgment Calls section. For now, accept that resource configuration is a craft, not a checklist.
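
The units and the scheduler's fit question can be sketched as follows. The parsers are deliberately partial and hypothetical: they handle only the forms used in this document (millicores, plain CPU counts, and the binary suffixes Ki, Mi, Gi, Ti), not the full Kubernetes quantity syntax. The fits function is a simplification of one scheduler filter, not the scheduler itself.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu_millicores(quantity: str) -> int:
    """'100m' -> 100, '0.5' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory_bytes(quantity: str) -> int:
    """'256Mi' -> 268435456. Plain integers are taken as bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def fits(node_free: dict, requests: dict) -> bool:
    """Does the node have enough unreserved CPU and memory for the requests?

    Only requests matter here. Limits play no part in placement.
    """
    cpu_ok = (parse_cpu_millicores(requests["cpu"])
              <= parse_cpu_millicores(node_free["cpu"]))
    mem_ok = (parse_memory_bytes(requests["memory"])
              <= parse_memory_bytes(node_free["memory"]))
    return cpu_ok and mem_ok
```

For example, fits({"cpu": "2", "memory": "4Gi"}, {"cpu": "100m", "memory": "256Mi"}) is true, while a node with only 50m of unreserved CPU fails the same check regardless of how idle its cores actually are. Placement is decided on reserved capacity, not on observed usage.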

Storage

For stateless services, you don’t need storage; let pods write to ephemeral disk and lose it when they die. For anything stateful — a database, a cache that takes 10 minutes to warm up, file uploads — you need a PersistentVolumeClaim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  storageClassName: gp3

Mount it in the pod:

volumes:
- name: data
  persistentVolumeClaim:
    claimName: postgres-data
containers:
- ...
  volumeMounts:
  - name: data
    mountPath: /var/lib/postgresql/data

Three things bite people here. First, ReadWriteOnce means “mountable by one node at a time,” and most cloud block storage is RWO. Don’t expect to mount the same EBS volume on three pods on three nodes. If you need that, use a network filesystem (EFS, NFS) with ReadWriteMany, or accept that every pod using the volume must run on the same node. Second, persistent volumes are usually AZ-locked: if your pod gets rescheduled to a node in another AZ, the volume can’t follow. Third, deleting a PVC may or may not delete the underlying volume, depending on the StorageClass’s reclaimPolicy. Check this before you assume kubectl delete is reversible.

For stateful workloads with multiple replicas (a database cluster, a Kafka cluster), use a StatefulSet, not a Deployment. StatefulSets give each pod a stable identity (db-0, db-1, db-2), a stable hostname, and a per-pod PVC. The pods are no longer interchangeable, which is exactly what your database replication logic assumes.

Namespaces, labels, annotations

Use namespaces to separate environments and teams. dev, staging, prod, or team-payments, team-search. They’re a soft boundary — RBAC lives at the namespace level, but they don’t isolate networking by default (every pod can reach every other pod across namespaces unless you write NetworkPolicies).

Labels are queryable metadata you put on objects. app: web, env: prod, team: payments. They’re how Services find pods, how Deployments find their pods, how you select things with kubectl get pods -l env=prod. Labels are load-bearing — if you change a pod’s labels, controllers that selected on those labels will treat it as gone.
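
Equality-based selection is simple enough to write down. The sketch below is a hypothetical model of what a selector such as env=prod does; the pod names and labels are made up, and set-based selectors (in, notin, exists) are left out.

```python
def matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the selector appears in the labels.

    Extra labels on the object are ignored. In this sketch an empty
    selector matches everything.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: dict, pods: dict) -> list:
    """Names of the pods whose labels satisfy the selector, sorted."""
    return sorted(name for name, labels in pods.items()
                  if matches(selector, labels))


PODS = {  # hypothetical cluster contents
    "web-1": {"app": "web", "env": "prod", "team": "payments"},
    "web-2": {"app": "web", "env": "staging", "team": "payments"},
    "search-1": {"app": "search", "env": "prod", "team": "search"},
}
```

Here select({"env": "prod"}, PODS) picks web-1 and search-1, and select({"app": "web", "env": "prod"}, PODS) narrows that to web-1. Nothing in the pods records which selectors point at them; the relationship exists only at query time.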

Annotations are non-queryable metadata. For human notes, integration hints, audit info (“last deployed by alice@company.com”). Annotations don’t affect scheduling or selection.

The kubectl commands you’ll use 50 times a day

kubectl get pods                    # list pods
kubectl get pods -A                 # all namespaces
kubectl get pods -o wide            # with node and IP
kubectl describe pod <name>         # full status, events, recent failures
kubectl logs <pod>                  # logs
kubectl logs <pod> --previous       # logs from the *previous* (crashed) instance
kubectl logs -f <pod>               # follow
kubectl exec -it <pod> -- bash      # shell into a container
kubectl port-forward <pod> 8080     # tunnel localhost:8080 to the pod
kubectl apply -f <file>             # create or update from YAML
kubectl delete -f <file>            # delete
kubectl rollout status deploy/<n>   # watch a rolling deploy
kubectl rollout undo deploy/<n>     # roll back
kubectl edit deploy/<name>          # edit live, danger
kubectl get events --sort-by=.lastTimestamp

The single most useful one for debugging is kubectl describe pod <name>. It shows you events (which often have the actual error in plain English), container state, last termination reason, exit code, and probe failures. When something is broken, start there.

The autoscalers

Two things scale automatically: pods and nodes.

  • Horizontal Pod Autoscaler (HPA) — adjusts a Deployment’s replica count based on metrics (CPU usage by default; can be custom metrics like queue depth). Add an HPA, traffic goes up, replicas go up.
  • Vertical Pod Autoscaler (VPA) — recommends or auto-applies new resource requests based on observed usage. Rarely run in auto-apply mode; very commonly run in recommendation mode.
  • Cluster Autoscaler — adds and removes nodes based on whether pending pods can be scheduled. Uses your cloud’s auto-scaling groups.
  • Karpenter (AWS, increasingly elsewhere) — a smarter Cluster Autoscaler that picks instance types dynamically based on what the pending pods need.
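
The HPA's core calculation is a single ratio. The sketch below follows the formula given in the Kubernetes documentation, desired = ceil(current × currentMetric / targetMetric), with a tolerance band around the target. The function name and the min/max defaults are hypothetical, the 0.1 tolerance reflects the documented default, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count the autoscaler would ask for, in simplified form."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        wanted = current                      # close enough: do nothing
    else:
        wanted = math.ceil(current * ratio)   # scale proportionally
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 3 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(3, 90, 60))
```

The formula is proportional, not predictive: it reacts to the metric it just observed. That is one reason autoscaling lags behind sudden load.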

The classic gotcha: the HPA and Cluster Autoscaler interact. The HPA wants more pods → pods are pending because nodes are full → Cluster Autoscaler adds a node → HPA’s pods schedule. This dance has 2–5 minutes of latency. Don’t expect autoscaling to handle a flash crowd. Pre-warm.

Helm

Real applications aren’t one Deployment and one Service. They’re 5 Deployments, 12 ConfigMaps, 8 Secrets, 4 Services, an Ingress, RBAC, and NetworkPolicies, and you want to deploy all of it to dev, staging, and prod with slight variations. Plain YAML doesn’t compose. The two answers are Helm (template-based packages with values.yaml overrides) and Kustomize (overlay-based, no templates). Helm dominates the ecosystem; nearly every off-the-shelf component (Postgres operators, Prometheus, ingress controllers) ships as a Helm chart. Learn enough Helm to install third-party charts. For your own apps, Kustomize is often cleaner: no templating language, fewer ways to shoot yourself in the foot.

What you’ve learned

You now know enough to deploy a real service: a Deployment with probes and resource configs, a Service in front of it, an Ingress with TLS, a ConfigMap and Secret for config, a PVC for state, an HPA for scale. Add monitoring (Prometheus) and logs (Fluent Bit / Loki) and you have a production-grade workload. The next sections explain why this works the way it does, and where the mines are buried.


5. The Mental Model

Four ideas. If you internalize these, almost everything Kubernetes does stops being surprising. Almost.

Core Idea 1: Desired state is the API. Reality is downstream.

You do not give Kubernetes commands. You describe a world you want to exist. “There should be three pods running this image, exposed at this IP.” You write that into the API server, the API server writes it into etcd, and that’s the end of your turn.

Then a horde of controllers reads that desired state, looks at reality, and starts working to close the gap. The Deployment controller sees “you want three pods” and makes a ReplicaSet. The ReplicaSet controller sees the ReplicaSet and creates Pod objects. The scheduler sees unscheduled Pods and assigns them to Nodes. The kubelet on each Node sees “you should be running this Pod” and starts the container.

This predicts:

  • Why kubectl apply returns instantly even though nothing has happened yet. It returns when the object is persisted, not when reality matches. “It worked” means “I wrote your YAML to the database.” Reality follows, eventually.
  • Why everything Kubernetes does is asynchronous. There is no synchronous “deploy this and wait.” The closest you get is kubectl rollout status, which polls the controller’s reported status until it converges. Convergence is the API; commands are a fiction kubectl puts on top.
  • Why you can kubectl delete a pod and it comes back. Because the Deployment that owns it still says “I want 3 pods.” You deleted the symptom, not the desired state. Delete the Deployment and the pods go away too — because now the desired state is “no pods.”
  • Why Kubernetes self-heals. It’s not a special feature. It’s just that the controllers don’t stop reconciling when you stop watching. A node dies, its pods get rescheduled to other nodes, because the ReplicaSet controller is still trying to satisfy replicas: 3. The healing is the same code path as the initial scheduling. There is no separate “healing logic.”
  • Why the right way to fix things is usually to change the spec, not to take an action. Don’t kubectl exec into a pod and edit /etc/config. Update the ConfigMap and let the reconciliation propagate. The desired state should be the truth, always.

Core Idea 2: Everything is a controller. Controllers are dumb. Their composition is what’s smart.

The control plane isn’t one big intelligence. It is dozens of tiny, single-purpose loops. The Deployment controller does one thing: when a Deployment’s pod template hash changes, manage two ReplicaSets to do a rolling update. It does not know about pods. It does not know about nodes. It only knows ReplicaSets.

The ReplicaSet controller does one thing: keep N pods alive. It does not know about Deployments. It does not know about images or rolling updates.

The scheduler does one thing: take pods with no nodeName set and assign one. It does not start them. It does not know if the node is healthy. It just writes a name back to the object.

The kubelet does one thing: for the pods assigned to my node, make them be running. It doesn’t know they’re part of a Deployment. It doesn’t know about Services.

This predicts:

  • Why Kubernetes is extensible. You can add your own controller (an Operator) that watches your own Custom Resource and creates Deployments, Services, etc. — just like the built-in controllers do. There is no special API for built-ins. You play by the same rules.
  • Why intermittent control-plane outages don’t kill workloads. If etcd or the API server goes down, controllers can’t reconcile, but kubelets keep running the pods they already know about. The cluster freezes; running workloads keep serving traffic. (This is partly true and partly a comforting lie — see the Monzo postmortem in Section 11.)
  • Why so many things in Kubernetes feel “loosely coupled to the point of being unhelpful.” A Service has a label selector. Pods have labels. Nothing checks at create time that the Service’s selector matches any pods. It’s just two objects in etcd that happen to have a relationship. You can have a Service with no endpoints (selector matches nothing) and Kubernetes will not warn you. The Endpoints controller will simply observe “no matching pods” and report zero endpoints.
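
The composition can be modelled in miniature. In the sketch below, each function reads and writes only its own slice of a shared store, a plain dict standing in for the API server and etcd. All names are hypothetical, and the real controllers are event-driven and far more careful; this only shows that four narrow loops, none aware of the others, add up to "three running pods."

```python
def new_store(replicas: int, nodes: list) -> dict:
    return {"deployments": {"web": {"replicas": replicas}},
            "replicasets": {}, "pods": {}, "nodes": list(nodes), "next_id": 0}


def deployment_controller(store: dict) -> None:
    """Knows only Deployments and ReplicaSets."""
    for name, dep in store["deployments"].items():
        rs = store["replicasets"].setdefault(f"{name}-rs", {"replicas": 0})
        rs["replicas"] = dep["replicas"]


def replicaset_controller(store: dict) -> None:
    """Knows only ReplicaSets and Pods."""
    for rs_name, rs in store["replicasets"].items():
        owned = [p for p, pod in store["pods"].items() if pod["owner"] == rs_name]
        for _ in range(rs["replicas"] - len(owned)):   # too few: create
            store["next_id"] += 1
            store["pods"][f"{rs_name}-{store['next_id']}"] = {
                "owner": rs_name, "node": None, "running": False}
        for extra in owned[rs["replicas"]:]:           # too many: delete
            del store["pods"][extra]


def scheduler(store: dict) -> None:
    """Knows only unassigned Pods and the node list. Starts nothing."""
    for pod in store["pods"].values():
        if pod["node"] is None:
            load = {node: 0 for node in store["nodes"]}
            for other in store["pods"].values():
                if other["node"] in load:
                    load[other["node"]] += 1
            pod["node"] = min(load, key=load.get)      # least-loaded node


def kubelet(store: dict, node: str) -> None:
    """Knows only the Pods assigned to its own node."""
    for pod in store["pods"].values():
        if pod["node"] == node:
            pod["running"] = True


def run_once(store: dict) -> None:
    """One pass of every loop. The real ones run concurrently, forever."""
    deployment_controller(store)
    replicaset_controller(store)
    scheduler(store)
    for node in store["nodes"]:
        kubelet(store, node)
```

Deleting a pod from the store and calling run_once again brings the count back to three, without any function having been written specifically for recovery.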

Core Idea 3: Labels and selectors, not pointers.

Things in Kubernetes don’t reference each other by ID. They match on labels. A Deployment doesn’t say “I own these pod IDs.” It says “I own everything matching app: web, pod-template-hash: 7a8f...”. A Service doesn’t say “send to these specific pods.” It says “send to anything labeled app: web.”

This is enormously powerful and occasionally horrifying.

This predicts:

  • Why the Deployment controller can survive losing all its pods. The controller queries the API for pods matching its selector, finds zero, creates more. It doesn’t keep state outside etcd.
  • Why you can manually create a pod with the right labels and a Service will route to it. This is occasionally useful for debugging — bring up a pod with app: web and watch traffic flow to it.
  • Why label changes on a running pod are dangerous. Change app: web to app: web-debug and the Service will stop routing to it (no longer matches selector), the ReplicaSet will think one of its pods is missing and create a new one. The “debug” pod is now an orphan that nothing manages. This is occasionally what you want — kubectl label pod web-xyz app=quarantine is a real debugging move — but only if you understand it.
  • Why two Deployments with overlapping selectors fight each other. The Kubernetes docs warn about this; it’s not enforced. Two ReplicaSets, each thinking they own the same pod, will continuously create and delete pods. This is a real outage mode, not a thought experiment.
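
The relabelling scenario above is worth simulating once, because the outcome surprises people. This is a toy model with hypothetical names (service_endpoints, replicaset_reconcile, and the pod names); it only captures the selector logic, not the real controllers.

```python
import itertools

_suffix = itertools.count(1)  # hypothetical source of unique pod names


def matches(selector: dict, labels: dict) -> bool:
    return all(labels.get(k) == v for k, v in selector.items())


def service_endpoints(selector: dict, pods: dict) -> list:
    """Pods a Service with this selector would route to."""
    return sorted(name for name, labels in pods.items()
                  if matches(selector, labels))


def replicaset_reconcile(selector: dict, replicas: int, pods: dict) -> dict:
    """Create replacements for any pods the selector no longer finds."""
    owned = [name for name, labels in pods.items() if matches(selector, labels)]
    for _ in range(replicas - len(owned)):
        pods[f"web-replacement-{next(_suffix)}"] = dict(selector)
    return pods


if __name__ == "__main__":
    pods = {"web-a": {"app": "web"}, "web-b": {"app": "web"}, "web-c": {"app": "web"}}
    pods["web-a"]["app"] = "quarantine"            # relabel one running pod
    print(service_endpoints({"app": "web"}, pods)) # traffic stops reaching web-a
    replicaset_reconcile({"app": "web"}, 3, pods)  # a replacement appears
    print(sorted(pods))                            # four pods now exist
```

After the relabel there are four pods: three serving traffic and one quarantined pod that no controller owns. It will sit there, consuming resources, until someone deletes it by hand.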

Core Idea 4: The pod is the atom of scheduling. Containers are an implementation detail.

A pod is not “a container in Kubernetes.” A pod is “a co-located group of containers that share a network namespace and a lifecycle.” 95% of pods have one container. The 5% that don’t are using sidecars.

A sidecar is a second container in the same pod, running alongside your main app, sharing its localhost. Common patterns: a log forwarder reading from a shared emptyDir volume; a service mesh proxy (Istio’s Envoy, Linkerd’s linkerd2-proxy) that intercepts your network traffic; a config reloader watching a remote source and writing to the volume your main app reads.

This predicts:

  • Why pod IPs work the way they do. All containers in a pod share the same IP. They reach each other via localhost. The pod is the network endpoint, not the container.
  • Why service meshes work transparently. Istio injects an Envoy sidecar into every pod, and configures iptables in the pod’s network namespace to redirect all traffic through the sidecar. Your app doesn’t know. The mesh sees and controls all traffic. (The cost: every pod now has two containers, two log streams, twice the failure surface, and a hard dependency on the mesh control plane. There’s no free lunch.)
  • Why “restart the container” doesn’t always mean “restart the pod.” kubelet can restart a container in place when its liveness probe fails — same pod IP, same volumes, same sidecars. The pod object is unchanged. Compare to “the pod gets evicted and rescheduled,” which is a whole different blast radius.

Putting them together

The four ideas reinforce each other. Desired state is the API (1) → controllers reconcile against it (2) → controllers find their work via labels, not IDs (3) → the unit of work they manipulate is the pod, not the container (4). The whole edifice is built on those four bricks. When something surprising happens, you can almost always trace the surprise back to one of them. “Why did my pod come back after I deleted it?” — desired state. “Why is nothing happening when I created my Deployment?” — controllers. “Why is my Service routing to a pod I don’t think it should?” — labels. “Why is my CPU limit affecting another container in the same pod?” — pod-as-atom (cgroups are at the pod level, sometimes).

You’ll meet these four ideas wearing different costumes for the rest of your career with this technology.


6. The Architecture in Plain English

Let’s walk through what actually happens when you kubectl apply -f deployment.yaml. This is the single most clarifying mental exercise for understanding Kubernetes — every other operation is a variation on this flow.

The journey of one Deployment

  1. kubectl parses the YAML and POSTs it to the API server. Authentication is via your kubeconfig (a cert, a token, or an OIDC integration). The API server checks: are you allowed to create Deployments in this namespace (RBAC)? Does the YAML schema validate? Do any admission webhooks (custom validators or mutators) accept it? If yes, the Deployment object is written to etcd. The API server returns 201 to kubectl. As far as you’re concerned, you’re done. Total elapsed time: ~50ms.

  2. The Deployment controller wakes up. It has been watching the API server’s stream of changes (a watch on the Deployments resource) and just received a “Deployment created” event. It reads the new Deployment, compares to existing ReplicaSets owned by it (none), and creates a new ReplicaSet object. This is itself a POST to the API server, which writes the ReplicaSet to etcd.

  3. The ReplicaSet controller wakes up. It sees a new ReplicaSet that wants three replicas, but currently has zero pods. It creates three Pod objects. (Each one is a POST to the API server, written to etcd.) The Pods are created with nodeName unset — they’re unscheduled.

  4. The scheduler wakes up. It sees three new unscheduled Pods. For each, it runs the scheduling algorithm: filter nodes that can fit the pod (enough CPU, memory, matching nodeSelector, tolerating taints, etc.), score the survivors, pick the best one. It writes the chosen nodeName back to each Pod object via the API server.

  5. kubelet on the chosen node wakes up. Each kubelet watches the API server for pods assigned to its own node. It sees a new pod, pulls the container image (cache hit if recent), creates the container via the container runtime (containerd), wires up the network via the CNI plugin (which assigns an IP, creates a veth pair, sets up routes), mounts any volumes via the CSI plugin, and starts the container. As the container becomes ready, kubelet updates the Pod’s status.

  6. Probes start running. kubelet runs the readiness probe. Once it passes, kubelet marks the pod as Ready: true.

  7. The EndpointSlice controller wakes up (assuming a Service selects this pod’s labels). It queries pods matching the Service’s selector, finds the new ready pod, and adds its IP to one of the Service’s EndpointSlice objects. (The older Endpoints controller maintains the equivalent Endpoints object.) The Service’s “list of backends” now includes this pod.

  8. kube-proxy on every node wakes up. It watches EndpointSlices and updates iptables (or IPVS, or eBPF) rules so that traffic to the Service IP gets DNAT’d to one of the backend pods. New pods become routable.

  9. CoreDNS reflects the change if it cares (only relevant for headless Services). All Service DNS names continue to resolve as before.

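That entire flow happens in 5–30 seconds for a fresh deployment, mostly bottlenecked on image pull. For a rolling update, the same flow repeats in batches, with the batch size bounded by the Deployment’s maxSurge and maxUnavailable settings (25% each by default).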

The thing to notice: no central orchestrator coordinated this. Each component independently watched its slice of the API (only the API server talks to etcd directly), did its thing, and wrote results back through the API server. The choreography emerges from independent loops. There’s no if (deployment_created) then (create_replicaset, then create_pods, then schedule_them) anywhere. There’s just controllers, watching, reacting.
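The choreography can be sketched as a toy simulation. Everything here is a simplifying assumption: a dict stands in for the API server and etcd, polling stands in for watches, and the node names, slot counts, and scoring rule are made up. The controllers are deliberately run in the "wrong" order to show that nothing depends on sequencing.

```python
import itertools

store = {"deployments": {}, "replicasets": {}, "pods": {}}  # stand-in for the API
nodes = {"node-a": 2, "node-b": 2}   # hypothetical nodes -> free pod slots
_ids = itertools.count(1)

def deployment_controller(s):
    changed = False
    for name, dep in s["deployments"].items():
        if name + "-rs" not in s["replicasets"]:
            s["replicasets"][name + "-rs"] = {"replicas": dep["replicas"]}
            changed = True
    return changed

def replicaset_controller(s):
    changed = False
    for rs_name, rs in s["replicasets"].items():
        owned = [p for p in s["pods"].values() if p["owner"] == rs_name]
        for _ in range(rs["replicas"] - len(owned)):
            pod_name = f"{rs_name}-{next(_ids)}"
            s["pods"][pod_name] = {"owner": rs_name, "nodeName": None, "phase": "Pending"}
            changed = True
    return changed

def scheduler(s):
    changed = False
    for pod in s["pods"].values():
        if pod["nodeName"] is None:
            fits = [n for n, free in nodes.items() if free > 0]   # filter
            if fits:
                best = max(fits, key=lambda n: nodes[n])          # score
                nodes[best] -= 1
                pod["nodeName"] = best
                changed = True
    return changed

def kubelet(s):
    changed = False
    for pod in s["pods"].values():
        if pod["nodeName"] is not None and pod["phase"] == "Pending":
            pod["phase"] = "Running"
            changed = True
    return changed

# No controller calls another; each one only looks at the shared state.
controllers = [kubelet, scheduler, replicaset_controller, deployment_controller]

def run_until_settled(s):
    rounds = 0
    # The list comprehension runs every controller each round (no short-circuit).
    while any([controller(s) for controller in controllers]):
        rounds += 1
    return rounds

store["deployments"]["web"] = {"replicas": 3}   # the only "kubectl apply"
rounds = run_until_settled(store)
print(rounds, store["pods"])
```

With this ordering it takes four rounds to converge: one each for the ReplicaSet, the Pods, the scheduling decisions, and the Running status. Reordering the list changes the round count but not the end state.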

Where state actually lives

This is the answer to most “where would I look” questions:

  • Desired state, current observed state, identity, ownership — etcd. All of it. There is no other source of truth in the control plane.
  • The pods that are actually running on a node — the kubelet on that node knows, plus the container runtime. The API server’s view (status.containerStatuses) is the kubelet’s report and may lag.
  • Network rules — iptables on every node, programmed by kube-proxy. (Or IPVS rules, or eBPF programs in Cilium.) When a Service has 1000 endpoints, every node has rules for all 1000.
  • DNS records — CoreDNS’s in-memory state, populated by watching Services. There are no DNS zone files.
  • Persistent data — your storage backend (EBS, GCE PD, Ceph, NFS). Kubernetes only stores references to volumes (PV objects); the data lives elsewhere, fortunately, because etcd is the wrong place for it.
  • Logs and metrics: not in Kubernetes at all. Kubernetes doesn’t aggregate logs or metrics. You install separate systems (Loki, Prometheus, Datadog, Cloud Logging) for that. This surprises people.

The threading model

The control plane components are mostly written in Go and use Go’s concurrency model heavily. Each controller is its own loop with its own work queue; controllers don’t share state directly, they share it through the API server (and so, ultimately, etcd). There is heavy use of informers: caches that subscribe to API server watches and give controllers fast local reads instead of hammering the API server. Without informers, every controller would saturate the API server with reads. With them, the API server mostly serves writes and long-lived watch streams.

This is why the API server is the scalability bottleneck of a cluster. Big clusters (5000+ nodes, tens of thousands of pods) routinely run into API server limits. The mitigations are sharding the controller manager workload, increasing API server replicas, or — and this is what Google does internally — using a fundamentally different architecture (Borg). Kubernetes is not designed for the scale Borg is.

What “high availability” actually means

A “highly available” Kubernetes cluster runs three or more control-plane nodes, each with its own copy of the API server and a member of the etcd cluster. etcd uses the Raft consensus protocol, which requires a majority — so 3 nodes tolerates 1 failure, 5 tolerates 2. The API servers are stateless and load-balanced; you can run as many as you want.
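The majority rule is easy to check with arithmetic. This is a small sketch of the quorum math only, not of Raft itself:

```python
def quorum(members):
    """Smallest majority of a cluster with this many members."""
    return members // 2 + 1

def tolerated_failures(members):
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)

for size in (1, 2, 3, 4, 5):
    print(size, "members: quorum", quorum(size), "tolerates", tolerated_failures(size))
```

Note what the even sizes show: 4 members tolerate no more failures than 3, which is why etcd clusters are normally sized at 3 or 5.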

Critically: HA of the control plane is independent of HA of your workloads. Your pods being in three replicas across three AZs doesn’t depend on the control plane being HA. The control plane being HA doesn’t mean your service is HA. These are two separate axes. Most outages I’ve seen are workload-availability outages (one replica, one AZ, no PDB, no anti-affinity) rather than control-plane outages.


7. The Things That Bite You

These are the gotchas that make Kubernetes feel hostile in your first six to twelve months. Each one is a direct consequence of something in the mental model. None are bugs.

Gotcha 1: Memory limit triggers OOMKill, but CPU limit only throttles

You’d expect: limits are limits, going over them is bad in a similar way for both resources. What actually happens: exceeding a memory limit kills your container; exceeding a CPU limit just slows it down. CPU is a compressible resource, memory isn’t. The Linux kernel’s CFS quota mechanism enforces CPU limits by throttling the process: it pauses you once you’ve used your quota for the current 100ms window. Memory can’t be throttled that way; once the container’s cgroup hits its limit and nothing can be reclaimed, the kernel’s OOM killer steps in.

Why it bites: people set memory limits hoping for “graceful degradation under pressure” and instead get loud, sudden death: OOMKilled, exit code 137, then CrashLoopBackOff if it keeps happening. This connects to the QoS class concept: pods where every container has requests == limits for both CPU and memory get the Guaranteed class and are killed last under node pressure. (A container that sets only limits has its requests defaulted to the same values, so it counts too.) Pods that set some requests or limits but fall short of that bar, for example limits > requests, are Burstable and get killed earlier. Pods with neither are BestEffort and get killed first. You have a kill order in your cluster that activates exactly when you don’t want it to.
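The classification rules can be sketched as a small function. Assumptions: containers are plain dicts holding only cpu and memory entries, and quantities are compared as strings (so "500m" and "0.5" would not count as equal here, although Kubernetes treats them as the same amount).

```python
RESOURCES = ("cpu", "memory")

def qos_class(containers):
    """containers: list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A request that is left out defaults to the limit, if a limit is set.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in RESOURCES:
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
print(qos_class([{"requests": {"memory": "1Gi"}, "limits": {"memory": "2Gi"}}]))
print(qos_class([{}]))
```

The first call is the case that surprises people: limits with no explicit requests still lands in Guaranteed.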

How to handle: set resources.requests.memory equal to resources.limits.memory for anything you care about. For workloads where bursts above the request are normal, monitor working-set size for two weeks, set the request at P95, set the limit with 30–50% headroom on top. Accept that “monitor and adjust” is the steady state; Kubernetes resource configuration is never “set and forget.”

Gotcha 2: CPU limits cause silent latency tail spikes

You set a CPU limit of 1 core on a service. The service averages 200m. On kubectl top pods, you see no CPU pressure. But your P99 latency is mysteriously 200ms while P50 is 5ms. What’s happening: the CFS quota mechanism allocates CPU time in 100ms windows, and a 1-core limit means 100ms of CPU time per window, summed across all threads. If a burst of parallel threads uses up that 100ms of quota in the first 30ms of a window, the pod gets throttled for the remaining 70ms. From the OS’s perspective, the rest of the node may be idle, but your pod is paused. P99 is exactly the request that hit the throttling window.
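The arithmetic is easy to model. This is a simplified sketch, not the kernel's algorithm: it assumes the burst starts exactly at a window boundary, the work parallelises perfectly across the threads, and nothing else runs in the cgroup.

```python
def burst_wall_time_ms(cpu_ms_needed, threads, quota_ms=100.0, period_ms=100.0):
    """Return (wall_clock_ms, throttled_ms) for a burst starting at a period boundary."""
    remaining = float(cpu_ms_needed)
    wall = throttled = 0.0
    while remaining > 0:
        # CPU time usable in this period: capped by the work left, the quota,
        # and what the threads can physically burn in one period.
        burn = min(remaining, quota_ms, threads * period_ms)
        run_wall = burn / threads
        remaining -= burn
        if remaining > 0:
            throttled += period_ms - run_wall   # paused until the next period
            wall += period_ms
        else:
            wall += run_wall
    return wall, throttled

# A request needing 150ms of CPU, spread over 4 threads, under a 1-core limit:
print(burst_wall_time_ms(150, threads=4))                   # (112.5, 75.0)
# The same request with a quota too large to matter for 4 threads:
print(burst_wall_time_ms(150, threads=4, quota_ms=400.0))   # (37.5, 0.0)
```

The request that could have finished in 37.5ms takes 112.5ms, and 75ms of that is spent paused while the average CPU usage still looks low.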

Why it bites: this connects to the pod-as-atom idea (cgroup limits are enforced by the kernel) and to CFS’s design (quota is per-window, not per-second). The defaults of CFS used to be even worse — there were kernel bugs where multiple threads would each accumulate throttle independently. They’re better now, but the fundamental pattern persists.

How to handle: for latency-sensitive services, don’t set CPU limits. Set CPU requests (so the scheduler reserves capacity), but leave the limit unset. Under contention the kernel still shares CPU in proportion to requests, because the kubelet translates each request into a cgroup CPU weight (cpu.shares on cgroup v1). This is controversial advice that contradicts years of best-practice articles, but it’s what many experienced operators and Kubernetes maintainers (Tim Hockin, the K8s SIG-Node leads) have converged on. See: Henning Jacobs’ Zalando talks. Memory limits stay; CPU limits go.

Gotcha 3: Liveness probes can take down your service

You configure a liveness probe that hits /health, which queries the database. The database has a brief blip. Your liveness probe fails. Kubelet restarts the container. The container restarts and reconnects, but the database is still slow because of the original blip. Liveness probe fails again. CrashLoop. While you’re crashlooping, you’re not serving traffic. While you’re not serving traffic, your queue depth somewhere is exploding. Welcome to a self-amplifying outage.

Why it bites: liveness probes are about “is this process broken?” not “is the system healthy?” Confusing the two means you’ve turned a degraded-but-functional state into a hard outage. A perfectly healthy container can fail a liveness probe that’s actually checking external dependencies. Restarting it doesn’t help.

How to handle: liveness checks should test only internal invariants — deadlock detection, internal state corruption. Readiness probes can be more permissive about checking dependencies (failing readiness pulls you out of the load balancer; failing liveness restarts you). When in doubt, don’t configure a liveness probe at all. A pod that’s unable to serve traffic but isn’t dying will be removed from the Service via readiness, and a deployer will eventually notice. That’s almost always better than crash-restart loops.

Gotcha 4: DNS misery (it is always DNS)

Kubernetes’ DNS subsystem (CoreDNS) is fast under load but is one of the most common sources of cluster-wide outages. Symptoms: random services intermittently can’t reach random other services, then it clears up, then it comes back. The causes are many:

  • ndots: 5 and search domains. Pod /etc/resolv.conf defaults to ndots: 5, meaning any name with fewer than 5 dots is treated as relative and the resolver tries every search domain (three cluster domains plus any inherited from the node; 5 in this example) before trying the name as absolute. So curl api.external.com becomes 5 failed lookups + 1 successful one = 6x DNS load and added latency. Fix: add a trailing dot to fully-qualified names (api.external.com.), or set ndots: 2 in the pod’s dnsConfig.
  • Alpine’s musl libc DNS resolver sends A and AAAA queries in parallel and gets confused when one is slow. Toyota Connected hit this hard in 2018. Fix: use Debian-based images, or carefully test Alpine in your environment.
  • CoreDNS pod count is too low. Default deployments often have 2 CoreDNS pods. At thousands-of-QPS scale, they get overwhelmed. Fix: scale up, or use NodeLocal DNSCache (a DaemonSet that caches per-node).
  • Conntrack table races on the kernel. Under high DNS QPS, conntrack entries collide and DNS packets get dropped. Fix: enable single-request-reopen in the resolver, or use NodeLocal DNSCache.
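The ndots behaviour from the first bullet can be sketched as a function that lists the names a resolver would try, in order. The search list is an assumption: the three cluster-generated domains for a pod in the default namespace. Real pods often inherit more from the node, and each candidate is typically queried for both A and AAAA.

```python
def lookup_order(name, search, ndots=5):
    """Names a stub resolver tries, in order, for a given resolv.conf setup."""
    if name.endswith("."):
        return [name]                       # trailing dot: already absolute
    absolute = name + "."
    via_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + via_search      # enough dots: try as-is first
    return via_search + [absolute]          # too few dots: search list first

# Assumed search list for a pod in the "default" namespace.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in lookup_order("api.external.com", search):
    print(candidate)
print(lookup_order("api.external.com", search, ndots=2)[0])
print(lookup_order("api.external.com.", search))
```

With ndots: 5 the real name is tried last, after one failed lookup per search domain. With ndots: 2 or a trailing dot it is tried first.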

Why it bites: DNS is the point where every service in your cluster touches a shared component (CoreDNS), the kernel’s networking stack, and resolver assumptions baked into your base image. It is genuinely the part of Kubernetes most likely to bite you in production, and it is genuinely worth dedicated investment.

How to handle: install NodeLocal DNSCache from day one. Set ndots: 2 for your pods unless you know you need 5. Use Debian-slim base images for anything that does heavy DNS. Monitor CoreDNS QPS and error rate explicitly.

Gotcha 5: PVCs don’t follow pods across AZs

You have a StatefulSet with a PVC backed by an EBS volume. The pod is in us-east-1a. The node in us-east-1a becomes unhealthy. Kubernetes wants to reschedule the pod. But the EBS volume can’t be attached to a node in us-east-1b. The pod is stuck in Pending forever. Cluster Autoscaler scales up a new node in us-east-1a if it can — but if your AZ is genuinely down, you’re stuck.

Why it bites: persistent volumes are usually AZ-locked. The cluster autoscaler and scheduler are aware of this in modern Kubernetes (volume zone predicates), but it’s still a fundamental limitation: a stateful pod’s blast radius is the AZ of its volume. This is why “stateful workloads in K8s” is so much harder than stateless — the orchestrator can’t fully abstract the failure domain.

How to handle: for stateful workloads, run a per-AZ StatefulSet (3 replicas, one per AZ, with anti-affinity). Use storage that supports replication at the storage layer (Portworx, Longhorn, or your DB’s own replication). For the most critical stateful workloads, your primary database above all, strongly consider not running them in Kubernetes at all. Use the cloud provider’s managed service (RDS, Cloud SQL, Spanner).

Gotcha 6: Two Deployments with overlapping selectors will eat each other

You have a Deployment with selector: app=api. You create a second Deployment with selector: app=api (different name, same selector). Both controllers think they own all the pods matching that selector. Both try to enforce their replica counts. They create and delete each other’s pods continuously. Kubernetes will not stop you. The docs explicitly warn about it; the system does not.

Why it bites: this is Core Idea 3 (labels, not pointers) at its most brutal. There is no notion of “ownership” enforced at create time; ownership is observed by selectors at runtime, and conflicting observations lead to chaos.

How to handle: treat label selectors as namespaced. Always include app: <unique-name> in your selector. Modern tooling (Helm, Kustomize) generally handles this for you, but check before pasting things together.

Gotcha 7: kubectl apply and three-way merge surprises

kubectl apply doesn’t just overwrite the live object with your YAML. In its default client-side mode it does a three-way merge between (a) your YAML, (b) the live state, and (c) the previously applied version (stored in the kubectl.kubernetes.io/last-applied-configuration annotation). Fields you removed from your YAML are removed from the live state, but only if they were in your previous YAML. Fields you never managed (e.g., set by an admission controller) are left alone.

This sounds reasonable until you do kubectl edit on something, then kubectl apply later, and your edit gets clobbered — or doesn’t, depending on which fields. Or you switch from apply to replace, lose the annotation, and the merge logic gets weird.
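The merge rule can be sketched on flat dictionaries. This is a deliberate simplification: real objects are nested and lists have their own merge strategies, and the field names and values below are hypothetical.

```python
def three_way_merge(last_applied, live, desired):
    """Flat-dict sketch of the merge: returns the object that would be written."""
    merged = dict(live)
    for key in last_applied:
        if key not in desired:
            merged.pop(key, None)   # you managed it before and removed it: delete
    merged.update(desired)          # anything you declare wins over live state
    return merged

last_applied = {"replicas": 3, "image": "web:1.0", "debug": "true"}
live = {
    "replicas": 5,           # someone ran `kubectl scale`
    "image": "web:1.0",
    "debug": "true",
    "injected": "sidecar",   # added by an admission controller
    "nodeSelector": "ssd",   # added by hand with `kubectl edit`
}
desired = {"replicas": 3, "image": "web:1.1"}

print(three_way_merge(last_applied, live, desired))
```

The manual scale is clobbered because replicas is in your YAML, debug is deleted because you used to manage it, and the hand-edited nodeSelector survives because it was never in any applied version. That last case is the “or doesn’t, depending on which fields” from above.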

Why it bites: Kubernetes wants both human-friendly imperative tools (kubectl edit, kubectl scale) and GitOps-friendly declarative ones (kubectl apply). They don’t compose perfectly. The consequence is that “the YAML in git” is not always exactly what’s in the cluster, even when you think it is.

How to handle: pick a model and stick with it. If you’re using GitOps (ArgoCD, Flux), let the tool be the only thing applying changes. Don’t kubectl edit on objects under GitOps management. For everything else, prefer kubectl apply from version-controlled YAML.

Gotcha 8: Helm hooks, finalizers, and the “I can’t delete this thing” trap

You kubectl delete namespace foo. It hangs forever in Terminating. The cause is almost always a finalizer: an entry in the metadata.finalizers list of some object inside the namespace that blocks deletion until something cleans up and removes it. Common culprits: a CRD’s controller is gone but its finalizer is still on its CRs; a PVC is finalized by a CSI driver that’s no longer running; a webhook is configured but unreachable.

Why it bites: finalizers are how Kubernetes ensures cleanup logic runs before deletion. If the cleanup logic is broken (controller gone, webhook unreachable), the object is wedged. The system is “correct” — it’s waiting for a promised cleanup that never comes — but indistinguishable from a hang.

How to handle: find the stuck object (kubectl get all,pvc,configmap -n foo is a start, but it won’t list custom resources, which are the usual culprits; kubectl api-resources --verbs=list --namespaced -o name lists every type worth checking), inspect its metadata.finalizers, and either fix the underlying controller or, only as a last resort, kubectl patch the finalizer off (kubectl patch <obj> -p '{"metadata":{"finalizers":null}}' --type=merge). The latter is the K8s equivalent of git push --force: it works, it’s irreversible, and you’ll regret using it casually.
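The wedge is easier to see as a tiny state machine. This is a sketch with plain dicts; the object and the finalizer name example.com/cleanup are hypothetical.

```python
def removable(obj):
    """An object leaves storage only once it is marked AND its finalizer list is empty."""
    return "deletionTimestamp" in obj and not obj["finalizers"]

def request_delete(obj):
    """What a delete call triggers: mark the object, remove it only if nothing blocks."""
    obj["deletionTimestamp"] = "now"   # placeholder; the real field holds a timestamp
    return removable(obj)

def controller_cleanup(obj, finalizer):
    """A healthy controller does its cleanup, then removes its own finalizer."""
    if "deletionTimestamp" in obj and finalizer in obj["finalizers"]:
        obj["finalizers"].remove(finalizer)

# Hypothetical custom resource guarded by a hypothetical finalizer.
cr = {"name": "my-db", "finalizers": ["example.com/cleanup"]}

stuck = not request_delete(cr)    # no controller has run yet: stays Terminating
controller_cleanup(cr, "example.com/cleanup")
gone = removable(cr)              # cleanup ran: deletion can complete

print(stuck, gone)
```

If controller_cleanup never runs because the controller is gone, the object stays in the stuck state indefinitely. Patching the finalizer list to empty skips the cleanup, which is exactly why it is a last resort.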

Gotcha 9: A misconfigured admission webhook can take down your cluster

An admission webhook intercepts every API write that matches its rules and validates or mutates it. If you configure a webhook with failurePolicy: Fail and the webhook becomes unreachable (its pods are evicted, its Service has no endpoints, network is broken), every matching API write fails. Including writes to fix the webhook’s own pods. Including writes to evict pods. Sometimes including kubelet’s heartbeat updates if you scoped the webhook too broadly.

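This is not theoretical. Jetstack documented exactly this taking down a cluster. Recovery normally means deleting or patching the offending ValidatingWebhookConfiguration or MutatingWebhookConfiguration; the API server never calls admission webhooks for those configuration objects themselves, precisely so that this escape hatch stays open. It gets harder when the broken webhook also blocks the writes that keep the cluster running (node heartbeats, pod evictions), so fix the configuration first and the webhook second.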

Why it bites: admission webhooks are how Kubernetes is extensible at the boundary, but they are in the synchronous critical path of every write. A broken webhook is an outage of all writes to the resources it scopes.

How to handle: use failurePolicy: Ignore for non-critical webhooks. Scope webhooks tightly (specific namespaces, specific resources). Always exclude the kube-system namespace from your webhooks. And run your webhook controller in HA — at least 2 replicas with a PodDisruptionBudget.
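A sketch of how failurePolicy changes the outcome when the webhook cannot be reached. The webhook functions and the request shape are hypothetical stand-ins; the real API server makes an HTTPS call with a timeout.

```python
def admit(request, call_webhook, failure_policy="Fail"):
    """Return True if the write is allowed, False if it is rejected."""
    try:
        return call_webhook(request)
    except ConnectionError:
        # The webhook could not be reached; failurePolicy decides the outcome.
        return failure_policy == "Ignore"

def unreachable_webhook(request):
    raise ConnectionError("webhook Service has no endpoints")

def strict_webhook(request):
    # Hypothetical rule: reject pods that use the :latest tag.
    return not request["image"].endswith(":latest")

write = {"kind": "Pod", "image": "my-app:1.4.2"}
print(admit(write, strict_webhook))                  # True
print(admit(write, unreachable_webhook, "Fail"))     # False
print(admit(write, unreachable_webhook, "Ignore"))   # True
```

The trade-off is visible in the last line: Ignore keeps the cluster writable during a webhook outage, at the price of letting unchecked writes through.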

Gotcha 10: latest tag and ImagePullPolicy: Always

Using image: my-app:latest means imagePullPolicy defaults to Always, so every container start contacts the registry (and pulls if the tag now points at a different image). This introduces a hard dependency on registry availability for every restart. A registry outage during a deployment becomes a cluster-wide outage as pods get evicted and can’t restart. Worse, “latest” is mutable: you can have different replicas of the “same” Deployment running different builds, depending on when each pulled.

Why it bites: combining mutable tags with always-pull gives you maximum freshness at the cost of maximum brittleness.

How to handle: use immutable tags. Tag images with the commit SHA or a semver. imagePullPolicy: IfNotPresent is then safe and faster. The two seconds you save by typing :latest are not worth it. Production builds should never reference :latest.
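The defaulting rule for an omitted imagePullPolicy can be sketched as follows. The image names are hypothetical, and the parsing is a simplification that only looks for a digest and a tag on the last path component.

```python
def default_image_pull_policy(image):
    """Policy Kubernetes picks when imagePullPolicy is omitted (simplified parsing)."""
    if "@" in image:
        return "IfNotPresent"               # pinned by digest
    last_component = image.rsplit("/", 1)[-1]
    # Only a colon in the last path component is a tag; an earlier one is a registry port.
    tag = last_component.split(":", 1)[1] if ":" in last_component else ""
    return "Always" if tag in ("", "latest") else "IfNotPresent"

for image in (
    "my-app:latest",
    "my-app",
    "my-app:1.4.2",
    "registry.example.com:5000/my-app",
    "my-app@sha256:0123abcd",
):
    print(image, "->", default_image_pull_policy(image))
```

An image with no tag at all behaves like :latest, so leaving the tag off does not avoid the problem.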


8. The Judgment Calls

This is where experienced and beginner engineers visibly diverge. Each call has multiple defensible answers; the experienced ones know what they’re optimizing for.

Call 1: Managed Kubernetes vs. self-managed

  • Managed (EKS/GKE/AKS): the cloud provider runs the control plane. You only manage worker nodes. Costs ~$70–$150/month per cluster for the control plane itself, plus your worker costs. Upgrades are mostly handled. etcd backups, certificate rotation, API server HA — handled.
  • Self-managed (kubeadm, Rancher, kops): you run everything. Full control, no per-cluster fee, but you’ve signed up for being on-call for etcd.

What experienced engineers do: managed. Always. Unless you have a compliance reason that mandates on-prem and no managed offering exists, the cost of self-managing your control plane is dramatically larger than the cluster fee. Self-management is a $300k/year platform engineer commitment. The decision is “do we have a regulatory or air-gap requirement that forces this?” If no, it’s managed.

The signal: if you find yourself debating this, you don’t have a reason. The teams that have a reason know they have a reason within 30 seconds.

Call 2: One big cluster vs. many small clusters

  • One cluster: simpler operationally, easier service discovery, lower per-cluster overhead. But shared blast radius — a bad deploy can affect everyone.
  • Many clusters: isolation per team / per environment / per region. But every cluster needs upgrading, monitoring, secrets distribution, RBAC. The operational tax is roughly linear in cluster count.

What experienced engineers do: one cluster per environment per region. So prod-us-east, prod-eu-west, staging-us-east, dev. Not one cluster per team, not one cluster per service. The right axis to split on is failure domain (regions, environment) not organizational chart (teams).

The signal: if you’re tempted to give each team its own cluster “for isolation,” you’re solving an RBAC problem with a cluster. Use namespaces and RBAC. If you have actual regulatory or compliance reasons that one team’s data can’t share infrastructure with another’s, then clusters. Otherwise, namespaces.

Call 3: Should this stateful workload run in Kubernetes?

  • Yes: the operational tooling is in K8s, you already have a great operator (CockroachDB, Redis Operator, Strimzi for Kafka), you want consistent ops.
  • No: use the cloud’s managed service (RDS, Cloud SQL, Memorystore). Less flexibility, but the most painful failure modes (split brain, data corruption, slow recovery) are someone else’s problem.

What experienced engineers do: the database is the last thing to go into Kubernetes, not the first. For your primary OLTP database — the one whose corruption ends the company — use the managed service. RDS-style services have many operational dimensions (failover, backup verification, point-in-time restore, replica lag monitoring) that operators can do but rarely match. Run Kafka, Redis, search indices in K8s freely. Run primary Postgres there only if you have a dedicated team that wants to and a strong reason. The default answer is “no” and the burden of proof is on yes.

The signal: “we want everything in K8s for consistency” is not a strong reason. “Our database team wants to use the operator because it gives them per-tenant isolation that RDS doesn’t” is a strong reason.

Call 4: Helm vs. Kustomize vs. raw YAML vs. CDK8s

  • Helm: template-based. Massive ecosystem of charts. Templates are Go templates inside YAML — fragile, hard to debug.
  • Kustomize: overlay-based. No templating language. Built into kubectl. Composes nicely. Less expressive than Helm.
  • Raw YAML: no abstraction at all. Fine for small projects, doesn’t scale across environments.
  • CDK8s / Pulumi / cdktf: code (TypeScript, Python) that emits YAML. Best ergonomics for complex cases, biggest learning curve.

What experienced engineers do: Kustomize for your own apps, Helm for installing third-party charts. This is the practical answer. Your own app’s manifests live in Kustomize because it’s simpler, your platform manifests (Prometheus, Cert-Manager, ingress controllers) come as Helm charts, and you don’t try to convert them. CDK8s is great in theory; in practice, the “everyone reads YAML” property of plain Kustomize is worth a lot.

The signal: if you’re considering Helm for your own internal app, ask whether you actually need templating. Most internal apps don’t have enough variability across environments to warrant a templating language.

Call 5: Set CPU limits or not?

  • Set CPU limits: predictable resource accounting, “fair” usage, what every tutorial says.
  • Don’t set CPU limits: avoid CFS throttling, much better tail latency, but you can starve other pods if a runaway hits.

What experienced engineers do: for latency-sensitive services, set requests but not limits. Memory limits stay (memory has no graceful throttle). CPU limits go. The argument is empirical: CFS throttling is real, painful, and observable in P99. The risk it protects against (one pod hogging CPU) is largely solved by setting accurate requests, which the kernel enforces as proportional CPU shares whenever the node is contended.

For batch jobs, CronJobs, and anything where occasional throttling is fine, keep CPU limits.

The signal: if you’ve ever debugged a service whose P99 latency was inexplicable while CPU usage looked low, look at container_cpu_cfs_throttled_seconds_total. If you see throttling, removing the CPU limit will probably fix it.
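The throttling metrics are derived from the kernel's per-cgroup CPU counters, which you can also read directly from a container's cpu.stat file. A minimal sketch that turns those counters into a throttle ratio; the sample text is made up for illustration and uses the cgroup v2 field names.

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

# Made-up sample in the shape of a cgroup v2 cpu.stat file.
sample = """\
nr_periods 2000
nr_throttled 500
throttled_usec 41000000
"""

print(throttle_ratio(sample))   # 0.25
```

A ratio of 0.25 means the container was paused in a quarter of its scheduling windows, which is more than enough to show up in P99 even when average CPU usage looks modest.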

Call 6: Service mesh or not?

  • Service mesh (Istio, Linkerd, Cilium service mesh): mTLS between all services, retries and circuit breaking without app changes, fine-grained traffic shifting (canary, A/B). Cost: a sidecar per pod (or eBPF), substantial control-plane operation, real CPU/memory overhead, real latency overhead (1–3ms per hop, more if misconfigured).
  • No mesh: retries and timeouts in your apps’ HTTP clients, mTLS terminated at ingress only, simpler ops.

What experienced engineers do: start without a mesh. Add one when you have a concrete need that you can articulate without saying “best practice.” Concrete needs include: zero-trust mTLS as a regulatory requirement; multi-team retry/timeout policies you can’t enforce in client code; advanced traffic shifting for safer rollouts. If the answer is “we want observability,” install Prometheus first — meshes are an expensive way to get metrics.

If you do choose a mesh, prefer Linkerd over Istio for most teams. Linkerd is dramatically simpler and more focused. Istio is more capable but is its own platform-engineering team’s worth of work. Cilium’s eBPF-based mesh is the new entrant worth watching.

The signal: if your company has 5 services, you don’t need a mesh. If it has 500, you probably do. The crossover is somewhere around 30–50, and at that point the case is usually obvious.

Call 7: When to write an Operator (CRD + custom controller)

  • Write an Operator: you have domain logic that involves coordinating multiple Kubernetes resources, reacting to lifecycle events, running scheduled tasks. You want users to interact with high-level abstractions (MyDatabase) instead of the underlying primitives.
  • Don’t write an Operator: you can express your need as a Helm chart with some scripts. Or as a CronJob. Or a sidecar.

What experienced engineers do: writing an Operator is significantly harder than it looks. Idempotency is hard. Race conditions are hard. State convergence under partial failures is very hard. Most internal “operators” should be a Helm chart and a runbook. Write an Operator when: (a) you’re building a platform for many teams that need to provision your domain object themselves, (b) the lifecycle has multiple steps that need to be re-tried independently, (c) you need to react to changes in cluster state (scale events, pod failures) in domain-specific ways. Otherwise, don’t.

The signal: if your operator’s reconcile loop is mostly a switch on phase strings, you’re rebuilding a workflow engine in K8s controllers, badly. Use Argo Workflows or a dedicated workflow tool.

Call 8: HPA, VPA, or both?

  • HPA (horizontal): more pods when load goes up. Standard for stateless web tier.
  • VPA (vertical): bigger pods (more requests/limits) when usage goes up. Useful when load is a function of user count and not RPS.
  • Both: in theory, scale up first by adding pods, fall back to bigger pods if you can’t get more (e.g., per-user latency spikes). In practice, they fight unless carefully configured.

What experienced engineers do: HPA for any service with horizontally scalable load. VPA in recommendation-only mode for everything — it gives you a continuous “you should set requests to X” signal that’s better than guessing. VPA in auto-update mode only for very specific batch workloads. Avoid both managing the same metric on the same workload.

The signal: if your load is roughly proportional to RPS, HPA. If it’s roughly proportional to “how much state per pod,” consider VPA but verify it doesn’t break readiness probes during resize.
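For the HPA side, the core scaling rule is a single ratio. A simplified sketch of the documented formula with its default 10% tolerance; the real controller also handles missing metrics, unready pods, min/max bounds, and stabilization windows.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """ceil(currentReplicas * currentMetric / targetMetric), skipped within tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas         # close enough to target: do nothing
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, current_metric=200, target_metric=100))   # 8
print(desired_replicas(4, current_metric=105, target_metric=100))   # 4
print(desired_replicas(4, current_metric=50, target_metric=100))    # 2
```

The rule only works when adding pods actually lowers the per-pod metric, which is the “load proportional to RPS” condition above.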

Call 9: Default-deny NetworkPolicies or default-allow?

  • Default-allow (Kubernetes’ default): every pod can talk to every other pod. Simple, productive, completely unsegmented.
  • Default-deny in every namespace, then explicit allows: the zero-trust model, Section-9-friendly, much harder to develop in.

What experienced engineers do: default-deny in production, default-allow in dev. This is one of the few “best practices” that genuinely earns its weight, especially after any kind of compliance scrutiny. Lateral movement after a single pod compromise is the difference between “one service breached” and “everyone breached.” NetworkPolicies are not optional in regulated environments.

The catch: NetworkPolicies require a CNI that supports them (Calico, Cilium, Weave — not default Flannel). Plan this when you pick the CNI, not after.

The signal: if you can’t articulate “what’s allowed to talk to my database pod?” within 30 seconds, you don’t have meaningful network segmentation, and you should.

Call 10: GitOps (ArgoCD/Flux) or imperative CI/CD?

  • GitOps: the cluster’s state is a function of a git branch. Changes are PRs. The cluster pulls state from git; nothing pushes.
  • Imperative CI/CD: Jenkins/GitHub Actions runs kubectl apply after a push. Simpler to set up, looser audit trail.

What experienced engineers do: GitOps once you’re past 5 services or 2 environments. The discipline of “the cluster is what’s in git, period” is enormously valuable for incident recovery, audit, and onboarding. ArgoCD is the dominant choice; Flux is a fine alternative. The main gotcha is configuring it not to fight against operators that mutate their own resources (mark those fields as ignored in the sync policy).

The signal: if you’re tempted to debug production by kubectl edit on a live object and you’re using GitOps, you have a process problem, not a tooling problem.

Call 11: How small to make your microservices

  • Big (modular monolith): simple deploy, no inter-service network, hard to scale teams.
  • Small (many services): team scaling, independent deploy. Operational overhead per service is real.

What experienced engineers do: Kubernetes makes it deceptively easy to add a new service — just another Deployment. So engineers do, until they have 50 services owned by 5 people. Now you have 50 deployment pipelines, 50 sets of dashboards, 50 things to monitor. Kubernetes does not solve microservice coordination overhead; it just hides the marginal cost.

The judgment: split a service when there’s a team boundary that needs separate deploy cadence, not when there’s a code module boundary. Many small modules in one Deployment is fine. Don’t split for purity.

The signal: if your “microservices” are all deployed by the same team in the same release, they’re not microservices, they’re modules with extra steps. Keep them in one process unless something is forcing the split.


9. The Commands and APIs That Actually Matter

This is the curated reference. Section 4 walked through the core flow; this is the cheat-sheet of what you actually reach for, organized by task. Each command is here because it earns its place — you’ll use these, not other ones.

Looking around

# Cluster context
kubectl config current-context             # which cluster am I on?
kubectl config use-context <ctx>           # switch
kubectl get nodes -o wide                  # see worker nodes, IPs, K8s version
kubectl cluster-info                       # control plane endpoint, key services

# Resources
kubectl get pods -n <ns>                   # the workhorse
kubectl get pods -A                        # across all namespaces (look for system issues)
kubectl get pods -l app=web                # filter by label — labels are how you slice
kubectl get all -n <ns>                    # everything in a namespace
kubectl get pods -o yaml                   # see the actual stored object
kubectl get pods --watch                   # live updates as state changes

kubectl get all is misleading — it doesn’t show ConfigMaps, Secrets, Ingress, or PVCs. To see “everything that matters” in a namespace: kubectl get all,cm,secret,ingress,pvc -n <ns>.

Debugging a sick pod

kubectl describe pod <name>                # the most useful command in K8s, full stop
kubectl logs <name>                        # current container logs
kubectl logs <name> --previous             # last instance's logs (for crashloops)
kubectl logs <name> -c <container>         # specific container in a multi-container pod
kubectl logs -f <name>                     # follow
kubectl logs --since=10m <name>            # time-bounded
kubectl exec -it <name> -- sh              # shell in (if the image has sh)
kubectl exec <name> -- env                 # see env vars without an interactive shell

kubectl describe is an information firehose. The sections that matter, in order: Events (at the bottom — read these first), State / Last State (current and previous container state, with exit code), Conditions (Ready, Initialized, PodScheduled), Containers (image, command, mounts, probes), Volumes.

Exit codes are diagnostic gold. Anything above 128 means the process died from a signal (128 + signal number, so 137 = 128 + 9):

  • 0 — clean exit (your container thinks it’s done)
  • 1 — generic application error (read the logs)
  • 126 — permission denied (security context issue)
  • 127 — command not found (image issue)
  • 137 — SIGKILL (almost always OOMKilled, but check Reason)
  • 139 — segfault
  • 143 — SIGTERM (graceful shutdown)

Debugging the unschedulable pod

kubectl describe pod <name>                 # look at Events for "FailedScheduling"
kubectl get events -A --sort-by=.lastTimestamp  # cluster-wide event stream (-A covers all namespaces)
kubectl describe nodes                      # look at Allocatable vs Allocated
kubectl top nodes                           # current usage
kubectl top pods --containers               # per-container usage

The events tell you exactly why scheduling failed: “0/4 nodes are available: 4 Insufficient memory” or “1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }” (older clusters use node-role.kubernetes.io/master for the same taint).

Networking debug

kubectl get svc -n <ns>                                    # services and their ClusterIPs
kubectl get endpoints <svc>                                # the actual pods behind a service
kubectl get endpointslices -l kubernetes.io/service-name=<svc>  # the modern view
kubectl run -it --rm debug --image=nicolaka/netshoot -- bash    # network-tools-rich pod for testing
# inside that pod:
nslookup my-svc.my-ns.svc.cluster.local
curl my-svc.my-ns:80
nc -zv my-svc 80

netshoot is the canonical “network debugging pod.” It has dig, nslookup, curl, tcpdump, mtr, iperf — basically every tool you’d want, in a container.

If kubectl get endpoints <svc> shows zero endpoints: the Service’s selector matches no Ready pods. Either no pods match the labels or none are passing readiness. This is the #1 root cause of “my service isn’t reachable.”

Deployments and rollouts

kubectl rollout status deploy/<name>       # watch a rolling update
kubectl rollout history deploy/<name>      # revision history
kubectl rollout undo deploy/<name>         # roll back to previous
kubectl rollout undo deploy/<name> --to-revision=3
kubectl rollout pause deploy/<name>        # freeze partway through
kubectl rollout resume deploy/<name>
kubectl rollout restart deploy/<name>      # force a rolling restart (useful after secret/CM change)

kubectl rollout restart is the cleanest way to pick up a changed Secret or ConfigMap without changing the Deployment’s pod template. (Kubernetes does not auto-restart pods when their mounted ConfigMap changes; the file in the volume updates eventually, but env-var-injected values are baked at start.)
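The difference between the two consumption styles is easier to see side by side. This is a hypothetical Deployment fragment (the ConfigMap name app-config and the mount path are assumptions), not a complete manifest.

```yaml
# Fragment of a Deployment: the same ConfigMap consumed two ways.
spec:
  template:
    spec:
      containers:
      - name: app
        image: myorg/myapp:1.4.7
        envFrom:
        - configMapRef:
            name: app-config     # read once at container start; changes need a restart
        volumeMounts:
        - name: config
          mountPath: /etc/app    # files here are refreshed by the kubelet after a delay
      volumes:
      - name: config
        configMap:
          name: app-config
```

Even with the volume form, the application has to re-read the file to notice the change, and subPath mounts do not receive updates at all, which is why a rolling restart is usually the simplest consistent answer.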

Editing live state

kubectl apply -f manifest.yaml             # the right way
kubectl edit deploy/<name>                 # interactive edit, last resort
kubectl scale deploy/<name> --replicas=5   # quick scaling
kubectl set image deploy/<name> nginx=nginx:1.28  # quick image bump (CI-friendly)
kubectl annotate <obj> key=value           # tag for your own purposes
kubectl label pod <name> app-              # remove a label (note the trailing minus)

kubectl edit opens $EDITOR on the live object’s YAML. You’re editing the live state, not a file. Save and quit applies. If you typo, the edit is rejected and you re-edit. Useful but easy to misuse — it bypasses your YAML in git.

Port-forwarding and proxying

kubectl port-forward pod/<name> 8080:80    # localhost:8080 → pod:80
kubectl port-forward svc/<name> 8080:80    # forward via Service (picks one backing pod; no load balancing)
kubectl proxy                              # local proxy to API server (auth handled)

Port-forwarding is fragile (single TCP connection through kubectl). Fine for debugging, never for production traffic.

kubectl debug — the modern debugging tool

# Add a debugging container to a running pod
kubectl debug -it <pod> --image=busybox --target=<container>

# Create a copy of a pod with a debug container added (use --set-image to swap a broken container's image)
kubectl debug <pod> -it --copy-to=<pod>-debug --image=ubuntu

# Debug a node by running a pod with host filesystem access
kubectl debug node/<name> -it --image=ubuntu

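kubectl debug first shipped as an alpha command in Kubernetes 1.18, and the ephemeral containers it relies on have been stable since 1.25. Before that, debugging a running pod meant modifying the pod spec. Now you attach an ephemeral container: it shares the pod’s network namespace and, with --target, the target container’s process namespace, and it can carry whatever tools you need. Critical for debugging distroless or scratch images.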

YAML output and templating

kubectl get pod <name> -o yaml             # full YAML
kubectl get pod <name> -o json | jq .      # JSON for scripting
kubectl get pods -o jsonpath='{.items[*].metadata.name}'  # extract specific fields
kubectl get pods -o custom-columns=NAME:.metadata.name,IP:.status.podIP

Generating YAML scaffolding without typing it:

kubectl create deployment web --image=nginx --dry-run=client -o yaml > deployment.yaml
kubectl create configmap config --from-file=app.conf --dry-run=client -o yaml
kubectl create secret generic db --from-literal=password=foo --dry-run=client -o yaml

The --dry-run=client -o yaml pattern is the fastest way to get a valid skeleton. Edit from there.

Useful aliases that pay back instantly

alias k=kubectl
alias kgp='kubectl get pods'
alias kgs='kubectl get svc'
alias kdp='kubectl describe pod'
alias klf='kubectl logs -f'
alias kx='kubectl exec -it'

Plus kubens and kubectx (separate tools — brew install kubectx) for fast namespace and context switching. These pay back the install cost in the first hour.


10. How It Breaks

Production Kubernetes breaks in characteristic ways. Knowing the failure modes saves you from rediscovering them at 3am.

The pod is Pending forever

Symptoms: kubectl get pods shows Pending for minutes.

Diagnosis: kubectl describe pod <name> and look at Events.

Common root causes:

  • “0/N nodes are available: N Insufficient cpu/memory” — your requests don’t fit. Either over-requested, or all nodes are full. Check kubectl describe nodes for Allocatable vs Allocated.
  • “0/N nodes are available: untolerated taint” means every node carries a taint the pod doesn’t tolerate. Common with control-plane taints or GPU node taints.
  • “PVC is not bound” means a volume claim can’t find a matching PV. Check that the StorageClass exists and that its provisioner is working.
  • “node selector doesn’t match” — nodeSelector is too strict.

Fix: depends on cause. For resource fit, either reduce requests, scale the cluster, or accept that Cluster Autoscaler is taking 60–120 seconds to provision new nodes.
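For the taint and nodeSelector cases, the fix lives in the pod spec. A minimal sketch, assuming a GPU node pool labelled pool=gpu and tainted with the key nvidia.com/gpu (both are assumptions about your cluster, as are the image and sizes):

```yaml
# Hypothetical pod spec fragment for a tainted GPU node pool.
spec:
  nodeSelector:
    pool: gpu                    # assumed node label; must match at least one node
  tolerations:
  - key: nvidia.com/gpu          # assumed taint key on the GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: trainer
    image: myorg/trainer:2.1.0   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi              # must fit within some node's Allocatable
```

A toleration only permits scheduling onto the tainted nodes; it is the nodeSelector (or node affinity) that steers the pod there.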

CrashLoopBackOff

Symptoms: pod cycles between Running and Waiting; restart count climbs; the status column shows CrashLoopBackOff.

Diagnosis:

  1. kubectl describe pod <name> — look at exit code, Last State, Events
  2. kubectl logs <name> --previous — the previous instance’s logs, before the crash. This is the single most important command for crashloops; without --previous you get either nothing or the just-restarted container’s logs.

Common root causes:

  • Application bug (read the logs)
  • OOMKilled (exit 137) — bump memory limits or fix the leak
  • Missing config/secret — environment variable not set, file not mounted
  • Database not reachable at startup: add an init container that waits, or check the dependency in the readiness probe instead of exiting or failing liveness
  • Liveness probe is too aggressive — extend initialDelaySeconds or add a startupProbe

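Fix: the restart backoff is exponential (10s, 20s, 40s, …) and capped at 5 minutes between restarts. Once you fix the underlying issue, the next restart succeeds and the state clears; the backoff timer resets after the container has run cleanly for 10 minutes.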
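For the “liveness probe is too aggressive” case, one common shape is to let a startup probe absorb the slow boot and keep liveness simple. The paths, port, and timings below are assumptions for illustration, and the block is a fragment of a pod spec rather than a full manifest.

```yaml
# Fragment of a pod spec; paths, port and timings are assumptions.
containers:
- name: app
  image: myorg/myapp:1.4.7
  startupProbe:                  # liveness and readiness are held off until this passes
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 30         # up to 30 * 10s = 300s to finish booting
  livenessProbe:
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 3          # restart after about 30s of consecutive failures
```

This avoids stretching initialDelaySeconds to cover the worst-case boot, which would also delay detection of a genuinely hung container later on.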

ImagePullBackOff / ErrImagePull

Symptoms: pod is Waiting with status ImagePullBackOff.

Diagnosis: kubectl describe pod <name> — Events show the actual pull error.

Common root causes:

  • Image name typo (most common)
  • Tag doesn’t exist
  • Private registry without imagePullSecrets
  • Registry rate limiting (Docker Hub! It will rate-limit anonymous pulls)
  • Network blocking egress to the registry

Fix: for private registries, create a Secret of type kubernetes.io/dockerconfigjson and reference it in the pod’s imagePullSecrets. For Docker Hub rate limits, mirror through ECR or your cloud’s container registry.
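The private-registry fix comes down to two objects. In this sketch the names, namespace, and registry host are placeholders, and the data value is deliberately left as a marker rather than a real credential.

```yaml
# Registry credentials plus a pod that uses them (all names are hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: regcred
  namespace: prod
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: REPLACE_WITH_BASE64_DOCKER_CONFIG   # base64 of a Docker config.json
---
apiVersion: v1
kind: Pod
metadata:
  name: private-app
  namespace: prod
spec:
  imagePullSecrets:
  - name: regcred                # must live in the same namespace as the pod
  containers:
  - name: app
    image: registry.example.com/myorg/myapp:1.4.7   # placeholder registry
```

In practice the Secret is usually generated with kubectl create secret docker-registry rather than written by hand.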

OOMKilled

Symptoms: Last State Reason: OOMKilled, exit code 137.

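Diagnosis: the kernel OOM killer terminated the container, either because it exceeded its own memory limit or because the node itself ran out of memory and the kernel picked this container as the victim. (Kubelet node-pressure eviction is a different path: those pods show Evicted, not OOMKilled.)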

Common root causes:

  • Limit too low for actual working set
  • Memory leak in the application
  • Sudden traffic spike causing burst memory growth (tail-end caching, request batching)
  • JVM heap not sized to fit container limit (JVMs older than JDK 10 or 8u191 size the heap from host memory rather than the cgroup limit)

Fix: profile actual memory use over time (24+ hours under realistic load). Set request and limit both around P95 + 30–50% headroom. For JVM apps, set -XX:MaxRAMPercentage=75.0 so the heap respects the container limit.
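As a rough illustration of the JVM advice, the fragment below passes the flag through JAVA_TOOL_OPTIONS, an environment variable the JVM reads at startup. The image name and the 1Gi figure are assumptions; substitute your own measured numbers.

```yaml
# Fragment of a pod spec for a JVM service; sizes are placeholders.
containers:
- name: api
  image: myorg/java-api:3.2.0    # placeholder image
  env:
  - name: JAVA_TOOL_OPTIONS      # picked up by the JVM without changing the entrypoint
    value: "-XX:MaxRAMPercentage=75.0"
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 1Gi                # heap is capped near 768Mi (75% of 1Gi)
```

The remaining quarter is not slack: metaspace, thread stacks, and direct buffers all live outside the heap and count toward the same container limit.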

Service has no endpoints

Symptoms: kubectl get endpoints <svc> returns an empty list. curl my-svc is refused or times out.

Diagnosis:

  1. Does the Service’s selector match any pods? kubectl get pods -l <selector>
  2. Are those pods Ready? Check READY column in kubectl get pods.
  3. Are the pods listening on the port the Service targets? kubectl exec -it <pod> -- netstat -tlnp or just check the app’s logs.

Common root causes:

  • Pod selector mismatch (you typo’d a label)
  • Pods are running but failing readiness probe — most common
  • Pod is listening on 127.0.0.1 instead of 0.0.0.0 (app bound to localhost only, so probes and Service traffic to the pod IP fail)
  • targetPort doesn’t match the port the container actually listens on

Fix: find which it is and fix it. If readiness is failing, look at the readiness probe config and the pod’s /health (or whatever) endpoint behavior.
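The three things that have to line up can be read off a Service and its Deployment together. This is a minimal sketch with hypothetical names, image, and ports; the numbered comments mark the pairs that must agree.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc
spec:
  selector:
    app: web                     # (1) must equal the pod template labels
  ports:
  - port: 80                     # what clients connect to
    targetPort: 8080             # (2) must be the port the app listens on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web                 # (1)
    spec:
      containers:
      - name: web
        image: myorg/web:1.0.0   # placeholder image
        ports:
        - containerPort: 8080    # (2); the app must bind 0.0.0.0, not 127.0.0.1
        readinessProbe:          # (3) pods are listed as endpoints only while this passes
          httpGet: { path: /ready, port: 8080 }
```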

DNS resolution intermittently fails

Symptoms: apps occasionally throw “name or service not known” errors. CoreDNS metrics show errors. P99 latency is wildly higher than P50.

Diagnosis:

  • kubectl logs -n kube-system -l k8s-app=kube-dns — look at CoreDNS errors
  • kubectl top pods -n kube-system — is CoreDNS CPU-saturated?
  • Check the ndots setting in the affected pod’s /etc/resolv.conf

Common root causes:

  • CoreDNS replica count too low for your QPS
  • ndots: 5 multiplying DNS load on external names (each lookup walks the cluster search domains before trying the name as written)
  • Conntrack races on high-DNS-QPS nodes
  • Alpine musl libc’s parallel A+AAAA query confusion

Fix: install NodeLocal DNSCache. Set ndots: 2 in pod dnsConfig for pods that talk to external services heavily. Scale CoreDNS to (number_of_nodes / 4) or so as a starting point.
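The ndots change is a per-pod setting. A minimal sketch, with a hypothetical pod name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: external-caller          # hypothetical
spec:
  dnsPolicy: ClusterFirst        # the default; dnsConfig options are merged on top
  dnsConfig:
    options:
    - name: ndots
      value: "2"                 # must be a string
  containers:
  - name: app
    image: myorg/myapp:1.4.7
```

With this in place a name like api.example.com (two dots) is tried as written first, while short in-cluster names such as my-svc or my-svc.my-ns still go through the search path and resolve as before.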

Node NotReady

Symptoms: kubectl get nodes shows a node as NotReady. Pods on it stay Running but can’t be reached.

Diagnosis:

  • kubectl describe node <name> — the Conditions tell you what’s wrong (MemoryPressure, DiskPressure, NetworkUnavailable, kubelet not posting status)
  • SSH to the node, check systemctl status kubelet and journalctl for kubelet logs

Common root causes:

  • kubelet crashed or hung
  • Disk full (container images, logs, or emptyDir data filling the disk behind /var/lib/kubelet or the runtime’s image store)
  • CNI plugin failed (pod networking is dead)
  • Node is genuinely down (hardware/VM failure)
  • PLEG (Pod Lifecycle Event Generator) errors — kubelet timing out talking to container runtime, often caused by too many pods or runtime issues

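Fix: if managed K8s, the cloud provider will replace the node automatically (eventually). If self-managed, restart kubelet, fix disk, debug CNI. After 5 minutes of NotReady, pods on the node are evicted so their controllers can recreate them elsewhere. On current versions this is taint-based: the node gets node.kubernetes.io/not-ready or node.kubernetes.io/unreachable NoExecute taints, and every pod gets a default 300-second toleration that you can change per pod with tolerationSeconds. The old --pod-eviction-timeout flag on the controller manager no longer controls this.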

The cluster-wide DNS or networking outage

Symptoms: lots of pods report can’t-reach errors at once. CoreDNS is down or unreachable. Cascade.

Diagnosis: check kube-system pods. Often this is CoreDNS pods being killed by an aggressive eviction or by a node outage that took out the only CoreDNS replicas.

The Monzo case is instructive: a deployment caused payment processing failures; a Linkerd-Kubernetes-etcd interaction caused new Linkerd instances to fail to receive network updates; the result was an empty endpoint list for critical services, which from an application’s perspective looks like “the database doesn’t exist.” The lesson: control-plane components and service mesh control planes are critical infrastructure with the same blast radius as your app’s database. Treat them with that respect.

General debugging workflow

When something is wrong and you don’t know what:

  1. kubectl get pods -A | grep -v Running — anything weird at all?
  2. kubectl get events --sort-by=.lastTimestamp -A | tail -50 — what’s just happened?
  3. kubectl describe <whatever-looks-broken> — read the Events section
  4. kubectl logs <pod> --previous if there’s been a restart
  5. kubectl top nodes and kubectl top pods — is the cluster under resource pressure?
  6. Check kube-system: kubectl get pods -n kube-system — is the control plane itself OK?

The ratio of (issues caused by application bugs) to (issues caused by Kubernetes itself) in mature clusters is roughly 90:10. Most “Kubernetes is broken” reports are actually “my app is broken in a way Kubernetes is faithfully showing me.”


11. The Downsides / Disadvantages

This is the section the marketing pages don’t write. None of these go away with experience or better tooling. They are the price of admission.

Downside 1: The cognitive load is permanent

Kubernetes introduces about 30 first-class concepts (pods, deployments, services, ingresses, PVCs, ConfigMaps, Secrets, namespaces, RBAC roles, ServiceAccounts, NetworkPolicies, HPAs, PDBs, …) that have no analog in single-VM thinking. Every team member must hold these in their head, every time. There is no point at which you “graduate” from needing to know them.

Where it comes from: the abstraction is genuinely deep. Kubernetes is rebuilding much of what an OS does (scheduling, networking, storage management, identity) at the cluster level. The concept count is justified by the model’s expressiveness — but you still have to learn it.

What it costs you: new hire onboarding time goes from “deploy your first PR in a week” to “deploy your first PR in two weeks plus a Kubernetes course.” Your debugging mental stack is two layers deeper than before — when something breaks, “is this an app bug, a container bug, a Kubernetes bug, or a CNI bug?” is a real question. Hiring is harder; people with deep K8s knowledge command a salary premium that compounds across your platform team.

When this is a dealbreaker vs. livable: it’s a dealbreaker for teams under 10 engineers without a dedicated platform person. It’s livable starting around 30 engineers, where the per-team value outweighs the per-engineer cost. For very small teams shipping web apps, Heroku/Railway/Fly.io give you 80% of the value at 5% of the cognitive cost.

What people think mitigates it but doesn’t: “We’ll just use a managed service.” Managed K8s removes the control-plane operational burden — perhaps 20% of the total cognitive load. The remaining 80% (workload concepts, networking, storage, debugging) is unaffected. EKS is not “Heroku”; it’s “managed control plane plus all of K8s’ complexity.”

Downside 2: YAML, all the way down

The configuration language is YAML. YAML is whitespace-sensitive, has type-coercion footguns (an unquoted yes becomes true; an unquoted 12:34 can be read as a sexagesimal number), and offers zero help for refactoring. Real Kubernetes deployments are 200–2000 lines of YAML across 10–30 files. Helm adds Go templates inside YAML, doubling the parsing complexity. Errors are caught only at apply time, often in cryptic ways.
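The coercion footguns are easiest to appreciate in a small sample. The keys below are made up, and the comments describe YAML 1.1 resolution rules; individual parsers (including the ones behind Kubernetes tooling) do not all implement every rule, so treat this as a reason to quote rather than a guarantee of behavior.

```yaml
# How YAML 1.1 rules resolve unquoted scalars (parser support varies).
enabled: yes          # boolean true, not the string "yes"
country: NO           # boolean false, not the string "NO"
window: 12:34         # base-60 integer 754 (12 * 60 + 34)
version: 1.10         # float 1.1; the trailing zero is lost
# Quoting keeps all four as strings:
enabled_str: "yes"
country_str: "NO"
window_str: "12:34"
version_str: "1.10"
```

The usual defensive habit is to quote anything that is meant to be a string, especially values destined for environment variables, labels, and annotations.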

Where it comes from: Kubernetes’ API is JSON; YAML is the human-friendly equivalent. The choice was made early and is now load-bearing for the entire ecosystem. Tools like CDK8s try to escape it, but what they emit, and what you end up reviewing and debugging, is still YAML.

What it costs you: silently dropped fields you misspelled, indentation mistakes that take 20 minutes to find, four “different” config conventions (Helm-with-templates, Helm-without-templates, Kustomize, and raw YAML). Engineers spend an unreasonable amount of time on configuration plumbing.

When this is a dealbreaker: never quite a dealbreaker, but a permanent productivity tax that’s larger than people admit.

What people think mitigates it but doesn’t: linters, schema validators, language-aware editors. They help. They don’t fix that the modeling layer is text.

Downside 3: Networking is its own discipline

Kubernetes networking is not “the network,” but a stack: pod IPs (CNI), Service IPs (kube-proxy), DNS (CoreDNS), Ingress controllers, possibly a service mesh, possibly NetworkPolicies, possibly multi-cluster. Each layer is a complete system with its own failure modes. A networking issue in K8s is rarely “is the cable plugged in” and almost always “which of the seven layers is misbehaving.”

Where it comes from: the design choice to make every pod a first-class network endpoint — beautifully simple in concept — required reinventing routing, NAT, DNS, and load balancing inside the cluster. You’re running a small ISP.

What it costs you: networking debugging requires understanding kube-proxy modes (iptables vs IPVS vs nftables) and their eBPF-based replacements, CoreDNS query patterns, conntrack table behavior, and cloud provider load balancer quirks. The “it’s always DNS” meme exists because it really is, frequently, DNS, and DNS in K8s is its own subsystem with its own failure modes.

When this is a dealbreaker: for low-latency, high-QPS workloads where 1ms matters, the network stack overhead and tail latencies of K8s networking are real. Major operators (Uber, Lyft) have invested heavily in eBPF-based alternatives (Cilium) precisely because the default iptables kube-proxy hits walls at scale.

Downside 4: Stateful workloads remain a sharp edge

Kubernetes is brilliant at stateless workloads. It’s adequate for most stateful workloads. It is genuinely awkward for highly available, single-instance-per-shard stateful workloads — which is what databases are.

The friction surfaces include: persistent volumes that don’t follow pods across AZs; the difficulty of “evict this pod and reschedule it elsewhere with the same volume”; backups are external; failover requires an operator or external coordination; cluster autoscaler can deadlock with stateful pods (won’t remove a node that has a PV-bound pod that can’t move).

Where it comes from: Kubernetes’ core model assumes pods are interchangeable. StatefulSets bolt on stable identity, but the underlying network and storage layers don’t fully reflect it.

What it costs you: running your primary database in K8s is an experienced-engineer-team commitment. Operators help, but the operators you trust (PGO from CrunchyData, Strimzi for Kafka, CockroachDB’s operator) are themselves substantial pieces of software. The default pattern at most companies, for good reason, is “managed service for primary databases, K8s for everything else.”

When this is a dealbreaker: if your team can’t dedicate engineering to running databases, don’t run them in K8s. Use RDS/CloudSQL/managed-anything. The blast radius of getting it wrong is data loss.

Downside 5: API-server-as-bottleneck means scaling has a ceiling

The API server and etcd are the bottleneck of every K8s cluster. The supported ceiling sits around 5,000 nodes and 150,000 pods per cluster (Kubernetes’ published scalability targets; these are tested limits, not hard-coded ones). Real-world clusters get nervous at half that. When you hit the wall, you don’t get to scale up; you have to shard into multiple clusters, which is a totally different operational discipline (federation, multi-cluster service mesh, cross-cluster IAM).

Where it comes from: etcd is consensus-based and writes are serialized through Raft. Every write is global. There is no horizontal scaling path for etcd’s write throughput.

What it costs you: at large scale (thousands of nodes), every controller’s reconcile rate matters. Custom controllers with sloppy reconcilers can DoS the API server. Operators that watch every pod across the cluster can be expensive. A manifest copied from a blog post, a careless CRD, or a bad operator can take down the cluster.

When this is a dealbreaker: for the largest tech companies (Google’s internal Borg, Meta’s Twine), Kubernetes was never going to scale enough. They use their own systems. For the rest of us, the wall is far enough out that we’ll never hit it. But it’s there.

Downside 6: The ecosystem moves fast and adds, never subtracts

The Kubernetes API has a deprecation policy, but the ecosystem around it does not. CNI plugins, Ingress controllers, service meshes, secret management, GitOps tools, policy engines (OPA, Kyverno), observability stacks — there are five viable choices for each, and the answer to “which should we use?” changes every 18 months. Your cluster’s config becomes a fossil record of which tool was hot when each component was added.

Where it comes from: the project’s success spawned a CNCF-shaped graveyard of technologies, each with a community and momentum of its own. The K8s API surface keeps growing; Gateway API is replacing Ingress; sidecars are being replaced by ambient mesh; Pod Security Policies were replaced by Pod Security Admission; etc.

What it costs you: keeping up is a real ongoing investment. A team that was “current” two years ago is now behind. Migrations between similar tools (Ingress → Gateway, sidecar mesh → ambient) consume sprint capacity that produces no user-visible value.

When this is a dealbreaker: for teams that can’t afford ongoing platform investment, it’s a slow-motion dealbreaker. They fall behind, hiring becomes harder (“you don’t use X?”), and the gap to “current best practice” widens until a forced migration.

What people think mitigates it but doesn’t: “We’ll standardize on a stable subset.” Good intention, hard to enforce, because every new third-party tool you install (Prometheus operator, ArgoCD, etc.) drags in its opinions.

Downside 7: The “portability” promise is partially a mirage

The pitch: Kubernetes runs on any cloud or on-prem, so you’re not locked in. The reality: you are locked into Kubernetes and its ecosystem. Migrating between K8s clusters is much easier than not having K8s — true. But the dream of “lift our workload from EKS to GKE in an afternoon” hits real obstacles: cloud-specific load balancer annotations, IAM-bound ServiceAccounts, EKS’s VPC CNI vs. GKE’s defaults, AWS Secrets Manager integrations, S3-specific storage drivers, ALB-specific Ingress.

Where it comes from: the abstraction is at the K8s API level, not at the integration level with your cloud. The 80% that’s portable is the boring 80%. The 20% that’s not portable is what ties your workload to its cloud.

What it costs you: an actual cloud migration is months of work, even with K8s in the picture. The “K8s is portable” reasoning often shows up in arguments to adopt K8s but rarely shows up in arguments about which cloud to use — because the migration story is still painful.

When this is a dealbreaker: never quite — portability is real, just oversold. But teams that adopt K8s solely “for portability” have usually paid more than they’d save.

Downside 8: Failure modes are often quiet

Kubernetes’ self-healing is mostly a feature but occasionally a bug. A misconfigured Deployment will roll out badly, then the autoscaler tries to compensate, then nodes fill up, then evictions happen — and from a glance at the dashboard, your service appears “running, mostly.” Symptoms manifest as elevated error rates and tail latencies, not loud crashes. Quiet failures in K8s are easier to miss than loud failures in simpler systems.

Where it comes from: Kubernetes is designed to keep going — eviction rather than crash, throttling rather than killing, retry rather than fail-fast. This is the right default for resilience but the wrong default for clarity.

What it costs you: observability has to be more sophisticated. You need golden signals per-service, not just node-level metrics. You need to alert on “wrong number of pods Ready” not just “any pods Running.” Many incidents are a slow drift, not a sharp event.
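An alert on “wrong number of pods Ready” could be sketched as a Prometheus rule like the one below. It assumes kube-state-metrics is installed (that is where the two kube_deployment_* metrics come from); the group name, alert name, 10-minute window, and severity label are arbitrary choices.

```yaml
# Prometheus rule file sketch; assumes kube-state-metrics is being scraped.
groups:
- name: workload-availability
  rules:
  - alert: DeploymentReplicasMismatch
    expr: |
      kube_deployment_status_replicas_available
        < kube_deployment_spec_replicas
    for: 10m                     # ignore normal rollouts and brief restarts
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.deployment }} has fewer available replicas than desired"
```

The for: window is the design choice that matters: too short and every rollout pages someone, too long and the slow drift described above goes unnoticed.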

When this is a dealbreaker: for systems where degradation is worse than outage (e.g., financial systems where slow trades are worse than no trades), the “soft failure” default of K8s is dangerous and must be explicitly counteracted with monitoring and circuit breakers.

Downside 9: Multi-tenancy is pretend

You’ll hear that K8s namespaces give you multi-tenancy. They give you weak multi-tenancy: separate names, RBAC, quotas. Pods in different namespaces share the same kernel, same kubelet, same etcd, same scheduling fate. A noisy neighbor in another namespace can starve your pod. A privileged escape from any pod gives access to every pod on the same node. Real multi-tenancy at the cluster level (different actual customers, different security domains) requires hard separation: separate clusters, or at minimum technologies like virtual clusters (vcluster) and gVisor / Kata Containers.

Where it comes from: namespaces were designed for organizational separation within a single trusted operator, not for hostile multi-tenancy. The model has not been retrofitted.

What it costs you: if you want to host customer workloads in K8s, you can’t safely give them access to a shared cluster. You’re either spinning up a cluster per customer (expensive) or giving them strong VM-like isolation (defeats the lightness of containers).

Downside 10: There is no “done”

Kubernetes is not a thing you install and forget. A minor-version upgrade roughly every four months. CVE patches. Operator upgrades that drag in CRD migrations. The ecosystem moves; your cluster either moves with it or accumulates debt. A 3-year-old, “stable” cluster is usually a thicket of legacy CRDs, deprecated API versions, end-of-life ingress controllers, and out-of-support node images. Cluster maintenance is a continuous part of your team’s work, not a phase.

Where it comes from: the project’s release cadence (3 minor versions per year, ~12 month support window) and the ecosystem’s appetite for change.

What it costs you: somewhere between 5% and 15% of platform engineering time, indefinitely. This number is invisible at adoption time and very visible 3 years in.


12. The Taste Test

How do you tell at a glance whether someone’s Kubernetes is the work of an experienced engineer or a copy-paste tutorial graduate? Look for these.

Manifests

beginner:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1                # always 1
  selector:                  # required by apps/v1, or the manifest is rejected
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myorg/myapp:latest    # mutable tag
        # no resource requests/limits
        # no probes
        # no securityContext

experienced:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/version: "1.4.7"
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0       # never drop below desired
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
        app.kubernetes.io/version: "1.4.7"
    spec:
      containers:
      - name: app
        image: myorg/myapp:1.4.7@sha256:abc...   # immutable, digest-pinned
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            memory: 512Mi      # = request, no CPU limit
        readinessProbe:
          httpGet: { path: /ready, port: 8080 }
          periodSeconds: 5
        startupProbe:
          httpGet: { path: /healthz, port: 8080 }
          periodSeconds: 10      # the default, spelled out so the math below is visible
          failureThreshold: 30   # 30 * 10s = 5min for slow startup
        securityContext:
          runAsNonRoot: true
          runAsUser: 65532
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: myapp

The experienced version reveals: digest-pinned images, no replicas: 1 for production, maxUnavailable: 0 to keep availability during rolls, no liveness probe (deliberate — they’ve been bitten), startup probe instead, memory request equals limit, no CPU limit, security hardening, AZ spread.

Pod design

beginner: one pod has the app, the database, redis, nginx, and a cron job, all running. “It deploys together.”

experienced: one pod has the app and at most one tightly-coupled sidecar. Each component is its own Deployment. Things scaled differently are deployed differently.

Namespaces

beginner: everything in default namespace. Or one namespace per microservice.

experienced: namespaces map to something meaningful: teams, environments, blast-radius boundaries. Not one-per-service (that’s labels) and not all-in-default (that’s chaos).

Configuration

beginner: kubectl edit on prod to fix things; passwords in the Deployment YAML committed to git; ConfigMaps with one giant application.properties blob.

experienced: all config in git; secrets via External Secrets Operator pulling from Vault/AWS Secrets Manager; ConfigMaps split by concern (feature-flags, connection-pool, logging-config); rollouts via kubectl rollout restart after CM/Secret changes.

Service exposure

beginner: every service has type: LoadBalancer. Ten services, ten ELBs at $30/month each.

experienced: internal services are ClusterIP; external services go through one or two Ingress controllers (or a Gateway API gateway) that fan out by host/path. TLS terminated at the edge with cert-manager. One LB doing the work of ten.
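The “one LB doing the work of ten” pattern might look like the Ingress below. The hostname, Service names, the nginx ingress class, and the letsencrypt-prod ClusterIssuer are all assumptions about what is installed in the cluster.

```yaml
# One Ingress fanning out by path; TLS certificate requested via cert-manager.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: edge
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx        # assumed ingress controller
  tls:
  - hosts: [app.example.com]
    secretName: app-example-com-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api            # ClusterIP Service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web            # ClusterIP Service
            port:
              number: 80
```

Only the ingress controller’s own Service needs type: LoadBalancer; everything behind it stays ClusterIP.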

Health checks

beginner: liveness probe hits /health which queries the database. Or no probes at all.

experienced: readiness checks downstream dependencies; liveness checks only internal invariants (or doesn’t exist); startup probe handles slow boots; explicit terminationGracePeriodSeconds to let in-flight requests drain.
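The draining part of that list can be sketched as a pod spec fragment. The 45-second budget and 10-second sleep are illustrative numbers, and the preStop hook assumes the image ships a sleep binary.

```yaml
# Fragment of a pod spec showing a graceful-shutdown budget (values are assumptions).
spec:
  terminationGracePeriodSeconds: 45   # total budget: preStop hook plus SIGTERM handling
  containers:
  - name: app
    image: myorg/myapp:1.4.7
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]    # keep serving while endpoint removal propagates
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5
```

The sleep exists because endpoint removal and SIGTERM happen in parallel; without it, some clients are still routed to a pod that has already stopped accepting connections.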

Resource configuration

beginner: copy-pastes resource requests from a tutorial (100m/128Mi for everything).

experienced: requests reflect measured usage at P95 or P99, reviewed periodically; memory limit equals request; no CPU limit on user-facing services; PriorityClass for critical workloads.
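Sketched as manifests, that policy might look like this. The class name, priority value, and resource numbers are hypothetical placeholders; the real numbers should come from your own measurements.

```yaml
# Sketch only: names and numbers are hypothetical.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 100000                 # higher values are scheduled ahead of, and can preempt, lower ones
globalDefault: false
description: "User-facing services that should win over batch work."
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp-resources-demo
spec:
  priorityClassName: business-critical
  containers:
    - name: app
      image: registry.example.com/myapp:1.0.0
      resources:
        requests:
          cpu: 350m           # e.g. measured P95 CPU usage
          memory: 512Mi       # e.g. measured P99 memory usage
        limits:
          memory: 512Mi       # equals the request; the CPU limit is omitted on purpose
```

One consequence worth knowing: without a CPU limit the pod lands in the Burstable QoS class rather than Guaranteed, since Guaranteed requires both CPU and memory limits to equal their requests.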

CI/CD and deployment

beginner: kubectl apply -f from CI; “the YAML in the repo and the cluster might differ but mostly they’re the same.”

experienced: GitOps (ArgoCD or Flux); cluster state is defined by the git repo; manual kubectl apply is a procedural exception, not a pattern; PRs are the change-control mechanism.
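For the ArgoCD flavor, "cluster state is defined by the git repo" comes down to a manifest roughly like the one below. The repository URL, path, and namespaces are hypothetical, and it assumes Argo CD is installed in the argocd namespace.

```yaml
# Sketch only: repository URL, path, and namespaces are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd              # assumes Argo CD runs in this namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: apps/myapp/overlays/prod
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: myapp
  syncPolicy:
    automated:
      prune: true                # delete resources that were removed from git
      selfHeal: true             # revert changes made directly on the cluster
    syncOptions:
      - CreateNamespace=true
```

With selfHeal enabled, a manual kubectl edit on prod is reverted on the next reconciliation, which is what makes the pull request the only durable way to change the cluster.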

Observability

beginner: kubectl logs is the debugging tool. Production incidents are debugged the way you once debugged over SSH: kubectl exec into a pod and poke around.

experienced: centralized logs (Loki, ELK, Datadog) with structured logging; per-service Prometheus dashboards with golden signals (latency, traffic, errors, saturation); distributed tracing for cross-service requests; alerts on derivative signals (rate of change), not absolute thresholds.
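As an illustration of alerting on trend rather than level, here is a sketch of two Prometheus alerting rules. The metric names assume node_exporter and a standard client library are in use; the job label, windows, and thresholds are hypothetical.

```yaml
# Sketch only: thresholds, windows, and label values are hypothetical.
groups:
  - name: trend-alerts
    rules:
      - alert: DiskWillFillIn4Hours
        # Fit a line to the last 6h of free space and extrapolate 4h ahead.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is on track to fill within 4 hours"
      - alert: MemoryGrowingSteadily
        # deriv() gives the per-second slope of a gauge over the window.
        expr: deriv(process_resident_memory_bytes{job="myapp"}[30m]) > 100000
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "{{ $labels.instance }} memory has been growing by more than 100 kB/s for an hour"
```

A static "disk above 90%" rule fires on a volume that has sat at 91% for a year and stays silent on one that will go from 50% to full overnight. The trend-based rule gets both cases right.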

A simple smell test

Run kubectl get pods -A | grep -v Running | grep -v Completed. In a healthy cluster managed by people who know what they’re doing, this returns nothing but the header row, or one or two pods that are clearly transitioning. In a cluster that’s slowly going wrong, this returns a half-screen of CrashLoopBackOff, ImagePullBackOff, Pending, Error, Evicted, and nobody has noticed because nobody alerts on it.
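One way to turn the smell test into something that pages a human is a pair of Prometheus rules like the sketch below. It assumes kube-state-metrics is installed and scraped; the durations and severities are hypothetical.

```yaml
# Sketch only: assumes kube-state-metrics; durations and severities are hypothetical.
groups:
  - name: pod-smell-test
    rules:
      - alert: PodStuckNotRunning
        # Pods sitting in a phase other than Running or Succeeded (evicted pods show up as Failed).
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) > 0
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running phase for 15 minutes"
      - alert: ContainerStuckWaiting
        # CrashLoopBackOff and image pull failures are container waiting reasons, not pod phases.
        expr: sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff|ErrImagePull"}) > 0
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is stuck waiting"
```

The second rule is needed because a pod in CrashLoopBackOff often still reports the Running phase, so a phase-only alert would miss it.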


13. Where to Go Deeper

Curated. These are the resources I’d actually point a serious learner to.

  • The official documentation, specifically the Concepts section (https://kubernetes.io/docs/concepts/). The reference docs are reference; the Concepts pages are teaching. Read every page in Architecture, Workloads, Services, Storage. Skip the API reference unless you need it.

  • “Kubernetes the Hard Way” by Kelsey Hightower (https://github.com/kelseyhightower/kubernetes-the-hard-way). Walks you through bootstrapping a cluster from raw VMs by hand. You will not run a cluster this way in production, but doing it once teaches you what every component is, how they connect, and what the managed services hide. The two-day investment pays for years.

  • “Kubernetes: Up and Running” (Hightower, Burns, Beda) — the canonical book, currently in its third edition. Best linear path through the platform. Covers more than this document, less opinionated, and complements rather than replaces.

  • The kubernetes-failure-stories collection (https://k8s.af/). A curated list of real production postmortems. The single most valuable resource for understanding how K8s breaks at organizations that are not yours. Read at least 10 of these before running anything important.

  • “Programming Kubernetes” (Hausenblas, Schimanski) — for when you want to write controllers and operators. Goes deep into informers, work queues, the controller-runtime library. Don’t read until you’ve used Kubernetes for at least a year.

  • Tim Hockin’s talks: Tim has been a lead networking architect of Kubernetes since day one. His KubeCon talks on Services, networking, and the design philosophy are the highest-density technical content in the ecosystem. Search YouTube for “Tim Hockin Kubernetes.”

  • The CNCF Landscape and TAG Workshops (https://landscape.cncf.io/). The chaotic-looking landscape is actually a useful map of the ecosystem. Don’t try to learn it all; use it as a directory.

  • Hands-on: build a small operator with Kubebuilder (https://book.kubebuilder.io/). Pick a toy domain (e.g., Website CRD that creates a Deployment + Service + Ingress). Write the controller. You will learn more in 8 hours than from 80 hours of reading. The controller’s reconciliation loop is where the entire mental model finally clicks.

  • Cilium documentation, and Liz Rice’s eBPF talks — eBPF is the future of K8s networking. Cilium is the leading implementation. Even if you don’t switch CNIs, understanding eBPF-based networking is increasingly table stakes.

  • Production-grade reference repos — look at how serious open-source projects ship K8s manifests: Prometheus Operator’s bundle, Linkerd’s installer, Strimzi’s Kafka operator. Reading well-maintained YAML in the wild is how you learn what good looks like.


14. The Final Verdict

After all of that — the four mental-model bricks, the dozens of concepts, the post-mortems, the gotchas, the eight downsides — what do you actually believe about Kubernetes?

Here’s the honest take. Kubernetes is the right answer to a problem most companies don’t have, and they adopt it anyway, and somehow it still tends to work out. The problem K8s solves brilliantly is “I have hundreds of services across hundreds of machines, run by dozens of teams, and need a substrate where workload abstractions, scheduling, networking, and rollouts are unified and declarative.” If that’s your problem, K8s is unmatched, and building anything comparable yourself would cost 10x more. If that’s not your problem (say you’re a 10-person company with five services running on six VMs), then K8s is overhead pretending to be infrastructure, and Heroku, Cloud Run, or another managed PaaS will let you ship 5x faster.

The reason adoption “still tends to work out” even for the wrong-fit cases is that the K8s ecosystem has absorbed enough talent that the scaffolding (managed control planes, ArgoCD, Helm charts for everything, off-the-shelf observability) is now genuinely better than the home-grown alternatives most teams would otherwise build. You pay the K8s tax and get something that, by the third year, is more correct, more debuggable, and easier to hire for than what you’d have built yourself. The first year is brutal. The third year is fine.

What it gets profoundly right. Three things. First, the desired-state-plus-controllers pattern. This is genuinely a contribution to how we think about distributed systems. Once you’ve internalized “describe the world; let the loops fight reality,” many other systems start to look badly designed by comparison. The Operator pattern’s success — extending K8s to manage anything, from databases to TLS certs to GitHub repos — is downstream of this idea being correct. Second, the API surface is genuinely well-designed at the resource level. Pods, Services, Deployments, ConfigMaps — these are the right abstractions for what they’re abstracting. Most extensions of the API since 2015 have fit cleanly into this model, which is rare for any platform that grew this fast. Third, label-based composition. Things relate to each other not by IDs but by matching properties. This makes cluster state genuinely refactorable: you can re-label, re-namespace, re-organize, and the controller graph re-converges. Few systems give you this freedom.

What it gets wrong, or what it costs you. The downsides are not separable from the strengths — most of them are the shadows cast by the design choices that make K8s good at what it’s good at. The cognitive load is the price of the model’s expressiveness. The networking complexity is the price of giving every pod a real IP. YAML is the price of having a human-readable declarative interface to a JSON API. The endless ecosystem churn is the price of the project’s openness. You can’t have the upside without the downside, and people who pretend you can are selling something. The honest framing is: K8s is expensive, and what you get for the expense is the ability to build a serious platform on top of it.

Who should reach for this and who shouldn’t. Reach for K8s if: you have 30+ engineers and 10+ services that need to coexist; you have a platform/SRE team or are committed to building one; your workloads are genuinely heterogeneous (different scaling needs, different deploy cadences); you’ve outgrown a PaaS like Heroku and are paying for it in either money or capability. Don’t reach for K8s if: you’re a small team shipping one or two web apps; you don’t have or won’t build platform engineering capacity; you’re attracted to it because “everyone uses it.” Kubernetes is not a maturity badge. The most experienced engineers I know are the most willing to choose Cloud Run or App Engine or a single big VM when the workload doesn’t need K8s.

What you should now believe. Believe that the desired-state-plus-controllers model is a deep idea worth understanding even if you never run a Kubernetes cluster. Believe that managed K8s (EKS/GKE/AKS) is a different category of product from “self-managed K8s” and that the latter is for compliance edge cases only. Believe that the database is the last thing to put into Kubernetes, not the first, and that “all our infra in K8s for consistency” is a bad reason. Don’t believe that K8s makes microservices good — it removes some friction, which makes the coordination problem of microservices more visible, not less. Don’t believe the portability story is fully true — it’s directionally true, with substantial fine print. When you hear someone say K8s is “complex,” ask whether they mean essential complexity (the irreducible difficulty of orchestrating many services) or accidental complexity (the YAML and the ecosystem). The former is mostly inherent. The latter is where reasonable people disagree most.

The hard-won line, after everything: Kubernetes is a brilliant answer for the problem it solves, an expensive answer for problems it doesn’t, and the most common engineering mistake in 2026 is still confusing the two. Choose accordingly.


The ideas are mine. The writing is AI-assisted.