Postgres Deep Intuition
An experienced engineer's guide to Postgres
This document teaches Postgres the database engine — not SQL. SQL is the protocol you speak; Postgres is the machine you’re speaking to. The two are routinely confused because Postgres is the thing you typically psql into, but they’re separable: SQL is an ANSI standard implemented by dozens of databases, while Postgres is a specific piece of software with a specific architecture, specific tradeoffs, and specific failure modes that have nothing to do with the SQL standard.
If you’ve used Postgres for a while and never thought about why VACUUM exists, why you need a connection pooler, why a long-running transaction can balloon your disk, or why your UPDATE made the table bigger instead of leaving it the same size — this is for you.
1. One-Sentence Essence
Postgres is a process-per-connection relational database that achieves concurrency by keeping multiple versions of every row, and durability by writing every change to a sequential log before touching the actual data files.
That sentence has three load-bearing words — process-per-connection, multiple versions, and log — and the entire rest of this document is unpacking what each of them implies. If you internalize only that sentence, you’ll already predict the existence of VACUUM, the need for pgbouncer, the meaning of WAL, and why pg_repack is a thing.
2. The Problem It Solved
Postgres was born in 1986 at Berkeley as the successor to Ingres (the name is literally “Post-Ingres”). Michael Stonebraker wanted a database that wasn’t a closed black box of B-trees and locks — he wanted a database you could extend: add new data types, new operators, new index methods, new procedural languages. That’s the gene that gives modern Postgres its absurdly broad type system (jsonb, ranges, arrays, geometric types, custom domains), six index types out of the box plus extensible ones, server-side languages from Python to JavaScript, and the extension ecosystem (PostGIS, TimescaleDB, pgvector). When other databases say “we support JSON now,” they bolted it on. When Postgres adds a new type, it slots into an existing extensibility model that’s been there since the eighties.
The other formative choice was concurrency control. In 1996–2001, when Postgres added MVCC and transactional DDL, the dominant model was lock-based: readers blocked writers and writers blocked readers, and you tuned your application around lock contention. Postgres’ designers chose a different path borrowed from Berkeley research: keep old versions of rows around, let readers see the version that was current when their transaction started, let writers create new versions without disturbing those readers. This sounds elegant — and it is — but it transferred a cost from the read path to the maintenance path. Old versions are dead bodies in the table. Someone has to clean them up. That someone is VACUUM, and the entire operational tax of running Postgres traces back to this one decision.
The third thing worth knowing about Postgres’ history: it has never been owned by a single company. There’s no “Postgres Inc.” It’s developed by a group of contributors who release roughly once a year, with a strict commitment to backward compatibility and correctness over flashy features. This is why Postgres feels conservative — it doesn’t ship things until they’re right — and why third-party companies (Crunchy, EDB, Citus, Supabase, Neon) build around it rather than forking it. The project culture is: get it right, even if it takes a decade. Logical replication took 10+ years. Parallel query execution took years. Built-in JSON took years. The flip side is that when a feature lands in core Postgres, it’s usually production-ready in a way that a 1.0 release from a startup database isn’t.
3. The Concepts You Need
Postgres has a thick vocabulary. Before you can think about it clearly, you need fluency in roughly four families of terms. We’ll group them: transactions and visibility, storage, processes and memory, maintenance.
Transactions and visibility
A transaction is a sequence of statements that succeed or fail atomically. It starts with BEGIN (implicitly or explicitly) and ends with COMMIT or ROLLBACK. Inside a transaction you see a consistent view of the database, regardless of what other transactions are doing.
Every transaction that modifies data gets a transaction ID (XID) — a 32-bit monotonically increasing integer. Read-only transactions don’t consume an XID; this is an important optimization, because XIDs are scarce (more on this when we discuss wraparound). The XID is the unit Postgres uses to reason about “who did what when.”
A tuple is Postgres’ word for “a particular version of a row.” When you UPDATE a row, the old tuple isn’t modified — a new tuple is appended. Both tuples physically exist; visibility rules decide which one each transaction sees. Every tuple carries two hidden system columns: xmin (the XID that created this version) and xmax (the XID that deleted or superseded it, or 0 if the version is still live). These two integers are the entire core of MVCC. You can SELECT xmin, xmax, * FROM your_table to see them.
MVCC (Multi-Version Concurrency Control) is the policy: a reader sees only tuples whose xmin committed before the reader’s snapshot was taken, and whose xmax either is zero or hadn’t committed by then. This means readers never block writers and writers never block readers. They just see different versions of the same logical row.
A snapshot is a small struct (xmin, xmax, list of in-flight XIDs) that a transaction captures at the moment it needs visibility (depending on isolation level, this is either at the start of the transaction or at the start of each statement). Snapshots are how Postgres turns “what does this transaction see?” from a complicated question into a fast bitfield comparison.
A dead tuple is a tuple that no live snapshot can possibly need anymore — its xmax committed and is older than every active transaction’s xmin. Dead tuples are what VACUUM cleans up. Until then they take up space in the table.
Bloat is the disk space occupied by dead tuples and unused space inside pages. A table that’s 50% bloated has roughly twice the data it logically needs. Bloat slows scans, wastes memory, and is the most common surprise for people new to Postgres.
Isolation levels: READ COMMITTED (default — each statement sees a fresh snapshot), REPEATABLE READ (snapshot taken at first statement, held for the whole transaction), and SERIALIZABLE (REPEATABLE READ plus extra checks that detect serialization anomalies and abort one of the offending transactions). Postgres has no READ UNCOMMITTED — asking for it gives you READ COMMITTED, because dirty reads are impossible in MVCC.
Storage
Postgres stores everything in pages of 8KB by default. A page is the unit of I/O — the smallest thing read from disk or written to disk. Inside a page is a header, a line pointer array (pointing to tuples within the page), and the tuples themselves growing from the bottom up. When a page is full, the next tuple goes on a new page.
A heap is the unordered file of pages that stores a table. “Heap” because there’s no ordering guarantee — rows live wherever there was space when they were inserted. If you want order, you need an index or CLUSTER (which physically reorders, briefly).
An index is a separate data structure that lets you find tuples without scanning the whole heap. Postgres has six built-in index access methods (btree, hash, gin, gist, spgist, brin), and the access method is pluggable — extensions add more (e.g., pgvector adds hnsw and ivfflat).
Each table actually has multiple forks on disk: the main data fork (base/<dbid>/<relfilenode>), the free space map (_fsm, tracks where there’s room to insert new tuples), and the visibility map (_vm, tracks which pages contain only tuples visible to all transactions — used to skip work during vacuum and enable index-only scans). You’ll see these as separate files when you look in the data directory.
A relfilenode is the physical filename for a table’s storage. Important because operations like VACUUM FULL, CLUSTER, REINDEX, and TRUNCATE change the relfilenode — they write a fresh file and swap. The OID stays the same; the relfilenode changes.
The WAL (Write-Ahead Log) is a sequential stream of every change made to the database, written to pg_wal/ as 16MB segments. Every INSERT, UPDATE, DELETE, index modification, page allocation — everything that mutates state — is described in the WAL before the corresponding data page is modified in memory. The WAL is the source of truth for crash recovery, replication, and point-in-time recovery.
The LSN (Log Sequence Number) is a monotonically increasing 64-bit number that identifies a position in the WAL. Every WAL record has an LSN. LSNs are how replicas know what they’ve received, how backups know what to apply, and how you measure replication lag.
Shared buffers is Postgres’ in-memory cache of pages. When a query needs a page, it first checks shared buffers; if it’s there (a hit), no disk I/O. If not (a miss), the page is read from the OS file cache (or actual disk). Shared buffers also holds dirty pages — pages that have been modified but not yet written back to disk. The buffer manager’s job is to keep hot pages in memory and write dirty ones back without losing data.
A checkpoint is a moment where Postgres writes every dirty page back to disk and records “everything before this LSN is safely on disk.” It’s the bookkeeping that lets WAL files be recycled and bounds crash recovery time. Checkpoints happen on a timer (default 5 minutes) or when WAL hits a size threshold (max_wal_size).
A hint bit is a per-tuple flag set lazily on first read, recording “the transaction that created this tuple has committed” (or aborted). It’s a cache that avoids re-checking commit status. Hint bits are the reason a SELECT can dirty a page: the read found uncommitted-looking tuples, the read process checked the commit log, then wrote the hint bit back into the page so the next reader can skip the check.
Processes and memory
The postmaster is the supervisor process. It listens on the configured port (5432 by default), accepts incoming connections, authenticates them, and fork()s a backend for each one. It also forks the various background processes at startup, restarts them if they die, and resets the system if a backend crashes. The postmaster itself doesn’t touch shared memory — it stays minimal and reliable so that even a backend crash can be cleanly recovered from.
A backend is the process that handles one client connection from start to finish. It parses queries, plans them, executes them, and ships results back. Each backend has its own memory (work_mem for sorts/hashes, temp_buffers for temp tables, catalog caches, plan cache) and reads/writes pages through shared buffers. When the client disconnects, the backend exits.
The background processes are the housekeeping crew. The major ones:
- Checkpointer: writes dirty pages out at checkpoints and pads them out so they don’t all hit disk at once.
- Background writer (bgwriter): continuously trickles dirty pages to disk between checkpoints, smoothing I/O.
- WAL writer (walwriter): flushes WAL buffers to disk asynchronously when commits don’t require sync.
- Autovacuum launcher: periodically wakes up and decides which tables need vacuuming, then spawns workers.
- Autovacuum workers: actually run
VACUUMandANALYZEon the chosen tables. - Stats collector / cumulative statistics: tracks per-table activity (rows inserted, updated, dead tuples) — autovacuum reads this to decide what to clean.
- Logical replication launcher / walsender / walreceiver: manage replication streams.
- Archiver: copies completed WAL segments to a backup location if archiving is on.
Shared memory is the segment all processes access concurrently. It contains shared buffers, the WAL buffers, the lock table, the proc array (list of all active backends and their snapshots), and a few smaller structures. Concurrency control on shared memory uses a mix of LWLocks (lightweight locks for short critical sections) and spinlocks. Most people never tune any of this directly — but understanding that all backends share this memory explains why backend count is bounded.
Maintenance
VACUUM (the verb) reclaims space from dead tuples and updates the visibility map. Plain VACUUM doesn’t return space to the OS — it marks the space as reusable inside the file. VACUUM FULL rewrites the entire table to a new file and gives the space back, but takes an ACCESS EXCLUSIVE lock and blocks everything. VACUUM ANALYZE runs vacuum and refreshes planner statistics in one go.
Autovacuum is the daemon that runs VACUUM and ANALYZE automatically. It’s not a separate utility — it’s VACUUM invoked by background workers based on configurable thresholds. Most production tuning of Postgres is autovacuum tuning.
ANALYZE updates the statistics in pg_statistic (most-common-values, histograms, distinct-value estimates) that the planner uses to choose query plans. Bad statistics mean bad plans. Autovacuum runs ANALYZE on a similar trigger to VACUUM.
Freezing is the process of replacing a tuple’s xmin with a special “frozen” marker that means “infinitely old; visible to everyone.” It’s necessary because XIDs are 32-bit and will wrap around after ~4 billion transactions. Frozen tuples don’t need their XIDs compared anymore, so wraparound becomes safe. VACUUM does this opportunistically; if you fall behind, Postgres will eventually start running emergency anti-wraparound vacuums to catch up.
XID wraparound is the apocalyptic failure mode: if VACUUM falls so far behind that the oldest unfrozen XID is about to be overrun by the new XID counter, Postgres shuts down writes to protect data. This used to be a real production threat; modern autovacuum and the “failsafe” mode in PG 14+ make it rare but not impossible.
Replication slot is a server-side bookmark that tells the primary “this consumer needs WAL up to LSN X — don’t recycle anything newer than that.” Slots make replication durable across disconnects. They are also a major footgun: an abandoned slot will pin WAL forever and fill your disk.
That’s the vocabulary. The rest of the document refers back to these concepts constantly — when you read “hint bit” or “tuple visibility” later, come back here if it doesn’t ring a bell.
4. The Distilled Introduction
This section covers what a typical “PostgreSQL course” covers in 10 hours, but compressed to what matters and explained at the right altitude. After this section you can install Postgres, connect to it, work with it confidently, and you’ll understand why the commands do what they do — not just the syntax. We’ll forward-reference the Mental Model and other sections; don’t worry if some “why” answers are deferred.
4.1 Installing and starting Postgres
Postgres has three install paths, and which one you pick shapes your relationship with it.
The package install (apt install postgresql, brew install postgresql, dnf install postgresql-server) gives you a system service: data dir under /var/lib/postgresql/<version>/main, config under /etc/postgresql/<version>/main, started by systemd, runs as the postgres OS user. Best for production servers and for learning the standard layout.
The Docker install is docker run -e POSTGRES_PASSWORD=secret -p 5432:5432 postgres:16. Best for development. Be aware Docker volumes have to be mounted explicitly or you’ll lose data on container removal — -v pgdata:/var/lib/postgresql/data is a common pattern.
The managed install (RDS, Aurora, Cloud SQL, Neon, Supabase, Crunchy Bridge, Heroku) means you don’t run Postgres — someone else does. You get a connection string. This is fine for production but it means many of the things in this document — postmaster, processes, WAL files — you’ll never see directly. You’ll hit the consequences (connection limits, replication lag, wraparound warnings) without seeing the mechanism. Read this document anyway.
A Postgres cluster is one running instance — one postmaster, one shared data directory (PGDATA), one port. It can host many databases. Confusingly, “cluster” in Postgres-speak does not mean “multiple machines”; it means “one instance with possibly many databases inside.” A cluster is initialized with initdb, started with pg_ctl start or systemctl start postgresql.
After install, there’s a default postgres superuser, a postgres database, and two template databases: template0 (a pristine empty template you can’t connect to or modify) and template1 (the template used when you CREATE DATABASE — anything you put in template1 will appear in every new database). When in doubt, work in your own database, not postgres and never template1.
4.2 Connecting
Postgres clients speak a binary wire protocol on TCP port 5432. The reference command-line client is psql. The connection arguments matter:
psql -h localhost -p 5432 -U myuser -d mydb
Or with a connection URL:
psql "postgresql://myuser:secret@localhost:5432/mydb"
If you omit the host, psql connects via Unix socket (faster, common in production). Authentication is governed by pg_hba.conf (the “host-based authentication” file) — a list of rules matching {connection type, database, user, source IP} to an authentication method (trust, md5, scram-sha-256, peer, cert). A common trip-up: edit postgresql.conf and pg_hba.conf to allow remote connections, and reload (SELECT pg_reload_conf()) for pg_hba.conf changes — they don’t need a restart.
psql is itself a substantial tool. The metacommands (\d, \dt, \df, \du, \l) are how you explore the schema; \d users is more useful than reading the catalog manually. \timing on shows query duration. \watch 1 reruns the last query every second (great for monitoring). \copy does client-side bulk load. Learn psql — every Postgres operator falls back to it eventually.
4.3 Roles, databases, schemas
Postgres has a permission model with three layers, and people get it wrong constantly.
A role is what most other systems call a user. Postgres unifies users and groups into roles; a role can LOGIN (be a user) or not (be a group). CREATE ROLE alice WITH LOGIN PASSWORD '...' makes a user. CREATE ROLE devs makes a group. GRANT devs TO alice makes alice a member.
A database is a top-level container. Connections target one database; you can’t query across databases in the same query (unless you use dblink or foreign data wrappers). Cross-database queries are usually a sign of a design problem.
A schema is a namespace within a database. public is the default schema. You can have app.users and analytics.users as distinct tables in the same database. Schemas are how you partition concerns within a database; they’re searched in order according to search_path (default: "$user", public). Production tip: make search_path explicit — relying on the default is how schema-related bugs sneak into apps.
Permissions are granted at object level: GRANT SELECT ON users TO alice. There’s also GRANT USAGE ON SCHEMA app TO alice (you need this to use anything in the schema at all) and the database-level GRANT CONNECT ON DATABASE mydb TO alice. Permissions don’t apply to future tables unless you set ALTER DEFAULT PRIVILEGES — a near-universal new-user gotcha.
4.4 Creating tables and the basics of types
CREATE TABLE accounts (
id bigserial PRIMARY KEY,
email text NOT NULL UNIQUE,
created_at timestamptz NOT NULL DEFAULT now(),
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
A few things to notice rather than gloss over:
bigserialis shorthand for “make a sequence and abigintcolumn with a default that pulls from it.” In modern Postgres (10+), preferid bigint GENERATED ALWAYS AS IDENTITY— same effect, standardized syntax, behaves better with logical replication.textis the right string type for almost everything.varchar(n)exists, butndoesn’t make Postgres faster — it just adds a check constraint. There’s no length-vs-performance tradeoff like in some other databases.timestamptz(timestamp with time zone) stores UTC and converts on display. Use it.timestamp(without tz) stores wall-clock time with no zone info — a footgun on multi-region systems.jsonbis the binary JSON type. It’s parsed and indexed;jsonis text-only. Usejsonbunless you have a reason not to.PRIMARY KEYimpliesNOT NULL UNIQUEand creates a btree index automatically. So doesUNIQUE.
Beyond the obvious types, Postgres’ type system is a feature, not a chore. Worth knowing:
- Numeric types:
bigintfor IDs and counts;numeric(arbitrary precision) for money — neverfloat.numeric(12,2)for currency. - uuid: native 16-byte type, far better than storing as text.
- Arrays: any type can be an array, e.g.
text[],int[]. Indexable with GIN. - Range types:
int4range,tstzrangeetc. — first-class half-open intervals. Combined with exclusion constraints, they give you “no overlapping bookings” as a constraint. - enum: declared with
CREATE TYPE status AS ENUM (...). Compact, but adding a value requiresALTER TYPE ... ADD VALUE(and historically had limitations — fine in modern versions). - Geometric types and PostGIS: built-in
point,polygon, etc., plus the PostGIS extension if you need real geographic work.
Constraints worth knowing: CHECK (length(email) > 0), FOREIGN KEY ... REFERENCES other_table(id) ON DELETE CASCADE, EXCLUDE USING gist (room WITH =, during WITH &&) (no two reservations on the same room with overlapping time ranges).
4.5 Inserting, updating, deleting
INSERT INTO accounts (email) VALUES ('a@example.com');
INSERT INTO accounts (email) VALUES ('b@example.com')
RETURNING id, created_at;
INSERT INTO accounts (email) VALUES ('c@example.com')
ON CONFLICT (email) DO NOTHING;
INSERT INTO accounts (email, metadata) VALUES ('d@example.com', '{"plan":"free"}')
ON CONFLICT (email) DO UPDATE SET metadata = accounts.metadata || EXCLUDED.metadata;
RETURNING is huge — the inserted/updated rows come back without a follow-up query. Use it. ON CONFLICT is “upsert,” and the EXCLUDED pseudo-table refers to the row that would have been inserted.
UPDATE and DELETE work as expected, but remember the Mental Model: an UPDATE is not an in-place mutation. Postgres marks the old tuple as deleted and writes a new tuple. Old tuple sticks around until vacuum. This has consequences we’ll cover.
COPY is the bulk-load command. COPY accounts FROM STDIN WITH (FORMAT csv) is 10–100x faster than INSERT for large loads. The \copy metacommand in psql does the same client-side.
4.6 Queries — what you should already know in two paragraphs
SELECT cols FROM table WHERE conditions GROUP BY ... HAVING ... ORDER BY ... LIMIT ... OFFSET .... Joins (INNER, LEFT, RIGHT, FULL, CROSS). Subqueries in WHERE and FROM. Set operations (UNION, INTERSECT, EXCEPT). DISTINCT and DISTINCT ON. Window functions (row_number() OVER (PARTITION BY user_id ORDER BY created_at)). CTEs (WITH active AS (SELECT ...) SELECT ... FROM active JOIN ...). All of this is standard SQL territory.
What’s worth flagging: CTEs in Postgres used to be optimization fences (the planner couldn’t push predicates through them). Since PG 12 they’re inlined by default unless you write WITH ... AS MATERIALIZED, which forces materialization. If you’re reading older Postgres advice, “use a subquery, not a CTE, for performance” is no longer valid. Use whichever is more readable.
4.7 Transactions
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
Or ROLLBACK instead of COMMIT to undo. Inside a transaction, an error doesn’t abort the whole transaction immediately, but the transaction goes into a “failed” state where every subsequent statement returns ERROR: current transaction is aborted, commands ignored until end of transaction block. You then have to issue ROLLBACK. Savepoints let you recover: SAVEPOINT s1; ... ; ROLLBACK TO SAVEPOINT s1;.
Postgres has transactional DDL — CREATE TABLE, ALTER TABLE, CREATE INDEX are all transactional and roll back if the transaction aborts. This is rare among databases. It means schema migrations can be wrapped in BEGIN/COMMIT and either fully apply or not at all. The main exceptions: CREATE INDEX CONCURRENTLY, VACUUM, and REINDEX CONCURRENTLY cannot run inside a transaction block.
Set isolation level per transaction with BEGIN ISOLATION LEVEL REPEATABLE READ (or SERIALIZABLE). The default READ COMMITTED is fine for almost everything but means each statement sees a fresh snapshot, so you can read different values within the same transaction. REPEATABLE READ pins the snapshot for the whole transaction.
4.8 Indexes
CREATE INDEX ON accounts (email);
CREATE INDEX ON orders (user_id, created_at DESC);
CREATE INDEX ON orders (user_id) WHERE status = 'pending'; -- partial index
CREATE INDEX ON users (lower(email)); -- expression index
CREATE INDEX ON events USING gin (data jsonb_path_ops); -- jsonb index
CREATE INDEX CONCURRENTLY ON big_table (col); -- non-blocking
Default is btree. Always use CONCURRENTLY in production — without it, CREATE INDEX takes an ACCESS EXCLUSIVE lock for the entire build, blocking reads and writes. With CONCURRENTLY it scans the table twice and lets writes through; it takes longer but doesn’t take the table down. The cost: it can’t be inside a transaction, and if it fails, you’re left with an INVALID index that needs to be dropped and retried.
Multi-column index ordering matters: an index on (a, b) can serve WHERE a = ?, WHERE a = ? AND b = ?, or sort by a then b. It can’t efficiently serve WHERE b = ? alone — that requires a “skip scan” trick the planner only gained recently.
Partial indexes (WHERE status = 'pending') only index rows matching the condition. Tiny on disk, perfect when you query exactly that subset. Common pattern: index a “soft delete” flag or a hot status.
Expression indexes (lower(email)) let WHERE lower(email) = ? be indexed.
We’ll cover the other index types (GIN, GiST, BRIN, hash) in the architecture and judgment-call sections.
4.9 EXPLAIN
EXPLAIN SELECT * FROM accounts WHERE email = 'x'; -- estimated plan
EXPLAIN ANALYZE SELECT * FROM accounts WHERE email = 'x'; -- actually runs it, shows actuals
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT ...; -- the version you actually want
EXPLAIN is non-negotiable Postgres skill. The output is a tree of nodes (read inside-out — leaves first). Every node has an estimated cost, an estimated row count, and (with ANALYZE) actual time and actual rows. The two numbers you stare at most:
- Estimated rows vs actual rows: if these are off by 10x or more, your statistics are wrong or there’s a correlated predicate the planner can’t model. Either run
ANALYZE, raisedefault_statistics_target, or create extended statistics (CREATE STATISTICS). Bad row estimates lead to bad plans. - Time per loop × loops: a node showing
(actual time=0.1..0.2 rows=5 loops=1000)ran 1000 times for a total of 100–200ms even though “0.1..0.2” looks tiny.
BUFFERS shows shared hits vs reads — how much was cached vs from disk. VERBOSE adds expression details. auto_explain is an extension that automatically logs EXPLAIN ANALYZE for slow queries — turn it on.
We’ll dig into how the planner thinks in the Mental Model section.
4.10 Backups and recovery
There are two kinds of backups, and people who confuse them lose data.
Logical backups with pg_dump produce a SQL file (or custom-format archive) of one database. They’re slow on big databases, version-portable, and great for moving data between environments. Restore with psql or pg_restore.
Physical backups with pg_basebackup (or pgbackrest, barman) copy the data directory at the file level, plus enough WAL to make it consistent. They’re fast, support point-in-time recovery (PITR — restore to any moment in time using archived WAL), and are the only realistic option for large databases. They’re not version-portable.
Point-in-time recovery requires WAL archiving (archive_mode = on, archive_command = '...'). Without archived WAL, you can only restore to the moment of the base backup. With it, you can restore to any LSN/timestamp between the base backup and your most recent archived WAL.
For dev databases, pg_dump is fine. For anything that matters, set up pgbackrest or barman and test restores. An untested backup is not a backup.
4.11 Replication, in 30 seconds
Postgres has two kinds of replication:
Physical (streaming) replication: the standby has a byte-identical copy of the primary’s data files. The primary streams its WAL to the standby, which replays it. The standby is read-only (“hot standby”) and can be promoted to primary on failover. All databases in the cluster are replicated; you can’t pick. This is the workhorse for HA.
Logical replication: the primary “decodes” its WAL into row-level changes and ships them to subscribers. The subscriber is a normal Postgres database — it can write, have its own schema, even subscribe to multiple publishers. Use it for selective replication (specific tables), version upgrades (replicate from v15 to v17, then cut over), or feeding downstream systems.
Both use the WAL. Both require wal_level = replica (physical) or logical (logical). Both can use replication slots. We’ll go deeper later.
4.12 Configuration
postgresql.conf is the main config file. The settings that matter on day one:
shared_buffers: ~25% of RAM, max ~40%. Postgres’ page cache.effective_cache_size: ~50–75% of RAM. Not an allocation — it’s the planner’s hint for “how much memory the OS file cache plus shared buffers probably hold.” Higher → planner prefers index scans.work_mem: per-operation memory for sorts and hashes. Default 4MB is too low for analytical work. Be careful: a query with 5 sorts × 100 connections can use 500 ×work_memsimultaneously. Set conservatively globally; bump per-session for analytical queries.maintenance_work_mem: memory forVACUUM,CREATE INDEX,ALTER TABLE. Default 64MB is way too low. Bump to 1–2GB on a real server.max_connections: default 100. Going much higher without a connection pooler is a bad idea — see Mental Model.wal_level:replica(default in modern versions) supports replication;logicalsupports logical replication. Setting it lower is rarely worth it.checkpoint_timeoutandmax_wal_size: control how often checkpoints happen. Defaults are conservative; loosening them reduces I/O at the cost of longer recovery time. We’ll discuss in Judgment Calls.
ALTER SYSTEM SET shared_buffers = '4GB' writes to postgresql.auto.conf (overrides the main config). Some settings need a restart, others a reload (SELECT pg_reload_conf()). SHOW shared_buffers to see the current value.
4.13 Extensions
CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION pgcrypto;
CREATE EXTENSION postgis;
Extensions are how Postgres adds capability without touching core. The ones nearly every production deployment should have:
- pg_stat_statements: tracks aggregated stats per query template. The single most useful tool for “what’s slow.” Requires being added to
shared_preload_librariesand a restart. - auto_explain: logs slow query plans automatically.
- pgcrypto: cryptographic functions.
- pg_trgm: trigram similarity search — fuzzy text matching, fast LIKE on indexed columns.
Big extensions worth knowing exist:
- PostGIS: geographic data. The de facto GIS database.
- TimescaleDB: time-series data with automatic partitioning.
- pgvector: vector similarity search for ML embeddings.
- Citus: turn Postgres into a distributed sharded cluster.
The default install only contains the “contrib” extensions (pg_stat_statements, pgcrypto, hstore, pg_trgm, etc.). The big ones are separate packages.
4.14 The minimum monitoring set
Three queries you will run often:
-- What's running right now?
SELECT pid, age(clock_timestamp(), query_start), state, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY query_start;
-- What queries are slow on average?
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;
-- Table-level health
SELECT relname, n_live_tup, n_dead_tup,
round(100.0*n_dead_tup/NULLIF(n_live_tup+n_dead_tup,0),1) as dead_pct,
last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 20;
Memorize these. They’ll resolve 80% of the “what’s wrong with the database?” questions you’ll ever ask.
That’s the distilled introduction. Everything from here on builds on these basics — but also on the Mental Model, which is what makes Postgres make sense rather than just be a list of commands.
5. The Mental Model
If you take away nothing else from this document, take these four ideas. Every Postgres behavior you’ll ever encounter — every gotcha, every config knob, every operational quirk — falls out of one of these.
Core Idea 1: An UPDATE is a DELETE plus an INSERT.
This is the foundational fact, and it is not a metaphor. Postgres physically does not modify a row in place when you UPDATE it. It writes a new tuple, stamps the old one’s xmax with the updating transaction’s XID, and leaves the old tuple right where it was. Both versions coexist on the same page (or on different pages if there isn’t room for the new one nearby).
This single fact predicts an enormous amount:
- Tables grow even when you never
INSERT. A workload that only updates a fixed set of rows still grows the heap, because every update writes a new tuple. Without vacuum, the table grows linearly with update count. SELECTcan be slower than you expect on a high-update table. A sequential scan reads dead tuples too — it has to, because it doesn’t know which ones are dead until it checks visibility. A 1GB table with 5GB of dead tuples scans 6GB.- Indexes have to be updated on every
UPDATE, even if the indexed column didn’t change. The new tuple lives at a different physical location, so its index entry must point there. (HOT updates are an optimization for the case where no indexed column changes and the new tuple fits on the same page — see Section 7. They’re the exception, not the rule.) - Rolling back an
UPDATEis free. The new tuple’sxminbelongs to a transaction that aborted; visibility rules will skip it. The old tuple’sxmaxbelongs to the same aborted transaction; visibility rules ignore that too. No undo log, no rewriting. The cost is paid later when vacuum cleans up the now-truly-dead new tuple. COMMITis incredibly cheap. It writes one record to WAL (“XID 12345 committed”), flushes WAL to disk, and that’s it. No data pages are written at commit time.
The cost of this design is that someone has to clean up dead tuples eventually — and that’s VACUUM. The asymmetry is: writes and rollbacks are fast, reads can be slow if vacuum falls behind. Postgres trades read efficiency for write efficiency and concurrency.
Core Idea 2: Visibility is a comparison, not a structure.
When a transaction reads a tuple, Postgres doesn’t look up the tuple in some “current versions” index. It reads the page, finds all the tuples on it (live or dead), and runs a visibility check on each: given my snapshot, can I see this tuple? The check is simple:
- If the tuple’s
xminhasn’t committed by my snapshot’s cutoff, skip it (the inserting transaction wasn’t done yet when I started). - If the tuple’s
xmaxhas committed by my snapshot’s cutoff, skip it (it was deleted before I started). - Otherwise, show me this tuple.
The tuple itself doesn’t know if it’s “current.” Pgsql doesn’t maintain a “latest version pointer.” Visibility is computed at read time, per reader, against that reader’s snapshot.
This predicts:
- Different transactions see different rows simultaneously, with no contention. Two readers and a writer all hit the same page at the same time. They each compute visibility independently. The writer doesn’t lock anyone out; the readers don’t lock the writer out. This is the “readers don’t block writers, writers don’t block readers” promise of MVCC.
- A long-running transaction holds back the cleanup of every dead tuple in the entire database. Vacuum can only remove tuples that are dead from every live transaction’s perspective. If you have a transaction that started 6 hours ago, vacuum can’t clean up tuples that died 5 hours ago — they might still be visible to that transaction. This is the operational gotcha. Discussed extensively in Section 7.
- The
xminhorizon is a real, physical thing. There’s a value, sometimes called the “xmin horizon,” equal to the oldest XID that any active transaction (or replication slot, or prepared transaction) might still need. Vacuum can free anything older than that. You can monitor it:SELECT min(backend_xmin) FROM pg_stat_activity. - Hint bits are a side effect of this checking. The first reader to see an uncommitted-looking tuple has to check the commit log to determine its fate. After that, it sets a hint bit on the tuple (“xmin committed” or “xmin aborted”) so future readers can skip the lookup. This means reading a freshly-written page can dirty it — surprising the first time you encounter it.
Core Idea 3: WAL is the database. The data files are a derivative.
Every change in Postgres is described in the WAL before the corresponding page is modified in shared buffers. The page itself isn’t immediately written to disk — that happens later, by the checkpointer or background writer. The contract is: when COMMIT returns, the WAL record is fsync’d to disk; the data files might still hold stale versions of the modified pages, possibly for minutes.
This is the “write-ahead” in write-ahead logging. The invariant is: WAL is durable before any data page modification is acknowledged. Crash recovery works by replaying WAL from the last checkpoint forward, applying every change to the data pages.
This predicts:
- A transaction commit is one WAL flush, not many page writes. This is what makes high write throughput possible. You’re paying one fsync per commit, not N writes for N modified pages.
- Replication is just shipping the WAL. The standby’s only job is “receive WAL, replay it on local data files.” That’s the entire mechanism. Streaming replication, PITR, logical replication — all variants of the same idea.
- A backup is a copy of the data files plus enough WAL to make them consistent.
pg_basebackupcopies files while the database is running (so the copy is fuzzy — half the pages are from before yourINSERT, half are from after) and includes the WAL needed to roll the fuzzy copy forward to a consistent point. - Crash recovery time is bounded by checkpoint frequency. After a crash, Postgres replays WAL from the last checkpoint. If checkpoints are 30 minutes apart, recovery can be up to 30 minutes of WAL replay. Make checkpoints more frequent → faster recovery, more I/O. Make them less frequent → slower recovery, less I/O.
- WAL grows with write volume, not data volume. A workload that updates the same row a billion times generates a billion WAL records but a tiny amount of net data change. WAL is unbounded between checkpoints — that’s why
max_wal_sizeis a thing, that’s why a stuck replication slot can fill your disk. fsync = offis a foot-cannon, not a foot-gun. Disable fsync and Postgres becomes “fast” at the cost of “data lives.” A crash will leave you with a permanently corrupted database. Don’t.
Core Idea 4: One client connection equals one OS process.
When a connection arrives, the postmaster fork()s a new backend process. That process is the connection — it has its own PID, its own memory (work_mem, plan cache, catalog cache), its own file descriptors. When the client disconnects, the process exits.
This is the most consequential architectural choice in Postgres for application developers. It predicts:
- Connections are expensive. Forking a process costs single-digit milliseconds. The process baseline is ~5–10MB. At hundreds of connections you’re using gigabytes of RAM just on backend overhead.
max_connectionsdefaults to 100, and you mostly shouldn’t raise it much higher than 200–500. More connections means more processes means more context switching and more memory. The sweet spot for most servers is “as low as you can get away with.”- You need a connection pooler. Web apps with thousands of concurrent users can’t have thousands of Postgres backends. PgBouncer (or built-in pooling in some managed services) accepts thousands of client connections and multiplexes them onto a much smaller pool of real backends. This is not optional in production — it’s the standard architecture.
- Idle connections still consume real resources. Each one is a process holding RAM. They show up as
idleinpg_stat_activity. Setidle_in_transaction_session_timeoutso a forgotten BEGIN doesn’t pin resources forever. - Per-backend caches don’t share. Catalog cache, plan cache, prepared statements — each backend has its own. With pgbouncer in transaction mode, prepared statements break (different transactions get different backends; the prepared statement only exists on one). This is fixable in modern pgbouncer but you have to know about it.
- Crashes are isolated. If a backend segfaults (it shouldn’t, but it can — buggy extension, hardware error), only that backend dies. The postmaster notices, kills all the other backends to prevent shared-memory corruption, and restarts. This is more resilient than thread-based databases where one segfault kills everyone — but it does mean a crashed backend takes down all current sessions briefly.
These four ideas together explain Postgres. Every operational quirk you’ll meet for the rest of your career is a corollary of one of them.
6. The Architecture in Plain English
Let’s walk through what actually happens when you do things, end-to-end. This is the same machinery the Mental Model abstracts; seeing the pieces move makes the model concrete.
A connection arrives
A client TCP-connects to port 5432. The postmaster’s listen socket accepts it. The postmaster calls fork() to create a child process. The child immediately begins authentication — reads pg_hba.conf, matches the connection against the rules, runs the configured auth method (password check, LDAP, certificate, etc.). If auth succeeds, the child becomes a backend, identifies which database to attach to, opens connections to shared memory, loads the system catalogs into its catalog cache, and tells the client “ready for query.”
Why does authentication happen in the child rather than the postmaster? Because authentication can block on slow libraries (LDAP, PAM, SSL handshakes). Putting it in a child means a slow auth doesn’t slow down accepting other connections. This is the same logic that motivates threads in other databases — Postgres just achieves the isolation with processes.
Once attached, the backend reads queries from the client over the wire protocol, processes them, and writes responses back. It does this until the client disconnects, at which point the backend proc_exit()s and the process terminates.
A query runs
A query string arrives at the backend. Five stages happen:
- Parse: turn the string into an abstract syntax tree. Catches syntax errors. Cheap.
- Analyze + Rewrite: look up table names in the catalog cache, resolve column references, expand views (views are rewritten into the underlying queries), apply rules. Now you have a query tree — a logical description of what to compute, not how.
- Plan: the optimizer’s job. Generate possible plans, estimate the cost of each, pick the cheapest. We’ll come back to this.
- Execute: walk the plan tree, pulling tuples through it. Each plan node has a “next tuple” interface (in the classic iterator model) or a vectorized execution path in newer code paths. The executor reads pages through the buffer manager, runs visibility checks, applies predicates, computes outputs.
- Send: format result rows according to the wire protocol and ship them to the client. For large results, this happens in chunks — you don’t hold the whole result in memory.
Steps 1–3 produce a “plan tree” that — for prepared statements — can be cached and reused. The executor (step 4) is where most CPU time goes for short queries. For long queries, time goes to disk reads, sort/hash operations, and lock waits.
A page is read
The executor needs page 12345 of relation accounts. It calls into the buffer manager: “give me page 12345 of accounts.” Buffer manager:
- Compute a hash of (relation, page number). Look in shared buffers.
- If the page is there (a hit), pin it (so it can’t be evicted while in use), return a pointer.
- If not, find a buffer to evict using a clock-sweep algorithm. If the chosen buffer is dirty, write it out first — synchronously, in this backend.
- Read the requested page from disk (or, more accurately, from the OS file cache; Postgres usually doesn’t
O_DIRECT). Place it in the buffer slot. Return a pointer.
Step 3 is why a SELECT can occasionally take much longer than expected: the backend got unlucky and had to flush a dirty buffer before it could satisfy its read. The background writer exists to make this rare — it pre-flushes dirty buffers in the background. Configure bgwriter_lru_maxpages and friends to control how aggressive it is.
A tuple is updated
You run UPDATE accounts SET balance = balance - 100 WHERE id = 1;. The backend:
- Looks up
accountsin the catalog. Finds the relfilenode, knows what columns exist. - Finds the row to update. If there’s an index on
id(there is — primary key), it’s an index scan: walk the btree, get the tuple’s page+offset (ctid), fetch the page through the buffer manager, locate the tuple. - Acquires a row-level lock on the tuple (specifically, sets the tuple’s
xmaxto the current XID with a “for-update” hint, blocking other writers). - Computes the new tuple value. Looks for free space on the same page (via the page header). If there’s room, the new tuple goes on the same page. If not, it goes to a different page — the free space map tells the backend where.
- Writes a WAL record describing the change. The WAL record includes enough information to redo the change after a crash: the old and new tuple, the affected pages, etc.
- Modifies the page(s) in shared buffers. The old tuple’s
xmaxis set; the new tuple is appended. - If the indexed columns changed (or if the new tuple doesn’t fit on the same page → not a HOT update), updates each index to point to the new tuple. Each index update is its own WAL record.
- Returns the row to the client (
UPDATE 1, or withRETURNING, the new row).
When the transaction commits, one more WAL record is written (“XID committed”) and the WAL up through that record is flushed to disk via fsync. The data pages remain dirty in shared buffers. Eventually — on the order of minutes — the checkpointer or background writer writes them out.
A checkpoint happens
Either checkpoint_timeout (default 5min) or accumulated WAL exceeding max_wal_size (default 1GB) triggers a checkpoint. The checkpointer process:
- Records the current LSN as the “redo point” — recovery after a crash will start replaying WAL from here.
- Walks shared buffers, identifying dirty pages.
- Writes them out to disk, throttled over
checkpoint_completion_target×checkpoint_timeoutseconds (default 0.9 × 5min = 4.5 minutes). The throttling exists so you don’t slam I/O for 30 seconds and then idle for 4.5 minutes. fsyncs the data files.- Writes a checkpoint record to WAL.
- Updates
pg_control(a tiny file in the data directory) with the new redo point.
After step 6, the WAL up to the redo point is no longer needed for recovery, and the segments can be recycled or archived.
If the checkpoint takes longer than the next checkpoint trigger, you’ll see warnings in the logs about checkpoints occurring too frequently. That means your write workload has outpaced your I/O — either raise max_wal_size (more WAL accumulation, longer recovery) or scale up I/O.
Autovacuum kicks in
Periodically (every autovacuum_naptime, default 1 minute), the autovacuum launcher wakes up. For each database in the cluster, it checks the per-table statistics (live tuples, dead tuples, last vacuum time). For each table, it computes:
- The vacuum trigger:
n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup(default: 50 + 0.2 × n_live_tup). - The analyze trigger: similar, with separate parameters.
- The wraparound trigger:
age(relfrozenxid) > autovacuum_freeze_max_age(default: 200 million transactions).
Tables exceeding any threshold get queued. The launcher spawns autovacuum worker processes (up to autovacuum_max_workers, default 3) which pick tables off the queue and VACUUM them.
A worker doing VACUUM on a table:
- Acquires a
SHARE UPDATE EXCLUSIVElock — blocks other vacuums and DDL, but not normalSELECT/INSERT/UPDATE/DELETE. - Scans the heap, looking at each tuple. Records dead tuple ctids in memory (
autovacuum_work_mem). - Once it has a batch (or fills memory), visits each index and removes index entries pointing to dead ctids.
- Comes back to the heap and marks dead tuple slots as reusable. Updates the free space map.
- Updates the visibility map for pages that now contain only all-visible tuples.
- If
VACUUM ANALYZE(or autovacuum decided to also analyze), samples rows and updatespg_statistic. - Releases the lock.
Plain VACUUM does not return space to the OS. The table file stays the same size, but holes inside it can be reused for future inserts/updates. This is fine for most workloads. VACUUM FULL rewrites the table to a new file and gives space back to the OS — but it takes ACCESS EXCLUSIVE and blocks everything. There’s also pg_repack (extension), which does the same as VACUUM FULL but online, with brief locks.
A standby replays WAL
The primary has a walsender process for each connected standby. The standby has a walreceiver process. They speak the streaming replication protocol over a normal TCP connection (using the same protocol as psql, but in “replication mode”).
Workflow:
- The walreceiver tells the walsender “I have up to LSN X. Send me what comes after.”
- The walsender reads WAL records from the WAL files on the primary and streams them to the walreceiver.
- The walreceiver writes them to its local
pg_wal/and signals the startup process. - The startup process is doing continuous WAL replay — reading WAL records from local
pg_wal/and applying them to data pages on the standby. - With hot standby on, the standby is open for read-only queries. Those queries see the state of the database as of however far replay has gotten. There’s a tiny lag.
A replication slot on the primary is just a record: “this consumer is at LSN X.” The primary won’t recycle WAL newer than X, even if the consumer disconnects. This is what makes replication durable across restarts. It’s also why a forgotten slot can fill your disk.
A logical replication change is decoded
Logical replication adds a “decoder” step. The walsender, instead of sending raw WAL, runs each WAL record through a logical decoding plugin (typically pgoutput, the built-in one). The decoder:
- Reads the WAL record. Knows it’s, say, an UPDATE to a tuple in
orders. - Looks up the table’s schema as it was at the LSN of the WAL record (this is why logical replication needs catalog history —
wal_level = logicalkeeps catalog tuples around longer). - Outputs a logical change: “table
orders: UPDATE row {id=42, status=‘shipped’} → {id=42, status=‘delivered’}.” - Buffers changes per transaction; on COMMIT, flushes the whole transaction’s changes to the subscriber.
The subscriber receives these logical changes and applies them as ordinary SQL statements. Because it’s just SQL, the subscriber can have a different schema, additional indexes, write its own data, etc.
That’s the architecture in motion. Now we can talk about what bites.
7. The Things That Bite You
Each of these has burned thousands of engineers. They’re all corollaries of the Mental Model — but you don’t see them coming until you do.
7.1 A long-running transaction blocks vacuum on every table
You leave a psql open with BEGIN; SELECT * FROM small_table; and walk away for the weekend. Meanwhile, your write-heavy events table generates a million dead tuples per hour. Autovacuum runs, but it can’t actually clean anything: visibility rules say “the open transaction’s snapshot might still need these tuples.” After 48 hours, your events table has 96 million unfreeable dead tuples. Disk is filling up. Your queries are slow because seq scans are reading 100x more dead tuples than live ones.
This is the most common bloat cause and the operational reason long transactions are toxic. Mental Model 2 in action: vacuum can only remove tuples older than the oldest live snapshot. One long-running snapshot pins everything.
How to handle it: monitor pg_stat_activity for old transactions (age(now(), xact_start)), set idle_in_transaction_session_timeout (kills idle-in-tx sessions), set statement_timeout for queries, and educate everyone that “I’ll just leave this BEGIN open” is not a thing. Same applies to abandoned replication slots and prepared transactions — both pin xmin.
7.2 An UPDATE to a row keeps growing the table
You have a counters table with 100 rows. Your app increments a counter on every request. After a week, the table is 8GB. You’re confused — there are still 100 rows.
Mental Model 1: every UPDATE creates a new tuple. The old tuples accumulate as bloat. Vacuum reuses space within the existing file but doesn’t shrink it. If updates outpace vacuum, the file grows.
Fixes: design counter tables to be append-only with periodic rollups; set aggressive per-table autovacuum settings on hot tables (ALTER TABLE counters SET (autovacuum_vacuum_scale_factor = 0.01)); reduce fillfactor (more on this in 7.3) so HOT updates can keep up. For an existing bloated table, pg_repack rewrites it online.
7.3 HOT updates aren’t as common as you think
Postgres has an optimization called “Heap-Only Tuple” updates. If (a) no indexed columns changed, and (b) there’s room on the same page, the new tuple goes on the same page and no index update is needed. The index entry still points to the original tuple, which has a forward pointer to the new one. Vacuum can clean these up incrementally without a full index pass.
The problem: HOT requires room on the same page. If the page is full, a non-indexed-column update still has to write to a different page, which means an index update, which defeats HOT. The fillfactor parameter (default 100% for tables) controls how much space to leave free on each page when bulk-loading. Set fillfactor = 70 on update-heavy tables and you give HOT room to work.
Also: any new index on the table can convert HOT-eligible updates into non-HOT. Adding an index to a hot table can mysteriously make UPDATE-heavy workloads slower because of bloat.
7.4 The “wraparound” panic
You wake up to logs full of WARNING: database "prod" must be vacuumed within X transactions. Eventually ERROR: database is not accepting commands to avoid wraparound data loss in database.
XIDs are 32-bit. The counter wraps. Tuples need to be “frozen” before their XID becomes the oldest possible XID, which the next write would alias. Vacuum freezes tuples; if vacuum has been failing or absent for hundreds of millions of transactions, you approach the wraparound point. At 200M transactions of XID age (default autovacuum_freeze_max_age), Postgres starts running emergency anti-wraparound vacuums automatically. At ~3M from the limit, it shuts down writes.
Fixes (in increasing severity): make sure autovacuum is on and not blocked; add temporary aggressive vacuum freeze commands; if you’re already shut down, shut Postgres down, restart in single-user mode, run VACUUM FREEZE manually, restart normally. PG 14+ has a “failsafe” mode that bypasses cost limits when wraparound is imminent — read about it.
What blocks vacuum from freezing? Same as 7.1: long-running transactions, replication slots, prepared transactions. Wraparound is the catastrophic version of the same root cause.
7.5 ANALYZE is silent and important
You add an index. Queries don’t get faster. You confirm the index exists. You EXPLAIN and see it still doing seq scan. WTF.
Possibilities: (a) the table is small enough that seq scan is cheaper, (b) the predicate doesn’t actually match the index, (c) statistics are wrong and the planner thinks the predicate matches too many rows.
(c) is the surprising one. ANALYZE populates pg_statistic with histograms and most-common-values lists. The planner uses these to estimate row counts. Out-of-date or low-resolution statistics → bad estimates → bad plans. After bulk loads, after major data changes, after schema changes — ANALYZE is what makes the planner sane. Autovacuum runs ANALYZE automatically but only when the analyze threshold is hit; for fresh data it might be lagging.
The deeper subtlety: default_statistics_target (default 100) controls histogram resolution. For columns with skewed distributions or millions of distinct values, 100 is too low — bump to 500 or 1000 for those columns with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500. Even more advanced: extended statistics (CREATE STATISTICS) for correlated columns. Without them, the planner assumes columns are independent — WHERE country = 'US' AND state = 'CA' is estimated as P(US) × P(CA), which is wildly wrong.
7.6 “Idle in transaction” is more dangerous than “active”
pg_stat_activity has a state column with values like active, idle, idle in transaction. The dangerous one is idle in transaction — a backend that ran BEGIN, did some work, and is now sitting there with the transaction open, waiting for the application to send the next command.
Connections that are merely idle (no transaction open) hold no XID, hold no snapshot, block no vacuum. They consume only RAM. Connections that are idle in transaction hold a snapshot, hold any locks they acquired, prevent vacuum from cleaning anything newer than their snapshot. They’re 7.1 in disguise.
Always set idle_in_transaction_session_timeout in production. 30 seconds to a few minutes is reasonable. The application gets an error if it tries to use the connection after timeout — that’s a feature, it forces the bug to be visible rather than silent disk fill.
7.7 Replication slots can fill your disk
You set up logical replication for a temporary purpose, then turned the subscriber off without dropping the slot. The primary obediently retains every WAL segment newer than the slot’s confirmed LSN. After two weeks of writes, pg_wal/ is 500GB and the disk is full and Postgres can’t even write the next WAL record to commit a transaction. Now you’re really in trouble.
SELECT slot_name, active, pg_current_wal_lsn() - confirmed_flush_lsn AS lag FROM pg_replication_slots — that lag in bytes is what’s pinned. If a slot is active = false and you don’t need it, drop it. If it’s active but a downstream consumer is broken, fix the downstream or drop the slot. Set max_slot_wal_keep_size (PG 13+) to bound this — slots beyond the limit get invalidated rather than filling your disk.
7.8 Indexes get bloated too
Indexes have their own bloat. Btree indexes are particularly prone to it because of their split-and-don’t-merge behavior on deletions and updates. A heavily-updated indexed column can leave an index that’s twice the size it needs to be. pg_stat_user_indexes shows usage; pgstattuple (extension) shows bloat. The fix is REINDEX CONCURRENTLY index_name (PG 12+) — rebuilds without long locks.
Symptoms: index size growing faster than table size, or growing despite the table being stable. After major updates, a periodic reindex of the busiest indexes is healthy.
7.9 Connections aren’t free; idle connections aren’t free either
Mental Model 4 again. A web app with 500 backends consumes a few GB just on backend overhead, plus per-backend work_mem allocations during queries. At thousands of connections you’re context-switching the kernel into the floor.
The fix is always pgbouncer (or a similar pooler). The trap: in transaction pooling mode (where pgbouncer multiplexes multiple clients onto each backend, switching at transaction boundaries), session-level features break. Server-side prepared statements break (different transactions get different backends; the cached plan only exists on one). LISTEN/NOTIFY breaks. Temporary tables with ON COMMIT PRESERVE ROWS break. Advisory session locks break. Either avoid these features, or use session pooling (much weaker savings), or use direct connections for the operations that need them.
Modern pgbouncer (1.21+) supports protocol-level prepared statements in transaction mode, which fixes the most painful case. Read your pooler’s docs.
7.10 SELECT count(*) is slower than you expect
Postgres has no count(*) shortcut. There’s no metadata that says “this table has N rows.” Why? Mental Model 2: visibility is per-snapshot. Different transactions see different counts. There’s no single “the count” to cache.
So SELECT count(*) FROM big_table does a sequential scan, applying visibility checks. With an index-only scan and the visibility map up to date, it can use the index, but it still has to count every entry.
Workarounds: use pg_class.reltuples for an approximate count (updated by vacuum/analyze; not exact), or maintain a counter table updated by trigger (introducing contention). Or just don’t run count(*) on huge tables on the request path.
7.11 The planner doesn’t know what your application knows
You write SELECT * FROM events WHERE user_id = $1 AND created_at > now() - interval '1 day'. There’s an index on (user_id, created_at). But the planner does a seq scan instead of an index scan. Why?
The planner estimates: how many rows match user_id = $1? If it estimates more than ~5–10% of the table, it picks seq scan because seq scan is faster than random index lookups when the result set is large. If the user has a million events and the planner doesn’t know that this particular user is high-volume, it might mis-estimate.
Tools: EXPLAIN ANALYZE to see actuals vs estimates. If estimates are off by 10x+, that’s your problem. Solutions: better statistics (ALTER TABLE ... SET STATISTICS), extended statistics for correlated columns, or — last resort — query hints via pg_hint_plan (extension; not in core). Most of the time, the right fix is fixing statistics.
There’s also “row constructor weirdness,” “OR vs UNION,” “window function placement” — each of which can cause plans you don’t expect. The root cause is always: the planner has different information than you do. EXPLAIN is your only reliable feedback loop.
7.12 The serial and identity gap problem
You insert rows with a bigserial primary key. The IDs are 1, 2, 3, 5, 6 — what happened to 4? Probably a transaction did INSERT, got ID 4, then rolled back. The sequence advanced; the row never committed; the ID is gone.
This is by design. Sequences are non-transactional — they don’t roll back, because rolling them back would require locking the sequence and serializing all inserts. So sequences guarantee uniqueness and monotonicity (within a session), but not density. Don’t depend on “no gaps.” If you do, you’re using sequences wrong.
8. The Judgment Calls
These are decisions where reasonable engineers disagree, and the right answer depends on your situation. A staff engineer earns the title by knowing the signals that point to the right choice, not by memorizing one answer.
8.1 Connection pooling: PgBouncer or built-in?
Situation: You have a web app with thousands of clients. You can’t have thousands of Postgres backends.
Option A (PgBouncer in front of Postgres, transaction pooling): Run PgBouncer as a separate process between your app and Postgres. Configure pool size to match Postgres’ max_connections minus headroom. Use transaction pooling for max multiplexing. Tradeoffs: another moving piece, debugging is harder (“is this query slow or am I queued in pgbouncer?”), session-level features break in transaction mode.
Option B (pool in your application driver): HikariCP, pgx pool, SQLAlchemy pool. Each app instance maintains its own pool. Tradeoffs: simpler stack, but if you have N app instances each holding M connections, you still need N×M backends. With more than a few app instances, this stops scaling.
What experienced engineers actually do: Both, in series. App-side pool sized small (5–20), pointing at PgBouncer, which holds the real connections to Postgres. App-side pool eliminates connection-establishment overhead per request; PgBouncer eliminates fan-out. This is the default architecture for any app at meaningful scale.
Signal: how many app instances do you have? <3, you can probably skip PgBouncer. >5, you definitely need it.
8.2 Synchronous vs asynchronous replication
Situation: You’re setting up a replica for HA.
Option A (asynchronous, the default): Primary commits as soon as WAL is local-fsync’d. Standby may lag by milliseconds to minutes. On failover, you might lose committed transactions that hadn’t replicated yet.
Option B (synchronous, synchronous_commit = on + synchronous_standby_names): Primary waits for the standby to confirm WAL receipt before COMMIT returns. No data loss on failover. Higher commit latency. If the synchronous standby goes down, the primary blocks until you remove it from the list.
What experienced engineers actually do: Async by default. Sync when the cost of one lost transaction exceeds the cost of higher commit latency — financial systems, compliance scenarios. Use synchronous_commit = remote_write (cheaper than remote_apply) as a middle ground: standby has it in WAL but maybe hasn’t applied it yet, which is “good enough” for durability.
Signal: what’s your acceptable RPO (Recovery Point Objective)? “Seconds of data loss is fine” → async. “Zero data loss” → sync, with multiple standbys to survive any single one going offline.
8.3 Physical vs logical replication
Situation: You need data on a second machine.
Option A (physical / streaming): Byte-for-byte copy. The standby is identical to the primary. Read-only on the standby; can’t have different schema, can’t write. All databases replicated. Versions must match.
Option B (logical): Row-level change stream. Subscriber is independent — can have different schema, additional indexes, can write its own data. Selective: replicate specific tables. Versions can differ.
What experienced engineers actually do: Physical for HA and read scaling. Logical for migrations (cutover from old to new), for feeding analytics warehouses, for selective replication, for building event-sourced systems.
Signal: do the source and destination have the same schema and same Postgres version, and do you want all the data? Physical. Otherwise, logical.
8.4 Indexing: what to index and what not to
Situation: Should this column have an index?
Option A (index it): Faster reads matching the column. Option B (don’t): No write overhead, no disk space.
What experienced engineers actually do: Index columns in WHERE/JOIN/ORDER BY of frequent queries. Don’t index everything. Specifically:
- Foreign keys: yes, almost always — joins use them and so does cascade delete.
- Boolean flags: usually no, unless very skewed (a partial index might help).
- Low-cardinality columns: usually no, planner won’t use them anyway.
- Columns updated very frequently: think twice — every update touches the index too.
The pg_stat_user_indexes view shows idx_scan per index. Indexes with idx_scan = 0 after weeks of production traffic are pure overhead. Drop them.
Signal: write EXPLAIN ANALYZE first, see if seq scan is genuinely a problem, then add the index. Don’t speculate.
8.5 jsonb vs normalized columns
Situation: You have a flexible payload — a metadata field with varying keys per row.
Option A (jsonb): Stash the whole blob. Index with GIN if you query keys. Option B (extract into columns): Convert known keys into proper columns with constraints and types.
What experienced engineers actually do: Use jsonb for genuine flexibility — user-defined fields, third-party API responses, configurations. Do not use it as a lazy schema shortcut. If you find yourself extracting metadata->>'email' in five places, email should be a column. The performance gap on indexed queries is significant; the readability gap is huge.
A common pattern: store the raw blob in a jsonb column for archival, but mirror commonly-queried fields into actual columns updated by trigger or application code.
Signal: are you querying or filtering on keys inside the jsonb? Yes → those should be columns or at least have GIN indexes. Just storing it for later display? jsonb is fine.
8.6 Partitioning: when?
Situation: Your table is getting big.
Option A (don’t partition): Single table, indexes, vacuum. Option B (declarative partitioning): Split by range (date ranges), list (categorical), or hash (load balancing).
What experienced engineers actually do: Partition when you have a clear partition key and clear benefits — typically time-series data where you’ll drop old partitions cheaply, or huge tables where index size on a single table is causing pain. For tables under ~100GB without a natural partition key, don’t bother — partitioning adds complexity.
When you do partition, by date is by far the most common pattern (one partition per month or week). pg_partman automates partition creation and maintenance.
Signal: do you have a natural partition key (a date column you always query by) AND a maintenance need partitioning solves (drop old data fast, or split work for parallel scans)? If both yes, partition. If only one, probably not yet.
8.7 TEXT vs VARCHAR(N)
Situation: A new column for a string.
Option A (text): No length limit. Postgres native preference.
Option B (varchar(n)): Limit to N characters.
What experienced engineers actually do: Use text. There’s no performance difference. The only time varchar(n) is useful is when you have a hard external requirement that the value not exceed N characters — and even then, a CHECK (length(col) <= N) constraint is cleaner.
Signal: do you have a real length limit imposed by an external system or compliance? Then varchar(n) or check constraint. Otherwise, text.
8.8 Hash index vs btree
Situation: You’re indexing a column you only ever query with =.
Option A (btree): Default. Supports equality, range, ordering. Option B (hash): Equality only. Smaller. Faster for exact-match.
What experienced engineers actually do: Almost always btree, even for equality. Pre-PG 10, hash indexes weren’t WAL-logged (didn’t replicate, weren’t crash-safe) — that bias persists culturally. Modern hash indexes are fully durable. They can be slightly faster for equality on large keys (long strings). But btree has no real downside for equality, and you might want range later. Hash is rarely worth the special case.
Signal: is this a perf-critical equality lookup with >100GB of data and you’ve benchmarked? Maybe hash. Otherwise btree.
8.9 Migration strategy: blocking vs concurrent
Situation: You need to add an index, add a column, change a type, or drop a column on a busy table.
Option A (the obvious way): CREATE INDEX..., ALTER TABLE.... Takes ACCESS EXCLUSIVE, blocks everything for as long as the operation takes. Fine for small tables; outage for big ones.
Option B (the careful way): Use CONCURRENTLY where available; multi-step migrations otherwise. Adding a non-null column with a default: in older Postgres, this rewrites the entire table; in PG 11+, defaults are stored as metadata. Adding a foreign key: use NOT VALID first, then VALIDATE CONSTRAINT separately (the validate is just a scan, no lock escalation).
What experienced engineers actually do: Treat every schema change as potentially blocking. The right defaults for any production migration:
- Indexes: always
CONCURRENTLY. - New columns: nullable, no default, then backfill, then add
NOT NULLlast. - Column type changes: USING expression cast can rewrite the table; sometimes safer to add new column, copy, rename.
- Dropping columns:
ALTER TABLE DROP COLUMNis a metadata operation and fast — but the data isn’t actually freed until vacuum.
pg_repack is your friend for online table rewrites. Tools like pgroll or gh-ost-style approaches automate the “shadow column / shadow table” pattern.
Signal: how big is the table and how busy is it? Tiny + quiet → just do it. Big + busy → multi-step migration.
8.10 Tuning autovacuum: aggressive or default?
Situation: Your bloat is bad. Autovacuum isn’t keeping up.
Option A (default settings): Conservative thresholds, low cost limits. Designed not to interfere with foreground work.
Option B (aggressive): Lower scale factors, higher cost limits, more workers. Vacuum runs more often, completes faster, uses more I/O.
What experienced engineers actually do: Tune per table. The default autovacuum_vacuum_scale_factor = 0.2 means “wait until 20% of the table is dead.” On a 1B-row table that’s 200M dead tuples — way too late. Lower it per table: ALTER TABLE busy_table SET (autovacuum_vacuum_scale_factor = 0.02). Raise autovacuum_vacuum_cost_limit (default 200) globally if your I/O can take it — modern NVMe can easily handle 1000+. Raise autovacuum_max_workers if many tables need attention.
The trap: aggressive tuning on every table compounds. Better: identify the 5–10 hottest tables, tune those.
Signal: monitor pg_stat_user_tables.n_dead_tup over time. Tables where dead tuples grow between vacuum runs need lower thresholds. Tables where vacuum takes longer each run have growing bloat — go look.
8.11 Where to put work_mem
Situation: You have analytical queries that spill to disk.
Option A (raise globally): work_mem = 256MB in postgresql.conf.
Option B (raise per session/query): SET work_mem = '1GB' before the query.
What experienced engineers actually do: Set conservative globally (16–64MB), bump per session for the queries that need it. Why? Mental Model 4: each backend can use work_mem per sort/hash node. A query with 5 sorts in 100 connections = 500 × work_mem simultaneously. Set globally to 1GB and you can OOM the box.
For OLAP queries on a dedicated reporting connection, set work_mem high in that session only. For an OLTP-heavy server, leave it conservative.
Signal: are you running analytical queries that show “sort method: external merge” in EXPLAIN? They want more work_mem. Set per session.
8.12 RDS / managed Postgres vs self-managed
Situation: You need to run Postgres in production.
Option A (managed): AWS RDS/Aurora, GCP Cloud SQL, Azure DB, Supabase, Neon, Crunchy Bridge. Backups, replication, patching all handled.
Option B (self-managed): You run Postgres yourself on VMs or bare metal.
What experienced engineers actually do: Managed, almost always. The operational toil of running Postgres yourself — backups, replication, upgrades, monitoring, point-in-time recovery, failover testing — is non-trivial. Managed services solve that for a markup. Self-managed makes sense at very large scale (when the markup matters more than the toil), or when you need features the managed service doesn’t expose (specific extensions, custom replication topologies, kernel tuning).
The trap: managed services hide things. You can’t see your processes, you can’t tune kernel parameters, you can’t install arbitrary extensions, you might have a different (often older) Postgres version. AWS Aurora is not really Postgres — it’s a Postgres-compatible engine with a different storage layer; some behaviors differ subtly.
Signal: does your team have a database engineer? Self-managed makes sense. Does “the database goes down at 3am” mean nobody knows what to do? Managed.
9. The Commands and APIs That Actually Matter
Quick-reference grouped by task. The “why” not just the “what.”
Connecting and exploring
psql -h host -p 5432 -U user -d db
\c dbname -- switch databases
\l -- list databases
\dt -- list tables in current schema
\dt schema.* -- tables in a schema
\d tablename -- describe a table (columns, indexes, constraints)
\di+ -- list indexes with sizes
\du -- list roles
\df -- list functions
\dx -- list installed extensions
\dn -- list schemas
\timing on -- show query duration
\x -- expanded display (great for wide rows)
\watch 1 -- rerun last query every second
\e -- edit last query in $EDITOR
\copy -- bulk import/export client-side
\? -- list all metacommands
psql is the operator’s interface. Learn it. Aliases worth setting in ~/.psqlrc:
\set ON_ERROR_ROLLBACK interactive
\set HISTFILE ~/.psql_history- :DBNAME
\timing on
\set VERBOSITY verbose
ON_ERROR_ROLLBACK interactive is the killer feature: in a transaction, if a statement errors, psql automatically SAVEPOINTs and rolls back to before it, so you can keep working instead of having to ROLLBACK and start over.
Inspecting what’s happening
-- Active queries, with how long they've been running
SELECT pid, usename, application_name, state, wait_event_type, wait_event,
age(clock_timestamp(), query_start) AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND pid != pg_backend_pid()
ORDER BY query_start;
-- Long-running transactions (the ones that block vacuum)
SELECT pid, usename, age(clock_timestamp(), xact_start) AS xact_age, state, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
ORDER BY xact_start;
-- Locks: who is blocking whom
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
blocking.pid AS blocking_pid, blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
-- Kill a query (be careful)
SELECT pg_cancel_backend(pid); -- polite, sends SIGINT
SELECT pg_terminate_backend(pid); -- impolite, kills the backend (closes connection)
pg_blocking_pids() is a function added in PG 9.6 that returns the PIDs blocking a given PID. Before that, “who is blocking whom” required a much uglier join.
Performance investigation
-- Top queries by total time spent (requires pg_stat_statements)
SELECT calls, mean_exec_time::int AS mean_ms, total_exec_time::int AS total_ms,
(100 * total_exec_time / sum(total_exec_time) OVER ())::int AS pct,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 30;
-- Reset pg_stat_statements stats
SELECT pg_stat_statements_reset();
-- Cache hit ratio (>99% is healthy for OLTP)
SELECT sum(blks_hit)::float / nullif(sum(blks_hit + blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
-- Index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Indexes that are never used
SELECT schemaname, relname, indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) is the canonical incantation. The SETTINGS option (PG 12+) shows non-default config affecting the plan, which surfaces “oh, enable_seqscan = off is set in this session” gotchas.
Sizing things
-- Database size
SELECT pg_size_pretty(pg_database_size('mydb'));
-- Tables sorted by size (heap + indexes + toast)
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total,
pg_size_pretty(pg_relation_size(oid)) AS heap,
pg_size_pretty(pg_total_relation_size(oid) - pg_relation_size(oid)) AS indexes_and_toast
FROM pg_class WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC LIMIT 20;
-- Bloat estimate (approximate; pgstattuple is exact but expensive)
SELECT schemaname, relname, n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 20;
pg_total_relation_size includes the heap, all indexes, and the TOAST table (out-of-line storage for large values). pg_relation_size is just the heap. Most “table size” alarms are answered by pg_total_relation_size.
Maintenance
VACUUM; -- vacuum the current database (all tables) — slow, rarely done manually
VACUUM tablename; -- vacuum one table
VACUUM (VERBOSE, ANALYZE) tablename;
VACUUM FULL tablename; -- ACCESS EXCLUSIVE; rewrites the table
ANALYZE tablename; -- update statistics only
REINDEX TABLE tablename; -- rebuild all indexes (locking)
REINDEX INDEX CONCURRENTLY indexname; -- rebuild one index, online
CLUSTER tablename USING indexname; -- physically reorder by index (locking)
In production you almost never run these manually except ANALYZE (after bulk loads) and REINDEX CONCURRENTLY (for bloated indexes). Trust autovacuum for the rest.
Replication and WAL
-- Are we a primary or replica?
SELECT pg_is_in_recovery();
-- Where are we in the WAL? (primary)
SELECT pg_current_wal_lsn();
-- Replication status (primary)
SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- Replication slots
SELECT slot_name, slot_type, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
-- On a replica: how far behind?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Force a checkpoint (for testing)
CHECKPOINT;
Roles and permissions
CREATE ROLE alice WITH LOGIN PASSWORD '...';
CREATE ROLE devs; -- group role, no LOGIN
GRANT devs TO alice;
GRANT CONNECT ON DATABASE mydb TO devs;
GRANT USAGE ON SCHEMA public TO devs;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO devs;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO devs;
The ALTER DEFAULT PRIVILEGES is the one new users miss — without it, tables created in the future have no permissions for devs.
Schema introspection
The information_schema is the SQL-standard way; pg_catalog is the Postgres native way and has more detail. For most ops work, \d in psql is enough. For programmatic access:
-- All tables in schema
SELECT relname FROM pg_class JOIN pg_namespace ns ON ns.oid = relnamespace
WHERE nspname = 'public' AND relkind = 'r';
-- All indexes on a table
SELECT i.relname, idx.indisunique, idx.indisprimary, pg_get_indexdef(idx.indexrelid)
FROM pg_index idx
JOIN pg_class i ON i.oid = idx.indexrelid
JOIN pg_class t ON t.oid = idx.indrelid
WHERE t.relname = 'mytable';
Backup essentials
# Logical backup of one database
pg_dump -h host -U user -d mydb -Fc -f mydb.dump
# Restore
pg_restore -h host -U user -d mydb mydb.dump
# Physical base backup (creates a startable data directory)
pg_basebackup -h host -U replication_user -D ./backup -X stream -P
# WAL receiver (for continuous archive)
pg_receivewal -h host -U replication_user -D ./wal_archive
Use -Fc (custom format) for pg_dump. It’s compressed, supports parallel restore, and lets you cherry-pick tables on restore (pg_restore -t tablename). Plain SQL dumps are fine for small databases and for diff/version-control purposes.
Configuration
SHOW shared_buffers;
SHOW ALL; -- all settings
SELECT * FROM pg_settings WHERE name LIKE '%vacuum%';
-- Set for cluster (writes postgresql.auto.conf)
ALTER SYSTEM SET work_mem = '32MB';
SELECT pg_reload_conf(); -- if the setting is reload-able
-- Set for current session only
SET work_mem = '256MB';
-- Set for one transaction
BEGIN;
SET LOCAL statement_timeout = '5s';
...
COMMIT;
ALTER SYSTEM is the right way to change settings on a running server — it survives restarts and writes to a file that the cluster reads on startup. Direct postgresql.conf editing also works but is harder to automate.
10. How It Breaks
When Postgres misbehaves, the failure modes cluster into a small number of patterns. Recognizing the pattern from the symptom is the experienced-engineer skill.
Symptom: “Disk is filling up”
Possible causes:
- WAL is accumulating — replication slot stuck,
archive_commandfailing, ormax_wal_sizetoo small for write rate. Checkdu -sh pg_wal/andpg_replication_slots. - Bloat — autovacuum can’t keep up, or is blocked by long-running transactions/replication slots. Check
n_dead_tupper table; check oldestxact_startinpg_stat_activity. - Logs accumulating —
log_*settings are too verbose. Checklog_directory. - Genuine data growth — your app actually wrote that much. Look at
pg_database_sizeover time.
Diagnose:
du -sh /var/lib/postgresql/<version>/main/*
This decomposes your data dir. base/ is your tables; pg_wal/ is WAL; log/ is logs. Whichever is biggest tells you the category.
Symptom: “Queries are getting slower”
Possible causes:
- Bloat — table has grown 5x its real data size, scans are taking 5x longer.
- Stale statistics — planner picks bad plans.
EXPLAIN ANALYZEand check estimate vs actual. - Index bloat — index larger than expected, more pages to traverse.
- Working set exceeded RAM — cache hit ratio dropped, queries reading from disk.
- Concurrent contention — locks, especially row locks via
SELECT FOR UPDATEpatterns.
Diagnose:
pg_stat_statementsto identify which queries got slower.EXPLAIN (ANALYZE, BUFFERS)on the slow ones.- Check buffer cache hit ratio.
SELECT * FROM pg_locks WHERE not granted— anyone waiting on locks?
Symptom: “Connections refused / too many clients”
Mental Model 4. You’re at max_connections. Either tune your pooler down (you don’t actually need that many backends), or check for leaked connections (idle transactions, processes that never close), or — last — raise max_connections (carefully, costs RAM).
Diagnose: SELECT count(*), state FROM pg_stat_activity GROUP BY state. Lots of idle from one app instance? Pool leak. Lots of idle in transaction? Unclosed transactions. Lots of active? Genuine load.
Symptom: “Replica is lagging”
Possible causes:
- Network bandwidth saturated — primary writes WAL faster than network can ship.
- Replica I/O can’t keep up — replica disk is slower than primary’s, can’t apply WAL fast enough.
- Standby is running long queries —
hot_standby_feedbackkeeping vacuum from running on primary, but actually the lag is on the replay side, paused due to a conflict. - Single-threaded replay — physical replication is single-threaded for replay (with some parallel apply in modern versions). Heavy index builds on primary serialize on replica.
Diagnose:
-- On primary
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag
FROM pg_stat_replication;
The four LSNs (sent → write → flush → replay) tell you exactly where the bottleneck is. Sent ≈ replica’s write_lsn means network is fine; flush_lsn lagging means replica disk is slow; replay_lsn lagging means replica CPU/single-thread is the bottleneck.
Symptom: “It just stopped accepting writes”
Wraparound. Check the logs for “must be vacuumed within X transactions.” Or you ran out of disk for WAL. Or synchronous_standby is configured but no standby is connected.
For wraparound: shut down, restart in single-user mode, run VACUUM FREEZE on the database, restart. Then figure out what blocked vacuum (long transactions, slots, prepared transactions).
Symptom: “Deadlock detected”
Two transactions each hold a lock the other wants. Postgres detects this after deadlock_timeout (default 1s) and aborts one with ERROR: deadlock detected. The error message tells you which two queries were deadlocked and which won.
The fix is application-level: ensure transactions acquire locks in a consistent order. If you always update tables in alphabetical order, or always update users-then-orders (never the reverse), deadlocks vanish.
Symptom: “ERROR: tuple concurrently updated”
You did something like a concurrent ALTER TABLE from two sessions, or you have two sessions trying to update the same catalog row. Rare in normal workloads. If you see it in normal traffic, you have a deeper concurrency bug — probably trying to do schema-like operations in a hot path.
Symptom: “FATAL: out of memory”
Almost always: too high work_mem × too many concurrent queries, or a single query trying to hash a huge result. The OOM killer might also kick in at the OS level — check dmesg. work_mem = 256MB × 100 backends × 4 nodes per query = 100GB possible.
Fix: lower work_mem globally, raise per-query for the few that need it. Make sure the OS isn’t overcommitting memory (vm.overcommit_memory = 2).
Symptom: “Query plan changed and now it’s slow”
Statistics drifted, or the data distribution changed. Run ANALYZE and re-check. If still bad, raise default_statistics_target (or per-column SET STATISTICS). If still bad, look for correlated columns and add CREATE STATISTICS.
If you’re getting different plans for the same query at different times, that’s prepared statement custom vs generic plan logic — Postgres tries both for the first 5 executions, then picks based on cost. Sometimes the generic plan is bad. plan_cache_mode = force_custom_plan per session/query is the escape.
A general debugging workflow
When something is wrong with Postgres, run these in order:
pg_stat_activity— what’s running right now? Anything stuck? Anything that’s been running for hours?pg_locksjoined withpg_stat_activity— anyone blocked on locks? By whom?pg_stat_statements(if installed) — what’s been expensive over time?- Disk usage (
du -shon data dir subdirs) — what’s eating space? pg_stat_replication— replication healthy?- Server log —
tailit. Postgres logs are conservative by default; turn uplog_min_duration_statement,log_lock_waits,log_autovacuum_min_durationand you’ll see the patterns. - OS metrics — CPU, disk I/O (especially I/O wait), memory, network. Postgres performance is dominated by I/O; if disk is saturated, no amount of SQL tuning will help.
The order matters. Always start with what’s running now; don’t grep the logs for problems while a query is hung in front of you. Resolve immediate pain first, then root-cause.
11. The Taste Test
What does Postgres usage by an experienced engineer look like? Side-by-side, beginner vs experienced:
Schema design
beginner:
CREATE TABLE users (
id varchar(36) PRIMARY KEY,
email varchar(255),
created_date varchar(20),
status int,
metadata text
);
experienced:
CREATE TABLE users (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
email text NOT NULL UNIQUE,
created_at timestamptz NOT NULL DEFAULT now(),
status user_status NOT NULL DEFAULT 'pending',
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE TYPE user_status AS ENUM ('pending', 'active', 'suspended', 'deleted');
CREATE INDEX ON users (created_at);
The experienced version: uuid instead of varchar string, text instead of varchar(255), timestamptz instead of stringy date, an actual enum instead of magic ints, jsonb instead of text, NOT NULL constraints, sensible defaults, an index on the column that’ll be used for ordering.
Migrations
beginner:
ALTER TABLE big_table ADD COLUMN new_col text NOT NULL DEFAULT 'foo';
CREATE INDEX idx_users_email ON users (email);
experienced:
-- Step 1
ALTER TABLE big_table ADD COLUMN new_col text;
-- Step 2 (separate deployment, after backfill)
ALTER TABLE big_table ALTER COLUMN new_col SET NOT NULL;
-- Step 3
ALTER TABLE big_table ALTER COLUMN new_col SET DEFAULT 'foo';
-- Indexes always concurrent
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
The beginner version locks big_table while it rewrites the entire table for the default fill (in old Postgres, anyway), and locks users for the duration of the index build. The experienced version does the same end-state without ever holding ACCESS EXCLUSIVE for long.
Queries on big tables
beginner:
SELECT * FROM events WHERE user_id = 42 ORDER BY created_at DESC;
experienced:
SELECT id, type, created_at, payload
FROM events
WHERE user_id = 42 AND created_at > now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;
The experienced version: SELECT * only when needed (returning 4MB rows uses memory and bandwidth); a time-bound restricts the index range and avoids reading old partitions; LIMIT so a misuse can’t pull a million rows.
Configuration
beginner: defaults.
experienced:
shared_buffers = 8GB # 25% of 32GB box
effective_cache_size = 24GB # 75%
work_mem = 32MB # OLTP: keep low; raise per session
maintenance_work_mem = 2GB # for VACUUM, CREATE INDEX
max_connections = 200 # paired with PgBouncer
checkpoint_timeout = 15min # less frequent than default 5min
max_wal_size = 8GB # accommodate writes between checkpoints
checkpoint_completion_target = 0.9 # spread the checkpoint I/O
default_statistics_target = 250 # better plans on skewed data
random_page_cost = 1.1 # SSD: random I/O is nearly seq
log_min_duration_statement = 100ms # log anything slower
log_lock_waits = on
log_autovacuum_min_duration = 1s
log_temp_files = 10MB # alert on disk-spilling sorts
shared_preload_libraries = 'pg_stat_statements,auto_explain'
These aren’t aggressive, they’re appropriate for a real server. Defaults are tuned for a small dev box.
Use of features
An experienced engineer reaches for:
RETURNINGinstead of follow-up queries.ON CONFLICT ... DO UPDATEfor upserts instead of select-then-insert-or-update logic.EXCLUDEconstraints for “no overlapping” invariants.LISTEN/NOTIFYfor lightweight pub/sub within Postgres.- Range types for periods and intervals.
- Window functions for ranking and running totals.
generated always as (... ) storedfor derived columns.CREATE STATISTICSfor correlated columns.- Partial and expression indexes when they buy meaningful selectivity.
An experienced engineer avoids:
- Storing JSON when proper columns will do.
SELECT *on the request path.- Triggers as application logic (“hidden control flow” is hard to reason about).
- Stored procedures for business logic (move it to the application).
- Polling tables instead of
LISTEN/NOTIFYor proper queues. - Long-running transactions (always).
VACUUM FULL(almost always — usepg_repack).- DDL during peak traffic (do it during off-hours, with
lock_timeoutset).
Code review red flags
When you read someone else’s SQL or migration, you should reflexively flag:
- Indexes added without
CONCURRENTLYin a migration on a non-trivial table. SELECT *in production code, especially on tables with TOAST or wide rows.varchar(N)with arbitrary N — that’s a SQL Server tic, not a Postgres one.- Timestamps without timezone — almost always a bug.
SERIALin new tables — useIDENTITY(PG 10+).- Foreign keys without indexes on the referencing column — every cascade and join will scan.
- JSON columns without GIN indexes if you query into them.
COUNT(*)in dashboards or hot paths on big tables — use approximations or summary tables.- No
LIMITon user-facing queries — pagination is mandatory. ORDER BY ... LIMIT 1instead ofORDER BY ... LIMIT 1with an index on the order key — easy to miss the second part.- Bare
SET search_paththat won’t survive pgbouncer’s transaction pooling. - Bulk inserts via INSERT instead of COPY.
pg_dumpagainst a primary during business hours on a busy database — it holds a snapshot for the whole dump.- Application code with no statement_timeout — one runaway query can cripple the database.
The shared character of “good Postgres usage” is respect for the engine. The engine is doing a lot of work to give you ACID, MVCC, durability, replication, extensibility — engineers who know it work with it. Engineers who don’t know it work against it and pay in production pain.
12. Where to Go Deeper
If you want to know more — and you should — these are the resources worth your time, in order of value-per-hour-spent.
The official Postgres documentation (https://www.postgresql.org/docs/current/). This is the rare case where the official docs are actually the best resource. Specifically: Chapter 13 (Concurrency Control), Chapter 14 (Performance Tips), Chapter 19 (Server Configuration), Chapter 25 (Reliability and the WAL), Chapter 26 (Replication), and Chapter 51 (Internals overview). Don’t read it linearly — read the chapter for what you’re working on. The writing is precise; the cross-references are reliable.
“PostgreSQL 14 Internals” by Egor Rogov (free PDF from Postgres Pro). The single best deep-dive book on how Postgres works internally. Covers MVCC, WAL, vacuum, locking, indexes, and the planner with the right level of detail — not a surface-level tour, not a source-code ride-along, but the conceptual machinery. Read this once you’re past the “how do I use it” stage.
“The Internals of PostgreSQL” by Hironobu Suzuki (https://www.interdb.jp/pg/). Free online book, similar territory to Rogov’s, with diagrams. Especially good on the buffer manager, WAL writing, and process structure. Some sections show their age (it predates several modern versions) but the fundamentals are unchanged.
Bruce Momjian’s tech presentations (momjian.us/main/presentations/internals.html). Bruce is a Postgres committer who has been giving conference talks on internals for two decades. The slide decks are more than slides — they’re essentially explanatory documents. “MVCC Unmasked” is a great one. “Inside PostgreSQL Shared Memory” is another.
pg_stat_statements and auto_explain — install both on every Postgres you operate. You’ll learn more about your application’s actual performance from a week of pg_stat_statements data than from a year of guessing. auto_explain automatically logs slow query plans; combined with a tool like pgBadger or pganalyze, you get continuous performance visibility.
Build something to experience it. Run psql against a local Postgres, create a high-update table, watch it bloat with \dt+, watch vacuum run with VACUUM (VERBOSE), watch WAL grow in pg_wal/, set up streaming replication between two local instances and pull the network cable. Half an afternoon doing this will teach you more than a week of reading.
Postgres-Weekly (newsletter) and Planet PostgreSQL (planet.postgresql.org) are good sources for ongoing learning — features in new versions, performance writeups, war stories. Most blog posts won’t be useful to you in their detail, but skimming them tunes your sense of “what’s possible.”
Source code (https://github.com/postgres/postgres). A surprising thing about Postgres: the source is readable. The C is well-commented and the architecture is clean. If you’ve gotten this far and want to understand a specific behavior, src/backend/storage/buffer/, src/backend/access/heap/, src/backend/storage/ipc/, and src/backend/postmaster/postmaster.c are the high-value reads. Don’t try to read the whole thing — pick a behavior, find the code, read just that.
That’s the document. If you read it carefully, you should now be able to:
- Predict why a long-running transaction is making your disk fill up.
- Explain why your hot table is twice the size it should be despite a fixed row count.
- Read an
EXPLAIN ANALYZEoutput and decide whether the planner is wrong or your indexes are. - Configure a basic production Postgres instance with sensible memory and WAL settings.
- Know what to check first when something is slow.
- Have an opinion on when to use logical vs physical replication.
- Understand why
pgbouncerexists and what its tradeoffs are.
That puts you ahead of most people calling themselves Postgres users. The rest is reps — running into things, recognizing them, fixing them, and reading the docs when something doesn’t fit your model. The model in this document is the scaffolding; production experience is what fills it in.
The ideas are mine. The writing is AI assisted