Airflow Deep Intuition
An experienced engineer's guide to Airflow
1. One-Sentence Essence
Airflow is a Python-driven scheduler for batch workflows: it parses Python files into dependency graphs of tasks, decides which tasks are ready to run based on their dependencies and the clock, and hands them to a worker process — and that’s it.
Internalize that and most of Airflow’s quirks stop being surprising. It is not a data processing engine. It does not move your data. It does not transform your data. It is a glorified cron with a dependency-aware brain, a database to remember what it has done, and a UI to show you what is broken. Everything else — the operators, the sensors, the XComs, the executors — is plumbing built around that core job.
This essence has direct consequences. Because Airflow is just a scheduler, it shouldn’t be doing your data work — it should be telling other systems (dbt, Spark, Snowflake, Kubernetes, your Python script) to do the data work. Because it parses Python, the cost of parsing is paid continuously, not just at startup. Because it’s batch-oriented, it has historically been awkward at event-driven and streaming workloads. Hold the essence, and the rest follows.
2. The Problem It Solved
Before Airflow, you had cron. Cron worked great for “run this script every night at 2am.” It worked badly for everything else.
The actual world looks like this: you ingest data from a vendor (which might arrive at 1am, or 3am, or not at all). Then you transform it, but only after the ingest succeeds. Then you load it into a warehouse, but only after the transform succeeds. Then you fire a downstream report, but only if both the warehouse load and a separate marketing data feed are ready. Each step can fail. Each step can be slow. Each step might need to be retried. And six months from now you will need to rerun the entire chain for last Tuesday because someone discovered a bug in the upstream source.
Cron handles none of this. It fires jobs at fixed times and walks away. To express “run B after A succeeds” in cron, people built towers of scripts that wrote sentinel files, polled databases, and sent shell exit codes through pipes. These towers always grew. They were always brittle. Nobody could explain them.
The specific frustration that drove Airflow’s creation, by Maxime Beauchemin at Airbnb in 2014, was this: a data engineer’s real job is expressing dependencies between batch tasks and reasoning about their state over time, but every available tool forced you to express those dependencies indirectly through filesystem hacks and brittle shell glue. Airflow’s insight was: let the engineer write the dependency graph directly, in Python, and have the system figure out scheduling, retries, backfills, monitoring, and history on top of that graph.
The other key design decision: the workflow definition is code. Not YAML. Not a GUI drag-and-drop. Python files. This was controversial at the time and is still the defining property of Airflow. It’s also the source of half of Airflow’s gotchas — see Section 7.
3. The Concepts You Need
You cannot reason about Airflow without these. Skim them once now; come back when later sections reference them.
Workflow structure
-
DAG (Directed Acyclic Graph) — a Python object that represents a workflow: a set of tasks plus the dependencies between them. “Acyclic” matters: A → B → A would be a cycle, and Airflow rejects it. A DAG has a
dag_id(unique name), aschedule(when to run), and astart_date(the earliest moment it makes sense to run). The DAG file is just a Python script that, when executed, defines a DAG object in its module namespace. -
Task — a single unit of work in a DAG. Tasks have IDs, dependencies, and behavior. A task is the abstract description of what to do; a task instance is a specific occurrence of it.
-
Operator — a class that implements a kind of task.
BashOperatorruns a bash command.PythonOperatorcalls a Python function.PostgresOperatorruns SQL. There are hundreds, packaged as providers (e.g.apache-airflow-providers-amazon). When you instantiate an operator inside a DAG, you create a task. Internally everything is a subclass ofBaseOperator. -
Sensor — a special kind of operator whose job is to wait for something: a file to appear, a partition to be written, an upstream DAG to finish. Same dependency model as operators, just designed for “block until X is true.” See Section 7 for why naive sensor usage destroys clusters.
-
Task Instance — a specific run of a task for a specific DAG run. Same task, different days = different task instances. State (success, failed, running, queued, up_for_retry, skipped, etc.) lives at the task instance level, not the task level.
-
DAG Run — one execution of an entire DAG corresponding to one logical date. A daily DAG that has run for 30 days has 30 DAG runs.
-
TaskFlow API — a decorator-based syntax (
@dag,@task) introduced in Airflow 2 that lets you write DAGs as if they were ordinary Python functions, with return values automatically passed through XCom. It’s syntactic sugar over the operator/XCom model — the underlying mechanics are unchanged.
Time
This cluster is the most confusing part of Airflow. Read it twice.
-
Logical date (formerly execution_date) — a timestamp that identifies a DAG run. It is not when the DAG actually executed. For a daily DAG run on January 5th, the logical date is typically January 4th, because the run is processing the data for January 4th. The name “execution_date” misled an entire generation; in Airflow 2.2+ it was renamed
logical_dateto fight that confusion. It still confuses people. -
Data interval — the time range a DAG run is responsible for. For
@daily, the interval for the run with logical date2026-01-04is[2026-01-04 00:00, 2026-01-05 00:00). The DAG run actually executes at the END of its data interval — i.e., shortly after2026-01-05 00:00. This is by design: at the end of the interval, all the data for the interval is available. Two specific variables expose this:data_interval_startanddata_interval_end. Use these. Stop usingexecution_dateandds. -
start_date — the earliest data interval the DAG should consider. Sets the floor for scheduling. Don’t change this casually after deployment — see “Things That Bite You.”
-
catchup — a boolean. If
True, when you deploy a DAG with astart_datein the past, Airflow will create DAG runs for every missed interval betweenstart_dateand now and run them. IfFalse, only the most recent interval gets a run. Default tocatchup=False. You almost never want catchup unless you’ve thought carefully about whether your tasks are idempotent across history. -
Backfill — explicitly re-running a DAG over a historical date range. Different from catchup: backfill is something you trigger deliberately for a specific window.
-
Schedule — when the DAG should run. Can be a cron expression (
"0 2 * * *"), a preset ("@daily"), atimedelta, an Asset (data-aware), or a custom Timetable.
Execution
-
Scheduler — the long-running process that, every few seconds, looks at the database and asks: which DAG runs need to be created? Which task instances are ready to run? It does NOT execute tasks. It just decides what should run and tells the executor.
-
Executor — the component that takes “this task should run” and actually causes it to run, somewhere. Five common ones:
SequentialExecutor(one at a time, dev only),LocalExecutor(subprocesses on the scheduler box),CeleryExecutor(a worker pool with a Redis/RabbitMQ broker),KubernetesExecutor(one pod per task), and as of Airflow 3 theEdgeExecutor(remote workers calling back over HTTP). -
Worker — a process that actually runs task code. For LocalExecutor, workers are subprocesses of the scheduler. For CeleryExecutor, workers are separate machines running
celery worker. For KubernetesExecutor, “the worker” is an ephemeral pod. -
Triggerer — a separate process (added in 2.2) that runs deferrable operators. When a task says “I’m waiting for an external event, wake me when it happens,” the worker slot is freed and the wait is handed to the triggerer, which uses asyncio to monitor thousands of waits cheaply on a single process. If you have lots of sensors, you need this. You really need this.
-
Webserver / API server — serves the UI and (in Airflow 3) the API that workers use to talk to the metadata database.
-
DAG processor — the process that parses your Python DAG files into DAG objects and writes them, serialized, into the metadata database. In Airflow 2 this was bundled inside the scheduler; in Airflow 3 it’s mandatory and separate.
-
Metadata database — a Postgres or MySQL database that stores everything: DAG definitions (serialized), DAG runs, task instances, XComs, connections, variables, logs of state transitions. The brain. If it goes down, Airflow stops.
Cross-task communication
-
XCom (cross-communication) — a key/value store, scoped to (dag_id, task_id, run_id, key), backed by the metadata database. Tasks push values to XCom; downstream tasks pull them. Limited in size by the database column type (typically 1-2 GB on Postgres, but in practice you should treat the limit as a few KB — see Section 7).
-
Connection — a stored credential/endpoint definition (Postgres URL, AWS keys, Slack token, etc.). Referenced by
conn_id. Stored encrypted in the metadata database (or in a Secrets Backend). -
Variable — a global key/value pair, also stored in the metadata database. Like XCom but not scoped to a run.
-
Hook — a thin Python wrapper around a Connection.
PostgresHook("warehouse_conn")gives you a Postgres client preconfigured with the credentials. Operators are usually just hooks plus task-level wiring.
Reliability
-
Trigger rule — controls when a task should run based on the state of its upstream tasks. Default:
all_success(run if all upstream tasks succeeded). Others:all_done,one_success,one_failed,none_failed,all_failed. Used for fan-in patterns, cleanup tasks, and conditional logic. -
Retries — operators have
retriesandretry_delayparameters. Set them at default-args level and override at task level when needed. -
Pool — a named bucket of concurrency slots. You can put a task in a pool with N slots, and Airflow guarantees no more than N task instances in that pool run simultaneously. Used to throttle access to flaky external systems.
-
Asset (formerly Dataset) — a logical handle to a piece of data (“the orders table,” “the bronze layer for region X”). One DAG declares it produces an asset; another DAG schedules itself to run when that asset updates. This is Airflow’s attempt to support data-aware scheduling, modernized in Airflow 3.
4. The Distilled Introduction
Setup
In production, you don’t install Airflow on your laptop and call it done. But to understand it, do this:
# Use a fresh virtualenv. Airflow has many pinned dependencies.
python -m venv venv && source venv/bin/activate
# The constraint file is mandatory; without it pip will resolve a
# random combination of versions that may not work together.
AIRFLOW_VERSION=3.0.0
PYTHON_VERSION=3.11
CONSTRAINT="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT}"
export AIRFLOW_HOME=~/airflow
airflow standalone # spins up scheduler + webserver + DB, prints admin password
airflow standalone is for trying things out. In production you run each component separately, with a real Postgres, behind systemd or a Helm chart. The point of standalone is to see the pieces work, then immediately stop trusting it for anything beyond exploration.
The key directories:
$AIRFLOW_HOME/dags/— drop your.pyfiles here. The DAG processor scans this folder.$AIRFLOW_HOME/airflow.cfg— main config. You’ll override most of it via env vars in production (AIRFLOW__CORE__EXECUTOR=CeleryExecutor, etc.).$AIRFLOW_HOME/logs/— where task logs go locally.
Your first DAG
# $AIRFLOW_HOME/dags/example.py
from datetime import datetime, timedelta
from airflow.sdk import dag, task # Airflow 3 import; 2.x uses airflow.decorators
@dag(
dag_id="hello_orders",
start_date=datetime(2026, 1, 1),
schedule="@daily",
catchup=False,
default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
tags=["tutorial"],
)
def hello_orders():
@task
def extract(data_interval_end=None):
# data_interval_end is auto-injected by Airflow.
# For the run with logical_date 2026-01-05, this is 2026-01-06 00:00.
return {"date": data_interval_end.strftime("%Y-%m-%d"), "count": 42}
@task
def transform(record: dict) -> dict:
record["count_doubled"] = record["count"] * 2
return record
@task
def load(record: dict):
print(f"Loaded {record}")
load(transform(extract()))
hello_orders()
Drop this in dags/. Within 30 seconds, the DAG processor parses it and the UI shows it. Toggle the DAG on. The scheduler creates a DAG run for the most recent completed interval, and your tasks run.
A few things just happened that are worth understanding:
- The file is parsed continuously. The DAG processor re-reads
example.pyeverymin_file_process_intervalseconds (default 30). Anything at the top level of the file runs every parse. This is why you keep top-level code light. (Section 7.) catchup=Falsesaved you. Without it, Airflow would have created a DAG run for every day from January 1st 2026 to today and run them all. Withcatchup=False, only the most recent interval runs.- The
@taskdecorator is sugar. Under the hood, Airflow wraps each function in a_PythonDecoratedOperator, treats the return value as an XCom push, and treats the function arguments as XCom pulls. The dependency graph (extract→transform→load) is inferred from the call structure.
The classic (non-TaskFlow) syntax
You’ll see this everywhere in real codebases:
from airflow import DAG
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.operators.python import PythonOperator
with DAG(
dag_id="classic_style",
start_date=datetime(2026, 1, 1),
schedule="@daily",
catchup=False,
) as dag:
extract = BashOperator(
task_id="extract",
bash_command="curl -o /tmp/data.json https://api.example.com/orders/{{ ds }}",
)
def _transform(**context):
ds = context["data_interval_end"].strftime("%Y-%m-%d")
# ... do stuff
return ds
transform = PythonOperator(task_id="transform", python_callable=_transform)
extract >> transform # the bitshift operator sets dependencies
>> and << are Airflow’s overloaded operators for “set downstream” and “set upstream.” a >> b means b depends on a. [a, b] >> c means c waits for both a and b. You’ll see these everywhere.
The {{ ds }} and {{ data_interval_end }} syntax is Jinja templating. Airflow renders these at task runtime, replacing them with values for the specific task instance. Many operator fields support templating (check the operator’s template_fields attribute). This is how parameters get the run-specific date without hardcoding it.
The typical workflow
In a real codebase:
- You write DAGs in Python files in a
dags/directory in a Git repo. One DAG per file is the norm. - CI/CD deploys the DAGs to your Airflow environment — typically by syncing the repo to a shared filesystem (NFS, EFS, GCS Fuse) or via
git-syncsidecar in Kubernetes. - The DAG processor sees the new files and parses them within ~30 seconds.
- The scheduler creates DAG runs at the appropriate times.
- The executor runs tasks on workers (your Celery pool, your Kubernetes cluster, etc.).
- You monitor in the UI: Grid view to see runs over time, Graph view to see task structure, Logs to debug, Gantt view to see what was slow.
The most important commands and operations
# Trigger a one-off run, manually, with optional config
airflow dags trigger my_dag --conf '{"some_param": "value"}'
# Test a single task locally without writing to the DB
airflow tasks test my_dag my_task 2026-01-05
# Run a backfill across a range
airflow dags backfill my_dag --start-date 2026-01-01 --end-date 2026-01-10
# Pause/unpause
airflow dags pause my_dag
airflow dags unpause my_dag
# List DAGs / task state
airflow dags list
airflow tasks states-for-dag-run my_dag <run_id>
# Connections — never put credentials in DAG code
airflow connections add postgres_warehouse \
--conn-type postgres \
--conn-host db.internal \
--conn-login app_user \
--conn-password '${WAREHOUSE_PASSWORD}'
airflow tasks test is gold. It runs your task code in your shell, with the full Airflow context, without touching the metadata DB or scheduler. It’s the right way to debug a task locally.
Idioms experienced people use
Set defaults at the DAG level. Don’t repeat retries=2 on every task. Use default_args.
Use Connections, not env vars or hardcoded URLs. PostgresHook("warehouse") resolves at runtime. The connection lives encrypted in the DB or in a Secrets Backend (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault — configured in airflow.cfg).
Never put data through XCom. Put data on S3/GCS/your warehouse and put the pointer through XCom. (return f"s3://bucket/path/{ds}/data.parquet".) See Section 7.
Use Jinja templating for dates. Don’t compute “yesterday” in Python code at task runtime — let Airflow inject data_interval_start so the task is repeatable for any historical run.
Use data_interval_start and data_interval_end. Avoid ds and execution_date. The ds macro returns the logical date as a string, which is the start of the interval but is named like an execution timestamp. People misuse it constantly. The interval bounds are unambiguous.
One DAG per file. Yes, you can put multiple DAGs in one file. Don’t. Each file is parsed by a single processor, so combining DAGs in one file serializes their parsing.
Tag your DAGs. tags=["finance", "daily"] lets you filter in the UI.
Forward-reference: in Section 5 we’ll see why all of these idioms aren’t style preferences — they’re consequences of how the scheduler works.
5. The Mental Model
Four ideas that make Airflow’s behavior predictable.
Core Idea 1: A DAG file is a Python program that runs, on a schedule, in the scheduler’s process.
This is the single most important fact about Airflow and the source of most surprises.
When the scheduler (or DAG processor in Airflow 3) parses your DAG file, it imports the file as a Python module. Top-level code executes. Every time the file is parsed. By default, that’s every 30 seconds. If you have 500 DAG files, that’s 500 imports happening continuously, on a schedule, on the same machine that’s trying to schedule tasks.
This predicts:
- Top-level
requests.get()will hammer your API every 30 seconds. If you call an external API at the top of a DAG file to “configure” the DAG, you’ve built an unintentional load test against that API. - Top-level
Variable.get()queries the metadata DB on every parse. Airflow Variables are stored in the database. CallingVariable.get("foo")at the top level means an extra DB query per DAG file per parse cycle. Across hundreds of DAGs, this saturates connections. - Heavy imports slow everything down.
import tensorflowat the top of a DAG file means tensorflow is imported every parse. The DAG file should import lightly; heavy imports go inside the task function, where they only run when the task runs. - Random values at the top level cause “DAG version inflation.” If you put
datetime.now()oruuid.uuid4()in a DAG or task constructor, the serialized DAG looks different on every parse, and Airflow records a new DAG version every time. Cumulatively, this can add millions of rows to your metadata DB.
The rule: the body of the DAG file should build the DAG object and nothing else. No I/O. No DB queries. No expensive imports. No clever tricks that read external state. Build the structure; defer all real work to inside the task functions, which only execute when the scheduler tells a worker to run them.
Core Idea 2: Tasks are defined statically, but executed dynamically.
A DAG, after parsing, is a static graph of tasks and dependencies. A task is abstract — “this is the transform task.” It doesn’t know what date it’s running for. It doesn’t know what the previous run did. It is a template.
A task instance is the concrete thing: “the transform task for the DAG run with logical date 2026-01-05.” This is what actually executes. State lives at the task-instance level, not the task level. When you “clear a task” in the UI, you’re clearing one or more task instances, not the task itself.
This predicts:
- Idempotency is your problem. Airflow will rerun a task instance any time you clear it. If your task does
INSERT INTO daily_metrics ...instead ofDELETE WHERE date = {{ ds }}; INSERT ..., you’ll get duplicates every time. Airflow does not enforce idempotency. Airflow assumes idempotency. - Tasks should not share state via local files. Two task instances of the same task might run on different workers. The disk state of one is invisible to the other. Use object storage or the database.
- Backfill is just “create a bunch of task instances.” When you backfill, Airflow creates task instances for the historical dates and runs them through the same machinery. This is why backfill works at all — there’s no special “backfill mode,” just task instances getting scheduled.
Core Idea 3: The scheduler trusts the database, not its own memory.
Airflow’s scheduler is more or less stateless. State lives in the metadata database. The scheduler’s job is to keep asking the database: “given the DAGs I know about, the runs that have happened, the task states, and the schedule — what should run next?” Then it writes its decisions back to the database, and the executor reads them.
This is why you can run multiple schedulers (Airflow 2.0+) for high availability — they all read and write the same database, and use row-level locking to coordinate.
This predicts:
- The metadata DB is the bottleneck. As you scale, you’ll hit connection limits before you hit CPU limits on the scheduler. PgBouncer is mandatory for any serious Postgres-backed deployment.
- Stopping the scheduler doesn’t stop tasks. Tasks already dispatched to workers keep running. The scheduler just stops creating new ones.
- You can rebuild a scheduler from nothing. Lose your scheduler box; a new one comes up, queries the DB, and continues where the old one left off. Lose the database; you’ve lost everything.
- Race conditions you’d expect in a distributed system show up here. Two schedulers, both deciding which task to dispatch — Airflow handles this correctly via SELECT FOR UPDATE, but understanding that the DB is the synchronization point helps you reason about edge cases.
Core Idea 4: The data interval ends; then the DAG runs.
A daily DAG with logical date 2026-01-04 has data interval [2026-01-04, 2026-01-05), and runs at the end of the interval — shortly after midnight on the 5th.
The reasoning: a daily ETL is processing the data for January 4th. That data isn’t fully available until January 4th is over. So the run executes after the interval closes.
This predicts:
- A DAG with
start_date=datetime(2026, 1, 1)andschedule="@daily"does not run on January 1st. It runs at the end of the first interval — January 2nd. New users always get this wrong. logical_date≠ “when the task ran.” It’s “the start of the interval this run represents.” If you want “today’s date” in aLIVEtask, you actually wantdata_interval_end, notds. The Medium articles screaming about “yesterday’s data” are all about this.- Manual triggers don’t fit cleanly into the interval model. When you press “Trigger DAG” in the UI, the data interval is computed somewhat ad-hoc, and
logical_datemay not matchdata_interval_startfor cron-like timetables. This is a known wart.
When in doubt: use data_interval_start for “the beginning of the period I’m processing” and data_interval_end for “the end of the period I’m processing.” Stop using ds, execution_date, and logical_date in production code unless you specifically mean those things.
6. The Architecture in Plain English
Let’s walk through what happens when you deploy a DAG and a task runs.
You drop my_dag.py into the dags/ folder. The folder is mounted (or git-synced) to every Airflow component.
The DAG processor wakes up periodically. It scans dags/ for changed files. For each one, it spawns a subprocess that imports the file as a Python module. The subprocess collects all DAG objects defined in the module’s globals, validates them (no cycles, valid schedule, etc.), serializes them to JSON, and writes them to the serialized_dag table in the metadata database. The subprocess dies when done. (Subprocess isolation matters — a DAG file that segfaults Python doesn’t take down the processor.)
The scheduler runs in a continuous loop. Every few seconds it does roughly this:
-
Create DAG runs. For each DAG, look at its schedule, the latest existing DAG run, and the current time. If a new interval has elapsed and no DAG run exists for it, create one (and create task instances for all its tasks, in
scheduledstate). -
Find runnable task instances. Look for task instances in
scheduledstate whose upstream dependencies are all met (according to their trigger rule) and whose pool/concurrency limits aren’t exceeded. Move them toqueuedstate. -
Dispatch queued tasks. Hand the queued task instances to the executor. The executor’s
execute_async()is called.
The executor picks them up. What happens here depends on which executor:
-
LocalExecutor spawns a subprocess on the scheduler box that runs
airflow tasks run <dag_id> <task_id> <run_id>. This subprocess loads the serialized DAG, instantiates the operator, calls itsexecute()method. -
CeleryExecutor publishes a message to Redis or RabbitMQ saying “run this task.” A separate
celery workerprocess, on a different machine, picks up the message and runsairflow tasks run. The number of concurrent tasks is controlled by the number of Celery workers and their concurrency settings. -
KubernetesExecutor asks the Kubernetes API to create a pod with a specific image and command (
airflow tasks run ...). The pod starts, runs the task, writes logs, and terminates. One pod per task instance.
Inside whichever process is actually running the task, the operator’s execute() method is invoked. The task’s logs go to local disk (and optionally to S3/GCS/Elasticsearch via the configured remote logging). The task’s return value is pushed to XCom (if do_xcom_push=True). The state transitions are reported back: running → success or failed.
In Airflow 3, task code does not talk directly to the metadata database anymore. It talks to the API server (a FastAPI service) over HTTP, with JWT-authenticated requests. The API server brokers everything: XCom reads/writes, variable lookups, state updates, log writes. This isolates user code from the database, enables remote workers (the Edge Executor), and prevents user code from mutating Airflow internals — a major security upgrade.
Meanwhile the triggerer is busy with deferrable operators. When a deferrable operator says “I need to wait for X” via self.defer(...), the worker process exits cleanly (releasing its slot). The trigger object is serialized into the DB. The triggerer process picks up trigger objects and runs their async run() methods inside an asyncio event loop. One triggerer process can hold thousands of waits simultaneously — far more than worker slots could. When the trigger fires, the operator is resumed: a worker is selected, the operator’s continuation method is called with the trigger’s payload, and execution proceeds.
The webserver (Airflow 2) or API server + React UI (Airflow 3) reads the metadata DB to render the UI. Grid view, Graph view, logs — all DB queries. The UI doesn’t directly control execution; it just sets values in the DB (e.g., “clear this task instance,” “pause this DAG”) that the scheduler reacts to on its next loop.
Where state lives — and this is the punchline:
- DAG structure: serialized JSON in the metadata DB.
- DAG run state, task instance state: rows in the metadata DB.
- XComs: rows in the metadata DB (or, with a custom backend, an external store).
- Connections, Variables: encrypted rows in the metadata DB (or in the configured Secrets Backend).
- Logs: local disk, or S3/GCS/Elasticsearch.
- Task code state during execution: in-process memory of the worker, gone when the task ends.
- Heartbeats: rows updated periodically in the DB by each component.
The metadata DB is the universe. Everything else is a process that reads from it, writes to it, or shovels work between it and somewhere else.
7. The Things That Bite You
Each of these is a real production scar.
7.1 Top-level code runs every 30 seconds
What you’d expect: the DAG file runs once at deploy time.
What actually happens: Airflow’s DAG processor parses the file every min_file_process_interval (default 30s) and re-imports it. Top-level code runs each time. (See Mental Model 1.)
Why this is brutal: put requests.get(...) at the top of a DAG file to “fetch config” and you’ve built a continuous loadgen against that API. Put Variable.get("config") at the top and you’ll add 2-3 DB queries per parse cycle, which over hundreds of DAGs takes down your metadata DB. The Cloud Composer team has documented cases where this slows DAG parse times from milliseconds to a minute, which then causes scheduler lag, which then causes missed schedules.
How to handle it: keep the DAG file body to imports + DAG/task definitions only. Move all I/O, DB queries, network calls, and heavy computation into task callables (which run only when the task runs, not when the file is parsed). Use Jinja templates ({{ var.value.my_var }}) instead of Variable.get() at the top level — Jinja resolves at task runtime.
7.2 execution_date ≠ when the task executed
What you’d expect: “execution date” is when the DAG actually ran. Looking at execution_date in your task should give you “now.”
What actually happens: execution_date (renamed logical_date in 2.2+) is the start of the data interval. For a daily DAG, that’s the day before the run actually executes. (See Mental Model 4.)
The bug it causes: you write dataset_path = f"data/{ds}.parquet" thinking this gives you today’s file. For the run that actually executes on January 5th, ds is 2026-01-04 — yesterday. You’re writing yesterday’s date as today’s filename and silently processing the wrong window. People discover this weeks later when reports look stale.
How to handle it: use data_interval_start and data_interval_end explicitly. They’re unambiguous. If you really want “now,” use data_interval_end (the end of the interval is when the run executes). Mentally retire ds and execution_date from your vocabulary.
7.3 start_date in the past + catchup=True = chaos
What you’d expect: I deployed a DAG with start_date=datetime(2024, 1, 1), so it’ll start running today.
What actually happens: Airflow creates a DAG run for every missed interval between start_date and now and tries to run them all. With @hourly and a year-old start_date, that’s 8,760 backfill runs queued instantly. Your warehouse falls over. Your on-call gets paged at 3am.
Why: catchup=True is the legacy default. The scheduler treats any unfilled interval as something to fill.
How to handle it: set catchup=False on every DAG. Make it a default in your team’s DAG factory. Use explicit airflow dags backfill commands when you actually want to fill history. Set max_active_runs=1 while you’re at it, so a slow run doesn’t allow ten runs to pile up in parallel.
7.4 Naive sensors hog worker slots
What you’d expect: a sensor waits for an event. Cheap. Lightweight.
What actually happens: the default sensor mode is poke, which means the sensor occupies a full worker slot, sleeps for poke_interval seconds, wakes up, checks the condition, sleeps again. A sensor waiting 4 hours for a file to appear consumes one worker slot for 4 hours. Run 50 such sensors and your 64-slot Celery cluster has 50 workers stuck doing nothing, while your real ETL tasks queue.
Why: poke is a synchronous loop in the worker process. By design, it holds the slot for its whole runtime.
How to handle it: for any sensor that might wait more than a couple of minutes, switch to mode='reschedule' (the sensor re-queues itself between checks, freeing the slot) or to a deferrable=True variant (the wait is handed to the triggerer’s asyncio loop, freeing the slot entirely). Deferrable is strictly better when available — reschedule still creates a new task instance row per check, which can balloon the metadata DB. With deferrable, one triggerer can hold tens of thousands of waits on one CPU.
7.5 XCom is not a data bus
What you’d expect: I can return my pandas DataFrame from one task and have the next task pick it up. That’s what XCom is for, right?
What actually happens: XCom values are stored as rows in the metadata DB. Send a 50MB DataFrame and you’ve now got a 50MB row in your Postgres. Do this on every run for many DAGs and the DB fills up; queries slow down; the scheduler lags; everything degrades. Sometimes Airflow happily lets you push a value that exceeds the DB’s column size limit and you get cryptic serialization errors.
Why: XCom was designed for small coordination signals — a row count, a file path, a status flag, a list of IDs. It was not designed to be a data pipeline.
How to handle it: put the data in object storage (S3/GCS/Azure Blob), and put only the path through XCom. The downstream task reads the path from XCom and pulls the data from object storage. If you genuinely want to pass medium-sized objects through your DAG pipeline plumbing, configure a custom XCom backend that writes XCom values to S3 transparently — Airflow supports this out of the box. But the rule remains: XComs are pointers, not payloads.
7.6 Sharing data via local files between tasks doesn’t work
What you’d expect: task A writes /tmp/out.json, task B reads /tmp/out.json. Same machine, right?
What actually happens: with anything other than LocalExecutor or SequentialExecutor, task A might run on worker-1 and task B on worker-7. The file doesn’t exist on worker-7. Task B fails with FileNotFoundError, but only sometimes — when both tasks happen to land on the same worker, it works, which is worse than always failing because it makes the bug intermittent.
How to handle it: treat task boundaries as machine boundaries even when they aren’t. Read inputs from and write outputs to a system that’s reachable from anywhere — object storage, the warehouse, a database. Local disk is scratch space within a task, not a communication channel between tasks.
7.7 Changing start_date or schedule after deployment
What you’d expect: I’ll just edit the DAG file and push. It’s just code.
What actually happens: the scheduler’s understanding of “what intervals should have runs” depends on start_date and the schedule. When you change them, the scheduler may either decide that some past intervals were never run (and try to backfill them), or skip intervals it would have run, or get confused about which DAG runs map to which intervals. Existing DAG runs don’t migrate cleanly.
How to handle it: treat start_date as an immutable property of a DAG. If you really need to change scheduling, change the dag_id. my_pipeline_v2 is ugly but unambiguous. The old DAG history is preserved, the new DAG runs cleanly from its new start_date, and you don’t get gaslit by half-correct scheduling.
7.8 Trigger rules silently change DAG-run status
What you’d expect: if any task in a DAG fails, the DAG run is marked failed.
What actually happens: if your final task has trigger_rule='all_done' (i.e., “run me regardless of upstream success”) and it succeeds, the DAG run can be marked success even though earlier tasks failed. The “leaf” task’s success determines the overall run state. Your alerting fires on “failed runs” — and these silently slip through.
How to handle it: if you have cleanup or notification tasks with non-default trigger rules, add a “watcher” task whose only job is to fail the DAG run when any upstream task fails. The Airflow community tests use this exact pattern. Or audit your DAG run states using task_instance state, not DAG run state, when something seems off.
7.9 The Postgres connection storm
What you’d expect: my Airflow scheduler is one process; one process = a few connections to Postgres.
What actually happens: Airflow scheduler runs many subprocesses (file processors, executor handlers, etc.), each of which opens its own DB connection pool. A medium-sized deployment can easily open 200+ connections. Postgres’ default max_connections is 100. You hit the limit; new connections fail; the scheduler crashes.
How to handle it: put PgBouncer in front of Postgres for any non-trivial deployment. The Helm chart bundles PgBouncer for exactly this reason. Without it, you’ll spend a weekend chasing “FATAL: too many connections” messages.
7.10 The “DAG works but doesn’t show up in the UI” trap
What you’d expect: I dropped a file in dags/, the UI should show it.
What actually happens: the UI shows DAGs that have been parsed and whose serialized form is in the DB. If the DAG file has an import error, a syntax error, or takes longer than dagbag_import_timeout to parse, parsing fails silently. The DAG never appears, with no obvious error in the UI.
How to handle it: check the DAG processor logs (airflow dag-processor logs or the UI’s “Import Errors” page). Better, run python my_dag.py locally — if it doesn’t error and prints nothing, the file is parseable. If parsing succeeds but is slow, time it: time python my_dag.py. Aim for sub-second parse times. If your DAG takes 10+ seconds to parse, the scheduler is losing real cycles to your file alone.
8. The Judgment Calls
Where “knowing Airflow” graduates to “having taste.”
8.1 Which executor do you actually pick?
The four real choices:
-
LocalExecutor — runs tasks as subprocesses on the scheduler box. Best for: small deployments, dev environments, single-machine production where total daily task count is low and tasks are quick. Fast startup, simple ops, no extra services.
-
CeleryExecutor — workers pull tasks from a Redis/RabbitMQ queue. Best for: most medium-to-large deployments running many short-to-medium tasks per day. Mature, well-understood, fast task startup (~1s). The default for serious self-hosted Airflow.
-
KubernetesExecutor — each task runs in its own pod. Best for: heterogeneous workloads where different tasks need different images/resources/secrets. Strong isolation. But pod startup is 5-30 seconds, so it’s a poor fit for many fast tasks.
-
Hybrid (Celery + Kubernetes) — common in mature deployments. Default to Celery for fast tasks; use
KubernetesPodOperator(or per-task executor selection in Airflow 2.10+) for heavy/isolated tasks.
The signal: count your task duration distribution. Many tasks under 1 minute? Celery. Few tasks over 10 minutes that need different runtime environments? Kubernetes. Mostly fast tasks but a handful that need GPUs or specific images? Hybrid. Don’t pick KubernetesExecutor for everything just because containers feel modern — the overhead will hurt you.
8.2 TaskFlow API or classic operators?
-
TaskFlow (
@taskdecorators) is great for: pipelines that are largely Python, where return values flow naturally between tasks. Cleaner code, less boilerplate, automatic XCom plumbing. -
Classic operators are great for: orchestrating non-Python systems (
PostgresOperator,KubernetesPodOperator,DbtRunOperator), and for cases where the task is fundamentally about calling an external system. The operator abstraction was designed for this.
Don’t force TaskFlow on a task whose real job is “run a SQL query in Snowflake.” Use the appropriate operator. Mix freely. Most production DAGs have both.
8.3 Catchup on or off?
Off, by default. Always. Turn catchup on only when:
- Your tasks are fully idempotent across history (delete-then-insert by date, or merge/upsert keyed by date).
- You’re confident the warehouse can handle N parallel historical runs.
- You actually want history filled in automatically.
If you want to backfill, use airflow dags backfill explicitly — you control the date range, the parallelism, the rerun-failed flag. Catchup is the loaded shotgun pointed at a beginner engineer’s first deploy.
8.4 Sensors vs. external triggering
A sensor that waits for a file is convenient but expensive: the sensor task occupies a slot (or DB row, or triggerer attention) for the entire wait. An alternative: the upstream system triggers the downstream Airflow DAG when the file arrives — via the REST API, an Asset, or an event-driven schedule.
The signal: how long is the wait, and how many sensors do you have? A 5-minute deferrable sensor waiting for a daily file? Fine. 200 sensors all polling for hours? Move to triggered DAGs (Assets in modern Airflow, or external API triggers). If you have a triggerer running, deferrable sensors are usually fine up into the thousands; without the triggerer, every poke is a worker slot or a DB row.
8.5 One big DAG or many small ones?
A monolithic DAG with 200 tasks is hard to operate (one slow task delays everything; reruns are clumsy; the UI gets sluggish; XCom volume balloons). Multiple smaller DAGs are easier to reason about, but you have to coordinate them via Assets, ExternalTaskSensor, TriggerDagRunOperator, or external triggers.
The signal: ask “what’s the unit of independent re-runnability?” If steps 1-50 always need to be redone together when something fails, they belong in one DAG. If step 25 produces a stable artifact that step 26 onwards can reliably consume independently, that’s a natural cut point — split the DAG, link them with Assets, and let each be reran independently.
8.6 Dynamic task mapping vs. a loop in Python
Modern Airflow supports dynamic task mapping (partial(...).expand(arg=[1, 2, 3])), creating a task instance per element at runtime. This gives you per-element retries and per-element logs in the UI.
The alternative is a single task that loops over the elements internally.
The signal:
- Many elements (>1000), each cheap (sub-second)? Don’t map — the scheduler overhead per task instance dominates. Loop inside one task. Airflow 3.3+ adds dynamic task iteration for this exact case.
- Modest number of elements (10-100), each slow or independently failable (>5s, calling external systems)? Map. You get per-element observability and individual retries.
- Tens of thousands of elements? Neither. You’re using Airflow as a data processing engine. Push the work to Spark/Beam/dbt and have Airflow orchestrate the job.
8.7 Where do credentials live?
- Worst: hardcoded in the DAG file. Always wrong.
- Bad: stored as Airflow Variables. Better than hardcoded, but variables are visible to anyone with UI access.
- Better: Airflow Connections, stored encrypted in the metadata DB. Scoped, typed, easy to rotate.
- Best: a Secrets Backend (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) configured via
airflow.cfg. Connections and Variables are looked up from the backend at runtime; the DB never sees them. Centralized rotation, audit logging, fine-grained access control.
For any production deployment touching customer data or third-party APIs, use a Secrets Backend. The added complexity is small; the security upgrade is large.
8.8 When NOT to use Airflow
An engineer’s most valuable skill: knowing when to reach for something else. Airflow is not the answer when:
-
Your workload is real-time / streaming. Airflow is batch. Tasks have schedules, not subscriptions. Use Kafka + Flink / Spark Streaming / Materialize.
-
You have a small, stable set of jobs and a tiny team. A managed scheduler (GitHub Actions cron, Cloud Scheduler + Cloud Run, AWS Step Functions) might be cheaper to operate than maintaining an Airflow deployment. The Airflow ops cost is real.
-
Your pipeline is fundamentally data-asset-centric. If you spend most of your day reasoning about “is the orders table fresh?” rather than “did the orders job run?”, Dagster’s asset-first model fits your mental model better. Airflow has Assets now (renamed from Datasets), but the core abstraction is still tasks.
-
You need rapid local development and testing. Airflow’s “scheduler + DB + worker + UI” minimum stack makes iteration painful compared to Prefect’s “run a flow as a Python function locally” model.
-
You’re orchestrating only dbt. dbt has its own DAG. Wrapping dbt in Airflow is common but often adds more ceremony than value for small teams. dbt Cloud or a thin scheduler may be enough.
That said: when your problem is “I have many heterogeneous batch workloads with complex dependencies, and I need a battle-tested system to manage them across years of evolution,” Airflow is still hard to beat. It’s the boring, reliable, vast-ecosystem choice.
8.9 Airflow 2 or Airflow 3?
If you’re starting fresh in 2026: Airflow 3. The Task SDK, separated DAG processor, API-mediated database access, real DAG versioning, modern React UI, and Asset partitioning are real improvements. Airflow 2.x enters limited support in April 2026 — security patches only.
If you have a large Airflow 2 deployment: the migration is real work. Direct DB access from tasks is gone (any custom operator using sessions needs refactoring). Operators moved to provider packages. The webserver split into API server + DAG processor. Plan weeks, not days. The Ruff AIR301/AIR302 rules will help find breaking patterns.
9. The Commands and APIs That Actually Matter
The 80/20 reference, grouped by what you’re trying to do.
Defining DAGs and tasks
from airflow.sdk import dag, task # Airflow 3 (use airflow.decorators in 2.x)
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.amazon.aws.operators.s3 import S3CopyObjectOperator
@dag(
schedule="@daily",
start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
catchup=False,
max_active_runs=1, # don't let a slow run stack up
default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
tags=["finance", "daily"],
)
def my_pipeline(): ...
The flags that matter from day one:
catchup=False— almost always.max_active_runs=1— prevents pile-up if a run is slow.default_args={"retries": 3}— flaky world; tasks should retry.tags=[...]— UI filtering at scale.start_datewith explicit timezone (usependulum, notdatetime— Airflow’s preference).
The Jinja templates you’ll use weekly
{{ data_interval_start }} # datetime, start of interval — USE THIS
{{ data_interval_end }} # datetime, end of interval — AND THIS
{{ ds }} # logical_date as YYYY-MM-DD string (start of interval)
{{ ts }} # logical_date as ISO timestamp
{{ run_id }} # unique ID for this DAG run
{{ task_instance_key_str }} # dag_id__task_id__YYYYMMDD
{{ var.value.my_variable }} # Airflow Variable, resolved at runtime (preferred over Variable.get)
{{ var.json.my_variable }} # parse Variable as JSON
{{ conn.my_conn.password }} # access Connection fields (use sparingly)
{{ macros.ds_add(ds, 7) }} # date arithmetic
{{ var.value.foo }} and {{ conn.bar.password }} are the templated equivalents of Variable.get and BaseHook.get_connection. They’re safe at the operator field level because they resolve at task runtime, not parse time.
Setting up dependencies
a >> b # b depends on a
a << b # a depends on b
[a, b] >> c # c depends on both a and b (fan-in)
a >> [b, c] # b and c both depend on a (fan-out)
chain(a, b, c, d) # equivalent to a >> b >> c >> d, more readable in long chains
cross_downstream([a, b], [c, d]) # every task in left set is upstream of every task in right set
For TaskFlow: dependencies are inferred from the call graph. load(transform(extract())) builds extract → transform → load automatically.
CLI commands you’ll actually use
airflow tasks test <dag_id> <task_id> <execution_date> # debug a task locally
airflow dags trigger <dag_id> --conf '{"k": "v"}' # manual trigger
airflow dags backfill <dag_id> --start-date X --end-date Y # explicit historical run
airflow dags pause <dag_id>
airflow dags unpause <dag_id>
airflow dags list-runs --dag-id <dag_id>
airflow tasks states-for-dag-run <dag_id> <run_id>
airflow connections add / get / list / delete
airflow variables set / get / list / delete
airflow db check # sanity-check the metadata DB
airflow info # full env dump for support tickets
airflow tasks test is the most underused command. It runs your task code in your shell with the full Airflow context, without writing to the DB or scheduler. The right way to debug.
Configuration knobs that matter
In airflow.cfg or as AIRFLOW__SECTION__KEY env vars:
[core]
executor = CeleryExecutor # or KubernetesExecutor / LocalExecutor
parallelism = 32 # max concurrent TIs across whole instance
max_active_tasks_per_dag = 16
max_active_runs_per_dag = 1 # safe default
[scheduler]
min_file_process_interval = 30 # raise to 60-120 if DAG count is high
parsing_processes = 4 # tune to CPU
dag_dir_list_interval = 300 # how often to look for new DAG files
[database]
sql_alchemy_conn = postgresql+psycopg2://...
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = 10
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections"}
XCom
# Push (auto)
@task
def producer():
return {"path": "s3://bucket/file.parquet"} # auto-pushed as XCom 'return_value'
# Pull (auto via TaskFlow)
@task
def consumer(payload: dict): # payload is the XCom from producer
return payload["path"]
# Manual push/pull (classic style)
def callable_(**ctx):
ctx["ti"].xcom_push(key="my_key", value=42)
val = ctx["ti"].xcom_pull(task_ids="other_task", key="my_key")
Rule: pointers, not payloads.
Trigger rules
@task(trigger_rule="all_success") # default: run if all upstream succeeded
@task(trigger_rule="all_done") # run regardless of upstream state (cleanup, notifications)
@task(trigger_rule="one_failed") # run if any upstream failed (alerting)
@task(trigger_rule="none_failed") # run if no upstream failed (allows skipped)
@task(trigger_rule="all_skipped") # run if every upstream was skipped
Most useful for fan-in cleanup tasks (all_done) and conditional alerting (one_failed).
10. Writing Sensors
A sensor is just an operator whose execute() loops, calling a poke() method until it returns truthy or times out. That’s the entire abstraction. Once you see it, sensors stop being mysterious — they’re 30 lines of code with a strong convention.
The convention is BaseSensorOperator. You subclass it, override poke(), return True when the condition is met, return False (or a PokeReturnValue) when it isn’t. The base class handles the loop, the timeout, and the mode (poke vs reschedule).
from airflow.sdk.bases.sensor import BaseSensorOperator # Airflow 3 path
from airflow.utils.context import Context
class WarehousePartitionSensor(BaseSensorOperator):
"""Wait for a date partition to appear in the analytics warehouse."""
template_fields = ("table", "partition_date")
def __init__(self, *, table: str, partition_date: str, conn_id: str = "warehouse",
**kwargs):
super().__init__(**kwargs)
self.table = table
self.partition_date = partition_date
self.conn_id = conn_id
def poke(self, context: Context) -> bool:
from my_lib.warehouse import WarehouseHook # heavy import deferred
hook = WarehouseHook(self.conn_id)
count = hook.get_first(
f"SELECT count(*) FROM {self.table} WHERE dt = %s",
parameters=(self.partition_date,),
)[0]
self.log.info("Partition %s in %s has %d rows", self.partition_date, self.table, count)
return count > 0
Used like this:
wait_for_orders = WarehousePartitionSensor(
task_id="wait_for_orders",
table="raw.orders",
partition_date="{{ data_interval_start | ds }}", # Jinja templated
poke_interval=300, # check every 5 min
timeout=60 * 60 * 6, # give up after 6 hours
mode="reschedule", # FREE THE SLOT
soft_fail=False, # fail the DAG on timeout
)
That’s 95% of all sensors people write. Things to internalize about this template:
template_fields is mandatory if you want Jinja. If partition_date isn’t in template_fields, the string "{{ data_interval_start | ds }}" is passed to poke() literally — Airflow will only render the templates of declared fields. Forget this and you’ll spend 20 minutes debugging why your sensor is querying for partition '{{ ds }}'.
Defer heavy imports into poke(). The DAG file imports the sensor class at parse time; if your __init__ or top-level imports drag in a 200MB warehouse driver, every parse pays for it. (Mental Model 1.)
poke() should be cheap and fast. The point of a sensor is “is the condition met right now?” If the answer takes 30 seconds to compute, you’re misusing the abstraction — that’s a regular task, not a sensor. A poke that times out also pretends the condition is unmet, leading to silent retry storms.
Choose your mode deliberately. poke (default) keeps a worker slot for the entire wait — fine for sub-minute waits, catastrophic for hour-scale waits. reschedule releases the slot between checks but creates a task_reschedule row per check. With poke_interval=60 and a 6-hour wait, that’s 360 DB rows per sensor per run. With 50 such sensors per day, you’re adding ~18,000 rows daily to the metadata DB. That’s manageable, but it adds up.
Writing the deferrable variant
A deferrable sensor releases its worker slot and avoids the per-check DB rows by handing the wait to the triggerer. Triggers run in an asyncio loop; one triggerer process holds thousands of waits on a single CPU. If you have many sensors that wait minutes-to-hours, write deferrable variants. The pattern:
import asyncio
from typing import AsyncIterator
from airflow.triggers.base import BaseTrigger, TriggerEvent
class WarehousePartitionTrigger(BaseTrigger):
"""The 'wait' half — runs in the triggerer's asyncio loop."""
def __init__(self, table: str, partition_date: str, conn_id: str, poll_seconds: int = 300):
super().__init__()
self.table = table
self.partition_date = partition_date
self.conn_id = conn_id
self.poll_seconds = poll_seconds
def serialize(self) -> tuple[str, dict]:
# Triggers must be JSON-serializable so Airflow can persist & resume them.
return (
"my_pkg.triggers.WarehousePartitionTrigger",
{"table": self.table, "partition_date": self.partition_date,
"conn_id": self.conn_id, "poll_seconds": self.poll_seconds},
)
async def run(self) -> AsyncIterator[TriggerEvent]:
from my_lib.warehouse import AsyncWarehouseHook # MUST be async-capable
hook = AsyncWarehouseHook(self.conn_id)
while True:
count = await hook.fetch_count(self.table, self.partition_date)
if count > 0:
yield TriggerEvent({"status": "ready", "count": count})
return
await asyncio.sleep(self.poll_seconds)
class WarehousePartitionSensorAsync(BaseSensorOperator):
"""The 'operator' half — defers immediately and resumes on the trigger event."""
template_fields = ("table", "partition_date")
def __init__(self, *, table, partition_date, conn_id="warehouse",
poll_seconds: int = 300, **kwargs):
super().__init__(**kwargs)
self.table = table
self.partition_date = partition_date
self.conn_id = conn_id
self.poll_seconds = poll_seconds
def execute(self, context: Context):
# Note: we don't override poke(). We override execute() and defer immediately.
self.defer(
trigger=WarehousePartitionTrigger(
self.table, self.partition_date, self.conn_id, self.poll_seconds,
),
method_name="execute_complete",
)
def execute_complete(self, context, event):
if event["status"] != "ready":
raise AirflowException(f"Sensor failed: {event}")
self.log.info("Partition ready with %d rows", event["count"])
return event["count"]
What’s important here:
The trigger must use async/await, not blocking I/O. The whole point is that one event loop holds many waits. A time.sleep(60) in the trigger blocks every other trigger on the same triggerer. A requests.get() (sync HTTP) does the same. You need an async-capable client (aiohttp, asyncpg, etc.) — this is the main reason deferrable variants are harder to write than poke() sensors.
serialize() must round-trip. When the operator defers, the trigger gets pickled into the DB. When the triggerer picks it up — possibly minutes later, possibly on a different host, possibly after a triggerer restart — it deserializes via the path returned by serialize(). The serialized form must contain everything the trigger needs to resume; any in-memory state is gone.
Triggers can run in multiple places at once. Airflow has HA for triggerers; if a triggerer dies, its triggers are picked up elsewhere. Briefly, both might be running. Airflow de-duplicates the resulting events, but your trigger code must be safe to run concurrently against the same external system. Don’t, for example, use a trigger to consume a queue message — use it to observe state.
You don’t have to defer immediately. A common pattern: do one synchronous check first (cheap, often answers immediately), and only defer if the condition isn’t met. This avoids the trigger-DB round-trip when the partition already exists.
Sensor anti-patterns to refuse
Sensors that do work. A sensor’s job is to check a condition, not produce an output. If your sensor is also fetching the file it’s waiting for, split it: a sensor task that waits, then a separate operator task that fetches. This keeps poke() cheap and gives you accurate UI status (the wait task is yellow until the file appears; the fetch task is then green).
Sensors with poke_interval=1. Polling once per second hits external systems hard for marginal latency benefit. Most data systems update at minute-or-greater granularity; pick a poke_interval that matches reality (60-300s is typical).
Sensors with mode='poke' and no timeout. A poke-mode sensor with the default timeout (1 week) that fails to detect its condition will sit on a worker slot for a week. Always set a reasonable timeout. A reasonable timeout for “wait for today’s daily partition” is not 24 hours — it’s “the SLA past which this DAG is failed.” If the partition doesn’t show up by 6 hours past schedule, page someone; don’t quietly wait another 18 hours.
Sensors as cross-DAG dependencies. Using ExternalTaskSensor to wait for another DAG’s task is a common pattern but a fragile one — small changes to the upstream schedule break it. Prefer Assets (data-aware scheduling): the upstream DAG declares outlets=[Asset("orders_table")], the downstream DAG schedules on schedule=[Asset("orders_table")], and Airflow handles the dependency natively. The runtime semantics are clearer, the UI shows the dependency explicitly, and there’s no sensor to misconfigure.
11. Writing Custom Operators
You will be tempted to write a custom operator before you should. The community has built operators for almost everything: SQL warehouses, S3, GCS, Slack, Jira, dbt, Spark, Kubernetes pods, HTTP. Before writing one, search the providers.
When you genuinely should write one:
- You’re calling an internal system that no provider covers.
- You’re encapsulating a complex multi-step interaction with an external system that you’ll use across many DAGs.
- You’re standardizing a pattern (e.g., “every team writes the same idempotent warehouse-load logic”) and want the standardization enforced via a single class.
When you should not write one:
- For a one-off Python function. Use
@task. ThePythonOperator/TaskFlow exists for this. - To wrap a single SQL statement that an existing operator already handles. Configure the existing operator instead.
- To “make my DAG cleaner.” Custom operators are infrastructure code. They have to be tested, versioned, and maintained. The bar for paying that cost is “this logic is reused across many DAGs and would otherwise be copy-pasted.”
The BaseOperator contract
from airflow.sdk.bases.operator import BaseOperator # Airflow 3
from airflow.utils.context import Context
class WarehouseLoadOperator(BaseOperator):
"""Idempotently load a parquet file into a date-partitioned warehouse table."""
template_fields = ("source_path", "target_table", "partition_date")
template_ext = (".sql",) # if any field references a .sql file, render it
ui_color = "#bcd9e8" # purely cosmetic
def __init__(
self,
*,
source_path: str,
target_table: str,
partition_date: str,
conn_id: str = "warehouse",
partition_column: str = "dt",
**kwargs,
):
super().__init__(**kwargs)
self.source_path = source_path
self.target_table = target_table
self.partition_date = partition_date
self.conn_id = conn_id
self.partition_column = partition_column
def execute(self, context: Context):
# Lazy imports keep DAG parse fast.
from my_lib.warehouse import WarehouseHook
hook = WarehouseHook(self.conn_id)
self.log.info(
"Replacing partition %s=%s in %s from %s",
self.partition_column, self.partition_date,
self.target_table, self.source_path,
)
# Idempotency: delete the partition first, then load. Run both
# in a single transaction so a failure mid-way leaves the partition
# untouched, not half-replaced.
with hook.transaction() as tx:
tx.execute(
f"DELETE FROM {self.target_table} "
f"WHERE {self.partition_column} = %s",
(self.partition_date,),
)
rows_loaded = tx.copy_from_parquet(self.source_path, self.target_table)
self.log.info("Loaded %d rows into %s", rows_loaded, self.target_table)
return {"rows_loaded": rows_loaded, "partition": self.partition_date}
Things this example illustrates that aren’t obvious from the docs:
__init__ takes only kwargs. All Airflow operators use keyword-only arguments after *. This is a hard convention — it makes operator instantiation unambiguous and it’s enforced by BaseOperator’s metaclass.
Always call super().__init__(**kwargs). The base class consumes a long list of standard kwargs (task_id, retries, retry_delay, pool, trigger_rule, email_on_failure, etc.). If you intercept them, you break the operator’s integration with Airflow.
template_fields declares which constructor fields support Jinja. Anything not in this tuple is passed verbatim. template_ext is a related setting: if a field’s value ends with one of these extensions, Airflow reads the file and renders that — useful for sql_query="path/to/query.sql" patterns.
execute() is the one method you must implement. Its return value is automatically pushed to XCom under the return_value key (assuming do_xcom_push=True, the default). Keep this small — pointers, not payloads. (Section 7.5.)
The operator should be idempotent. Same input → same output, regardless of how many times it runs. The DELETE-then-INSERT pattern is the canonical example. If your operator can’t be idempotent (e.g., it sends email — sending the same email twice is bad), document that loudly and rely on Airflow’s at-most-once semantics carefully.
Heavy imports go inside execute(). The DAG file imports the operator class; the operator class imports its dependencies. If those dependencies are heavy (a warehouse SDK, ML libraries, etc.), the import cost gets paid every parse cycle. Lazy-import them in execute() so they only load when the task actually runs.
Hooks: the layer below operators
A hook is a thin Python wrapper around a Connection. The mental model: “operators are tasks; hooks are clients.” When you write a custom operator, you usually write a custom hook too — or, more often, your operator uses an existing hook.
from airflow.providers.common.sql.hooks.sql import DbApiHook
class WarehouseHook(DbApiHook):
conn_name_attr = "conn_id"
default_conn_name = "warehouse_default"
conn_type = "warehouse"
hook_name = "Internal Warehouse"
def get_conn(self):
conn = self.get_connection(self.conn_id) # parent class fetches Connection
from my_lib.warehouse import Client
return Client(host=conn.host, user=conn.login, password=conn.password,
database=conn.schema, port=conn.port or 5432)
The win of separating hooks from operators: the hook is reusable from @task functions, custom sensors, ad-hoc Python, and unit tests. The operator is just one of many call sites for the hook.
Templating user-supplied SQL
The most common bug in custom SQL operators is template injection. If you accept a partition_date parameter and inline it into SQL with f-strings, you’ve built an SQL injection vulnerability:
# WRONG
tx.execute(f"DELETE FROM t WHERE dt = '{self.partition_date}'")
# RIGHT
tx.execute("DELETE FROM t WHERE dt = %s", (self.partition_date,))
Jinja templating happens before execute() runs, so self.partition_date is already a fully-rendered string by the time you see it. That string can contain anything if a user passes anything (or if a clever upstream task XCom-pushes anything). Use parameterized queries.
Testing custom operators
Operators are normal Python classes. Test them as such. A good test rigs up the operator with a mocked hook, calls execute() with a minimal Context, and asserts the right hook calls were made:
from unittest.mock import MagicMock, patch
def test_warehouse_load_operator_replaces_partition():
op = WarehouseLoadOperator(
task_id="test", source_path="s3://b/k", target_table="raw.orders",
partition_date="2026-01-04",
)
with patch("my_pkg.operators.WarehouseHook") as mock_hook_cls:
mock_tx = MagicMock()
mock_hook_cls.return_value.transaction.return_value.__enter__.return_value = mock_tx
mock_tx.copy_from_parquet.return_value = 1234
result = op.execute(context={})
mock_tx.execute.assert_called_once_with(
"DELETE FROM raw.orders WHERE dt = %s", ("2026-01-04",)
)
assert result == {"rows_loaded": 1234, "partition": "2026-01-04"}
This is fast (no Airflow runtime needed) and it actually tests the logic. Pair it with a smoke test using airflow tasks test against a real (dev) warehouse for the integration path.
Packaging custom operators
If you have more than three or four custom operators, put them in a Python package, publish to your private PyPI (or just install from a Git URL), and depend on it from your Airflow image. This is more work than dropping operator files into dags/ next to the DAGs that use them, but:
- The operators get tested independently, in their own CI.
- DAG repos depend on a versioned operator package, so operator changes are explicit.
- The DAG processor doesn’t have to re-parse operator code as if it were DAG code.
- Multiple Airflow deployments (dev, staging, prod) can run different operator versions during migrations.
A dags/ folder with operator code mixed in is fine for two engineers and ten DAGs. Past that, separate them.
Custom operators vs. TaskFlow + helper functions
A frequent question: I have logic I want to reuse across DAGs. Do I write a custom operator, or do I write a regular Python function and call it from @task-decorated wrappers?
The signal:
- Custom operator when the logic is genuinely operator-shaped: it has Airflow-specific concerns (templating, retries, on-failure callbacks tied to context, complex state passed through the
Contextobject). When you want it to appear as a first-class type in the UI (“ah, this task is aWarehouseLoadOperator”). - Helper function called from
@taskwhen the logic is “plain Python that does a thing”: data transformation, business logic, calls to a hook. TaskFlow gives you XCom plumbing for free, the @task decorator handles the operator boilerplate, and reviewers can read the function in isolation.
When in doubt, start with a helper function. Promote to a custom operator when you have three places that would otherwise duplicate it and the logic is tangled with Airflow’s task context.
12. How It Breaks
When something is wrong, the symptoms cluster into a few patterns. Recognize them and you save hours.
Symptom: “My DAG isn’t showing up in the UI”
Most likely: parse error in the DAG file. Check:
python /path/to/dags/my_dag.py # any output here means a parse-time error or print
Also check the UI’s “Import Errors” page (top of the DAGs list). The DAG processor logs (airflow dag-processor in 3.x) will show parsing failures.
If the file parses fine but the DAG doesn’t show: the file is in the wrong directory, the .airflowignore is excluding it, or dagbag_import_timeout killed parsing.
Symptom: “My DAG shows up but tasks aren’t running”
Walk down this list, in order:
- Is the DAG paused? Check the toggle in the UI. New DAGs are often deployed paused.
- Is the scheduler running?
ps aux | grep scheduler. If it’s dead, nothing schedules. - Is the executor configured and healthy? For CeleryExecutor, are workers running? Is Redis up?
celery -A airflow.executors.celery_executor inspect activeshows running tasks. - Is the data interval complete? Remember: a daily DAG with
start_date=2026-01-05first runs after2026-01-06 00:00. If you deploy at 10am on the 5th, nothing runs that day. - Are tasks stuck in
queued? That’s the executor not picking them up. For Celery: workers down or worker concurrency exhausted. For LocalExecutor: schedulerparallelismexhausted. - Are tasks stuck in
scheduled? Pool slots exhausted,max_active_tasks_per_daghit, ordepends_on_pastblocking on a previous failed run.
Symptom: “Task runs but fails with weird errors”
Read the logs first. Always. Either via the UI or cat $AIRFLOW_HOME/logs/dag_id=.../task_id=.../run_id=.../attempt=1.log.
Common patterns:
FileNotFoundErroron a path another task wrote — you’re using local disk for inter-task communication. Move to object storage. (Section 7.6.)ConnectionErrorto your warehouse — usually credentials. Runairflow connections get my_conn_idto verify the connection definition. Hit the warehouse from a worker shell with the same credentials.TemplateNotFound— Jinja can’t find the template file. Checktemplate_searchpathon the DAG. By default it’s relative to the DAG file.UNIQUE constraint failedorduplicate key— your task is running twice (a retry, a backfill, a cleared instance) and isn’t idempotent. Fix the task to be safe to rerun (delete-then-insert, MERGE/UPSERT keyed by date).Task exited with return code Negsignal.SIGKILL— the OS killed your process, almost always OOM. Either give the worker more memory, push the heavy work to an external system (Spark, EMR), or split the task into smaller pieces.
Symptom: “Scheduler is slow / DAGs schedule late”
- Check DAG parse times. If your slowest file takes >10s to parse, the scheduler is bleeding cycles. Profile with
python -c "import time; t=time.time(); import my_dag; print(time.time()-t)". Common culprit: heavy imports or top-level I/O. (Sections 5.1, 7.1.) - Check metadata DB. Slow queries on
task_instanceindicate the table has grown huge. Runairflow db clean --tables task_instance,dag_run --clean-before-timestamp <date>periodically. - Check connection count.
SELECT count(*) FROM pg_stat_activityon Postgres. If you’re nearmax_connections, you need PgBouncer. - Check scheduler concurrency. Add a second scheduler — they cooperate via the DB.
Symptom: “Backfill is overwhelming the warehouse”
You launched a backfill and now the warehouse is on fire. Symptoms: every task is slow, downstream consumers are timing out.
- Use
--max-active-runsto limit parallelism:airflow dags backfill my_dag --max-active-runs 1runs intervals serially. - Put the heavy task in a Pool with N slots:
pool="warehouse_writes", pool_slots=1and a pool size of 4 means at most 4 concurrent warehouse writes across the whole Airflow instance. - Stop the backfill:
airflow dags backfill --reset-dagrunsclears the runs, or just kill the runs in the UI and clear them.
General debugging workflow
When in doubt:
- Check the UI’s Grid view — gives you state-over-time at a glance.
- Check task logs — they answer 80% of “why did it fail” questions.
airflow tasks testthe failing task locally with the same logical date — reproduces without the scheduler.- Check the metadata DB —
SELECT * FROM task_instance WHERE dag_id = ... AND state = 'failed' ORDER BY queued_dttm DESC LIMIT 20;is sometimes faster than the UI for spotting patterns. - Check scheduler/worker/triggerer process logs — for environmental issues (DB unreachable, executor misconfigured), the problem isn’t in your DAG, it’s in the platform.
13. The Taste Test
If I review a codebase and see these patterns, I form an instant opinion of the engineer.
Beginner
# DAG file:
import requests
from airflow import DAG
from airflow.models import Variable
from datetime import datetime
import pandas as pd
import tensorflow as tf # imported at top level
# Heavy config fetch at top level — runs every parse
config = requests.get("https://config.internal/airflow").json()
api_key = Variable.get("api_key") # hits the metadata DB on every parse
dag = DAG(
"data_pipeline",
start_date=datetime(2024, 1, 1), # in the past, no catchup setting → defaults to True
schedule_interval="@daily", # deprecated kwarg name in modern Airflow
)
def process(**kwargs):
# passes a dataframe through XCom
df = pd.read_parquet("/tmp/data.parquet") # local disk between tasks
return df.to_dict()
# tasks built with no retries, no defaults, repeated config
# DAG run name: my_dag with execution_date used as today's date in SQL
sql = "INSERT INTO metrics SELECT * FROM events WHERE date = '{{ execution_date }}'"
What’s wrong:
- Top-level HTTP and DB calls — run every 30s.
- Heavy imports at the top — slow parsing.
- Past
start_datewith defaultcatchup=True— instant flood of historical runs on deploy. - Using
execution_date(deprecated) for “today’s data.” INSERTwithout delete-first — duplicates on every rerun, not idempotent.- DataFrame through XCom — DB bloat, eventual failure.
- Local disk between tasks — works on dev, fails in prod.
Experienced
# DAG file:
from datetime import timedelta
import pendulum
from airflow.sdk import dag, task
DEFAULT_ARGS = {
"owner": "data-platform",
"retries": 3,
"retry_delay": timedelta(minutes=5),
"retry_exponential_backoff": True,
}
@dag(
dag_id="orders_daily",
description="Daily aggregation of orders into the metrics warehouse.",
schedule="0 2 * * *",
start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
catchup=False,
max_active_runs=1,
default_args=DEFAULT_ARGS,
tags=["finance", "daily"],
)
def orders_daily():
@task
def extract(data_interval_start, data_interval_end):
# Heavy import deferred to runtime
from my_lib.warehouse import read_orders
rows = read_orders(start=data_interval_start, end=data_interval_end)
path = f"s3://etl-staging/orders/{data_interval_start:%Y-%m-%d}/data.parquet"
write_parquet(rows, path)
return path # pointer, not payload
@task
def load(staging_path: str, data_interval_start):
# Idempotent: delete partition, then insert
run_warehouse_sql(f"""
DELETE FROM metrics.orders WHERE order_date = '{data_interval_start:%Y-%m-%d}';
COPY INTO metrics.orders FROM '{staging_path}';
""")
load(extract())
orders_daily()
What this signals:
catchup=False,max_active_runs=1,retries=3with exponential backoff — production hygiene.- Heavy imports inside the task function — DAG file parses in milliseconds.
- Uses
data_interval_startdirectly — nods/execution_dateconfusion. - Pointer-through-XCom, payload in object storage — scales.
- Idempotent load with delete-then-insert keyed by date — backfill-safe.
- Cron schedule (
"0 2 * * *") — explicit about the run time, doesn’t rely on the@dailypreset. default_argsset at the DAG level — DRY.- Pendulum for timezone-aware datetime — Airflow’s preferred datetime library.
Configuration that reveals taste
A experienced engineer’s airflow.cfg overrides include things like:
[core]
executor = CeleryExecutor
parallelism = 64
default_pool_task_slot_count = 32
[scheduler]
parsing_processes = 4
min_file_process_interval = 60 # raised because they have hundreds of DAGs
zombie_detection_interval = 30
job_heartbeat_sec = 5
[secrets]
backend = airflow.providers.hashicorp.secrets.vault.VaultBackend
[logging]
remote_logging = True
remote_base_log_folder = s3://my-bucket/airflow-logs
A beginner’s airflow.cfg is the default with one or two random changes and no Secrets Backend.
The Grid view tells the truth
Open the Grid view. A healthy DAG looks like a clean horizontal stripe of green for each task across days. An unhealthy DAG looks like a Christmas tree: red, yellow, occasional green. Lots of yellow (“up_for_retry”) with eventual green is fine — it means transient failures are being handled. Lots of red that goes green only after a manual clear means flaky retries aren’t tuned right. Stripes of skipped tasks following a pattern usually indicate trigger rules that aren’t being tracked carefully.
An experienced reviewer who sees a clean Grid view assumes the DAG was authored thoughtfully. A reviewer who sees a Christmas tree assumes the DAG is being held together by manual intervention.
14. The Downsides
Section 8.8 covered when not to choose Airflow. This section covers the things that hurt even when Airflow is the right choice — the tradeoffs you live with.
The operational tax
Airflow has more moving parts than the problem it solves seems to need. Even a modest production deployment runs: scheduler, DAG processor, API server (or webserver), one or more triggerers, a Celery worker pool (or a Kubernetes integration), Redis or RabbitMQ, Postgres, PgBouncer, and a shared filesystem or git-sync sidecar. Each of these has its own logs, its own metrics, its own failure modes, its own version pinning. You will spend real engineering time on the platform itself, separate from any time spent writing DAGs.
This is a fixed cost, not a per-DAG cost. Five DAGs cost almost the same to operate as five hundred. Below some threshold of usage, the operational tax dominates the value; above it, Airflow’s machinery starts paying off. If you have a small handful of jobs, you’ll feel the weight. If you have hundreds, you’ll appreciate it.
Managed offerings (Astronomer, Cloud Composer, MWAA) offload most of this work in exchange for money and some loss of control. They’re often the right call for teams below the “we have a dedicated platform engineer” threshold.
Python-as-DAG-format is a double-edged sword
Defining DAGs in code, not YAML, is Airflow’s defining design choice. It’s also the source of half its problems.
The good half: the full power of Python. Loops, helper functions, factories, real testing, real refactoring, real version control. Other orchestrators that use YAML or GUI builders cap out at “moderate complexity” in ways Airflow doesn’t.
The bad half: the DAG file is real Python that runs continuously in the scheduler. Top-level code runs every 30 seconds. Heavy imports are expensive. Database calls in DAG bodies hammer the metadata DB. Random values produce DAG version inflation. The mental tax of “every line in this file is paid for, repeatedly, forever” is real, and most engineers don’t internalize it until they’ve caused a production incident with it. (Mental Model 1, Section 7.1.)
Newer orchestrators (Dagster, Prefect) made the opposite tradeoff: define your pipeline in code, but parse it once at deploy time, then store the parsed form. This is strictly better from an operational standpoint, but it’s a fundamental architectural difference Airflow can’t easily retrofit.
Local development is awkward
To run an Airflow DAG locally — really run it, with the scheduler and a worker — you need most of the production stack. The Astronomer CLI and Docker Compose setups make this tractable, but compared to “run a Python function and watch it execute,” it’s heavy. You can use airflow tasks test to run individual tasks without the scheduler, which is what experienced engineers actually do, but onboarding new engineers to a “run the whole DAG locally” workflow is a real friction point.
This shows up in test culture too. Airflow DAG code is harder to unit-test than ordinary Python because tasks have implicit dependencies on Airflow’s runtime context. Teams that take testing seriously typically end up extracting their business logic into plain Python modules (testable normally) and keeping the DAG file as thin orchestration glue. That’s the right pattern, but Airflow doesn’t push you toward it.
The metadata database is a single point of failure and a scaling bottleneck
Everything goes through the metadata DB. Scheduler decisions, task state, XComs, logs of state changes, serialized DAGs, connection definitions, variables. Lose the DB; lose Airflow. Saturate the DB; everything slows.
In practice this means:
- You will run PgBouncer. Without it, connection counts will eat you.
- You will configure regular cleanup of
task_instance,dag_run,xcom, andlogtables.airflow db cleanexists for this. Without it, your DB grows forever and queries slow down. - You will treat the DB as a tier-1 production database — backups, replication, monitoring, alerting on lag. Many shops underestimate this and discover the hard way.
For most deployments this is fine; Postgres with proper care handles even large Airflow installations. But the architecture requires that level of care. There’s no graceful degradation if the DB is slow.
The scheduler is the throughput bottleneck for high-task-rate workloads
Airflow’s scheduler can dispatch on the order of low-thousands of tasks per minute on a single instance, more with multiple schedulers. For most teams this is plenty. For teams running hundreds of thousands of tasks per day with tight SLAs, the scheduler becomes the bottleneck: tasks pile up in scheduled, the scheduler loop falls behind, latency from “ready to run” to “actually running” creeps up.
Mitigations exist (multiple schedulers, careful min_file_process_interval tuning, dynamic task iteration in 3.3+ to reduce task instance count, fewer dependencies per DAG), but at extreme scale you may find Airflow’s per-task overhead more expensive than the value it provides for ultra-fast tasks. This is a tail-of-the-distribution problem, but it’s real.
The UI is functional, not delightful
Airflow’s web UI has improved considerably (especially the React rewrite in 3.0), but it’s a tool for operators, not a polished product. The Grid view is the workhorse, and it’s good. Beyond that: log search is per-task-instance and clunky, lineage visualization is rudimentary, custom dashboards require external tooling, the gap between “run failed” and “I know why” still routes through reading logs.
The UI is not where this product invests. If you want rich observability, you’ll build it externally — Datadog, Grafana, Prometheus exporters, custom dashboards on top of the metadata DB.
The provider ecosystem is a strength and a quality risk
Airflow has hundreds of providers covering nearly every data system you might want to talk to. This is a genuine moat. But providers vary wildly in quality. Some are maintained by the upstream vendor (Snowflake, Databricks, AWS) and are excellent. Some are community-maintained and lag behind their underlying APIs. Some have surprising defaults, undocumented behaviors, or subtle differences from their non-Airflow counterparts.
Trust the provider for the well-loved cases. Read the source for anything important. Don’t assume “there’s a provider for it” means “the integration works perfectly.”
Versioning and upgrades require real care
Airflow upgrades are not casual. The 2.x → 3.x transition removed direct database access from task code, moved standard operators to a separate provider package, split the webserver into API server + DAG processor, and changed JWT key handling — every one of these is a breaking change requiring code review and testing. Even minor version bumps occasionally introduce surprises.
Pin your Airflow version in your image and your provider versions in your requirements. Use the official constraints file for installation. Treat upgrades as a project, not a maintenance task.
Data-awareness is a retrofitted concept
Airflow was born task-first (“orchestrate these steps”); the data-aware features (Datasets/Assets, Asset partitions, event-driven scheduling) have been added gradually over the 2.x and 3.x releases. They work, and they’re improving fast. But the core abstraction is still “DAG of tasks,” and asset reasoning is layered on top. Tools designed asset-first from inception (Dagster, in particular) feel more natural for “I want my pipeline to revolve around the freshness and lineage of these tables.”
If your day-to-day questions are “is the orders table up-to-date?” and “what depends on the orders table?”, you’ll find yourself working slightly against Airflow’s grain. Possible — many teams do it — but slightly against the grain.
Cost of complex DAGs is paid in operational toil
A 200-task DAG with intricate trigger rules, dynamic task mapping, sub-DAGs (deprecated, but still seen), and branching is a maintenance burden. The Grid view becomes hard to read. Re-runs become surgical. Failures cascade in ways that are hard to predict. This isn’t unique to Airflow — any orchestrator will struggle with deeply complex DAGs — but Airflow’s defaults don’t strongly push you toward simplicity. The discipline to keep DAGs small and focused has to come from the team.
What this list adds up to
Airflow is a powerful, mature, and broadly capable tool with a real operational cost and some genuinely awkward design legacies. None of these downsides is a deal-breaker on its own. Together, they explain why teams sometimes choose alternatives, and why even teams who stay with Airflow invest heavily in tooling, conventions, and managed services to soften the rough edges.
The honest summary: Airflow is the boring choice that works, with a tax. Whether the tax is worth it depends on your scale, your team, and what you’re orchestrating.
15. Where to Go Deeper
Curated, opinionated. Skip the rest.
-
The official Airflow docs — Concepts and Best Practices sections.
https://airflow.apache.org/docs/apache-airflow/stable/. Most blog posts paraphrase these badly. The official pages on DAG Runs, Best Practices, and Deferring are unusually well-written for OSS docs. Read them once you’ve used Airflow for a few weeks; you’ll understand 30% more than the first time you tried. -
“Data Pipelines with Apache Airflow” by Bas Harenslak and Julian de Ruiter (Manning). The canonical book. It’s Airflow 2-era but the mental models translate. The chapters on idempotency, scheduling, and dependencies are worth the price of the book even if you skip the rest.
-
Astronomer’s documentation and webinars.
https://www.astronomer.io/docs/learn. Astronomer is the commercial steward of Airflow; their guides on the executor zoo, deferrable operators, and dynamic task mapping are the best intermediate-level material on the internet. They have an obvious commercial interest, but the technical content is solid. -
Maxime Beauchemin’s blog posts (“The Rise of the Data Engineer,” “The Downfall of the Data Engineer,” and the original Airflow design doc). The creator’s perspective on why Airflow is shaped the way it is. Read these to understand the historical context — what Airflow was solving and what it deliberately didn’t try to solve.
-
The Airflow Summit talks, especially “Deep Dive into the Airflow Scheduler” (multiple years). Free on YouTube. The scheduler internals talk in particular will save you hours of mystified log-staring.
-
Build a real pipeline end-to-end. Pick a task: ingest from a public API daily, transform, load into a local Postgres or DuckDB, fire a Slack notification. Use the TaskFlow API. Use deferrable sensors. Use a custom XCom backend. Use Pools. Configure a Secrets Backend. Run it for a month and let things break. The intuition you build from one production-shaped DAG that you operate is worth ten tutorials.
-
The Airflow GitHub repo’s
airflow/example_dags/directory.https://github.com/apache/airflow/tree/main/airflow/example_dags. The examples maintained by the project maintainers are a better stylistic reference than most blogs. They show idiomatic usage of every feature. -
Compare with Dagster (and to a lesser extent, Prefect). Spend a weekend writing the same pipeline in Dagster. You don’t have to switch — but seeing how an asset-first orchestrator differs sharpens your understanding of Airflow’s task-first model. You’ll come back understanding Airflow’s strengths and its philosophical limits.
The depth comes from reps. Read the docs. Build the pipeline. Operate it. Get paged at 3am once. Then read the docs again.
16. Conclusions
If you take only a handful of things away from this document, take these.
Airflow is a scheduler, not a data engine. Internalize this and a hundred design choices stop being mysterious. Tasks should call out to specialized systems (warehouses, Spark, dbt, Kubernetes) to do real work. Airflow’s job is to know which systems to call, in what order, with what parameters, and what to do when one fails. Use it for what it is.
The DAG file is a Python program that runs continuously in the scheduler. The single most consequential fact about Airflow’s runtime model. Top-level code is paid for, repeatedly, on every parse. Heavy imports, network calls, database queries, and dynamic constructor arguments at the top level are not bugs of style — they are concrete performance and stability problems that scale with your DAG count. Keep DAG file bodies to imports and structure; defer all real work to inside task functions.
Time in Airflow is interval-based, not instant-based. A DAG run with logical date 2026-01-04 represents the data interval [2026-01-04, 2026-01-05) and runs after the interval ends. If you remember this, you’ll never be surprised by a daily DAG that “doesn’t run on the day you deployed it” or by a task that processes “yesterday’s data when I asked for today’s.” Use data_interval_start and data_interval_end. Treat ds and execution_date as legacy concepts.
Idempotency is your responsibility. Airflow will rerun your tasks — for retries, for clears, for backfills, for catch-up. It assumes idempotency. It does not enforce it. Every task that writes to an external system should be safe to run multiple times: delete-then-insert, upsert/merge keyed by date, content-addressable outputs, transactionally rollback-able mutations. Build this in from the start; retrofitting it onto a non-idempotent pipeline is painful.
XComs carry pointers, not payloads. The metadata database is the central nervous system, and you should treat it as such. Don’t shovel data through it. Put data in object storage; pass the path through XCom. This single rule will save you from a class of failures that look mysterious until you understand them.
Choose your executor and your sensors carefully. Default to CeleryExecutor for most production workloads, KubernetesExecutor for heterogeneous-runtime jobs, LocalExecutor for tiny deployments and dev. Default to mode='reschedule' or deferrable variants for sensors that wait more than a couple of minutes. These two choices alone determine whether your cluster lives in steady state or perpetually feels overloaded.
Treat the metadata database as a first-class production system. PgBouncer, regular cleanup, monitoring, backups, replication. Without this, the database becomes a recurring source of pain.
The deeper insight underneath all of these: Airflow is a system whose abstractions leak, in well-defined and learnable ways, but only after you understand what they’re abstracting over. Scheduler decisions are over the metadata database. Operators are over BaseOperator’s dispatch loop. Sensors are over a poke() polling pattern. XCom is over rows in a database. The @task decorator is over PythonOperator instantiation. Once you can mentally translate any DAG you read into the underlying machinery, surprises stop being surprises and become predictable consequences.
That’s the difference between “I know how to write Airflow DAGs” and “I understand Airflow.” The first lets you produce working pipelines. The second lets you predict what will happen when you change something, debug what went wrong when something fails, and decide when Airflow is — and isn’t — the right tool for the job at hand.
This document was about getting you to the second.
The ideas are mine. The writing is AI assisted