Airflow
This documentation is for the new Analytical Platform Airflow service.
For Data Engineering Airflow, please refer to Data Engineering Airflow.
Overview
Apache Airflow is a workflow management platform. Analytical Platform users primarily use it for:
- automating data engineering pipelines
- training machine learning models
- reproducible analytical pipelines (RAP)
We recommend using it for long-running or compute-intensive tasks.
Workflows are executed on the Analytical Platform’s Kubernetes infrastructure and can interact with services such as Amazon Athena, Amazon Bedrock, and Amazon S3.
Our Kubernetes infrastructure is connected to the MoJO Transit Gateway, which connects to:
- MoJ Cloud Platform
- MoJ Modernisation Platform
- HMCTS SDP
If you need additional connectivity, submit a feature request.
Please note: You cannot use Analytical Platform Airflow for workflows using BashOperator or PythonOperator.
Concepts
Analytical Platform Airflow is made up of environments, projects and workflows:
- Environments are the different stages of infrastructure we provide: development, test and production.
Please note: development is not connected to the MoJO Transit Gateway.
- Projects are a unit for grouping workflows dedicated to a distinct business domain, service area, or specific project, for example: BOLD, HMCTS or HMPPS.
- Workflows are pipelines, also known as directed acyclic graphs (DAGs). They consist of a list of tasks organised to reflect the relationships between them. The workflow definition includes additional information, such as your repository name and release tag.
Getting started
Before you can use Airflow, you’ll need to:
- request Airflow access
- create a GitHub repository
- create a GitHub release
- create an Airflow pipeline project and workflow
Follow the steps below to get started.
Request Airflow access
To access the Airflow components, you’ll need to:
- have a GitHub account (see our Quickstart guide)
- join the ministryofjustice GitHub organisation
If you are a member of Data Engineering’s GitHub team (@ministryofjustice/data-engineering), you are automatically granted access and do not need to submit a request.
When you have joined the ministryofjustice GitHub organisation, submit a request for Airflow access.
After your request is granted, you will be added to a GitHub team that gives you access to our GitHub repository and AWS environments.
Our team manually approves requests. Once approved, it can take up to three hours to gain access to AWS.
Create a GitHub repository
If you already have a repository you’ve used for Data Engineering Airflow, please refer to migrating from Data Engineering Airflow.
1. Create a repository using one of the provided runtime templates.
You can create this repository in either the ministryofjustice or moj-analytical-services GitHub organisation. Repository standards, such as branch protection, are out of scope for this guidance.
For more information on runtime templates, please refer to runtime templates.
2. Add your code to the repository, including the script(s) you want Airflow to run and a file for your package management.
3. Update the Dockerfile instructions to copy your code into the image, install the packages required to run, and call the script(s) to run. For example, for Python:
FROM ghcr.io/ministryofjustice/analytical-platform-airflow-python-base:1.7.0@sha256:5de4dfa5a59c219789293f843d832b9939fb0beb65ed456c241b21928b6b8f59
USER root
COPY requirements.txt requirements.txt
COPY src/ .
RUN pip install -r requirements.txt
USER ${CONTAINER_UID}
ENTRYPOINT ["python3", "main.py"]
For more information on runtime images, please refer to runtime images.
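As an illustrative sketch (not part of the template), the main.py called by the ENTRYPOINT above could be as simple as:

import os

def main():
    # AIRFLOW_ENVIRONMENT is one of the default variables injected into every task
    print(f"Running in {os.environ.get('AIRFLOW_ENVIRONMENT', 'unknown')}")

if __name__ == "__main__":
    main()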
Create a GitHub release
Follow GitHub’s documentation on creating a release. Make note of the release tag.
After you’ve created a release, check if your container image has been successfully built and published by logging in to the Analytical Platform Common Production AWS account.
You can also see our example Python repository.
Create an Airflow pipeline project and workflow
To initialise an Airflow pipeline project, create a directory in the relevant environment in the Airflow repository, for example, environments/development/analytical-platform.
To create an Airflow pipeline workflow (a DAG), you need to provide a workflow manifest file (workflow.yml) in your project under a workflow identifier name.
This manifest file specifies the desired state for the workflow, and provides contextual information used to categorise and label the workflow.
For example, create environments/development/analytical-platform/example/workflow.yml, where example is an identifier for your workflow’s name.
The minimum requirements for a workflow manifest look like this:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
maintainers:
  - jacobwoffenden
tags:
  business_unit: Central Digital
  owner: analytical-platform@justice.gov.uk
- dag.repository is the name of the GitHub repository where your code is stored and your release has been created
- dag.tag is the tag you used when creating the release in your GitHub repository
- maintainers are a list of GitHub usernames of individuals responsible for maintaining the workflow and updating any secret values
- tags.business_unit must be one of Central Digital, CICA, HMCTS, HMPPS, HQ, LAA, OPG, Platforms, Technology Services
- tags.owner must be an email address ending with @justice.gov.uk
Providing a tags.business_unit other than Central Digital, CICA, HMCTS, HMPPS, HQ, LAA, OPG, Platforms, or Technology Services will result in an error.
Workflow scheduling
There are several options for configuring your workflow’s schedule. By default, if no options are specified, you must trigger your workflow manually in the Airflow console.
The following options are available under dag:
- catchup: please refer to Airflow’s guidance (defaults to false)
- depends_on_past: when set to true, task instances will run sequentially while relying on the previous task’s schedule to succeed (defaults to false)
- end_date: the timestamp (YYYY-MM-DD) that the scheduler won’t go beyond (defaults to null)
- is_paused_upon_creation: specifies if the dag is paused when created for the first time (defaults to false)
- max_active_runs: maximum number of active workflow runs (defaults to 1)
- retries: the number of retries that should be performed before failing the task (defaults to 0)
- retry_delay: delay in seconds between retries (defaults to 300)
- schedule: cron expression that defines how often the workflow runs (defaults to null)
- start_date: the timestamp (YYYY-MM-DD) from which the scheduler will attempt to backfill (defaults to 2025-01-01)
The example-schedule workflow shows an example of a workflow that runs at 08:00 every day and retries 3 times, with a 150-second delay between each retry:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  retries: 3
  retry_delay: 150
  schedule: "0 8 * * *"
Workflow tasks
Providing the minimum keys under dag will create a main task. This task will execute the entrypoint of your container and provide a set of default environment variables; for example, in development:
AWS_DEFAULT_REGION=eu-west-1
AWS_ATHENA_QUERY_EXTRACT_REGION=eu-west-1
AWS_DEFAULT_EXTRACT_REGION=eu-west-1
AWS_METADATA_SERVICE_TIMEOUT=60
AWS_METADATA_SERVICE_NUM_ATTEMPTS=5
AIRFLOW_ENVIRONMENT=DEVELOPMENT
Environment variables
To pass extra environment variables, you can reference them in env_vars, like this:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  env_vars:
    x: "1"
Compute profiles
We provide a mechanism for requesting minimum levels of CPU and memory from our Kubernetes cluster. You can additionally specify whether your workflow must run on on-demand compute or can run on spot compute (which can be disrupted).
This is done using the dag.compute_profile key. By default (if not specified), your workflow task will use general-spot-1vcpu-4gb, which means:
- general: the compute fleet
- spot: the compute type
- 1vcpu: 1 vCPU is guaranteed
- 4gb: 4GB of memory is guaranteed
In addition to the general fleet, we also offer gpu, which provides your workflow with an NVIDIA GPU pre-installed with CUDA.
The full list of available compute profiles can be found here.
Analytical Platform tooling (such as JupyterLab, RStudio and Visual Studio Code) has access to 1 vCPU and 12GB RAM. The closest compute profile is general-on-demand-4vcpu-16gb.
Multi-task
Workflows can also run multiple tasks, with dependencies on other tasks in the same workflow. To enable this, specify the tasks key, for example:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  env_vars:
    x: "1"
  tasks:
    init:
      env_vars:
        y: "0"
    phase-one:
      env_vars:
        y: "1"
      compute_profile: general-spot-2vcpu-8gb
      dependencies: [init]
    phase-two:
      env_vars:
        y: "2"
      compute_profile: gpu-spot-1vcpu-4gb
    phase-three:
      env_vars:
        x: "2"
        y: "3"
      dependencies: [phase-one, phase-two]
Tasks take the same keys (env_vars and compute_profile) and can also take dependencies, which can be used to make a task dependent on other tasks completing successfully.
You can define global environment variables under dag.env_vars, making them available in all tasks. You can then override these by specifying the same environment variable key in the task.
compute_profile can either be specified at dag.compute_profile to set it for all tasks, or at dag.tasks.{task_name}.compute_profile to override it for a specific task.
Workflow identity
By default, for each workflow, we create an associated IAM policy and IAM role in the Analytical Platform’s Data Production AWS account.
The name of your workflow’s role is derived from its environment, project, and workflow: airflow-${environment}-${project}-${workflow}.
To extend the permissions of your workflow’s IAM policy to include access to Athena, Bedrock, Glue, KMS ARNs and/or S3 buckets, use the top-level iam key in your workflow manifest, for example:
iam:
  athena: write
  bedrock: true
  glue: true
  kms:
    - arn:aws:kms:eu-west-2:123456789012:key/mrk-12345678909876543212345678909876
  s3_deny:
    - mojap-compute-development-dummy/deny1/*
    - mojap-compute-development-dummy/deny2/*
  s3_read_only:
    - mojap-compute-development-dummy/readonly1/*
    - mojap-compute-development-dummy/readonly2/*
  s3_read_write:
    - mojap-compute-development-dummy/readwrite1/*
    - mojap-compute-development-dummy/readwrite2/*
  s3_write_only:
    - mojap-compute-development-dummy/writeonly1/*
    - mojap-compute-development-dummy/writeonly2/*
- iam.athena: can be read or write, to provide access to Amazon Athena
- iam.bedrock: when set to true, enables Amazon Bedrock access
- iam.glue: when set to true, enables AWS Glue access
- iam.kms: a list of KMS ARNs used for encrypt and decrypt operations if objects are KMS encrypted
- iam.s3_deny: a list of Amazon S3 paths to deny access to
- iam.s3_read_only: a list of Amazon S3 paths to provide read-only access to
- iam.s3_read_write: a list of Amazon S3 paths to provide read-write access to
- iam.s3_write_only: a list of Amazon S3 paths to provide write-only access to
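Your task’s code does not need to assume this role explicitly; the role’s credentials are made available to the task automatically, so AWS SDKs pick them up without extra configuration. A minimal sketch in Python with boto3, reusing the dummy bucket paths from the example above:

import boto3

s3 = boto3.client("s3")  # credentials for the workflow's IAM role are discovered automatically
obj = s3.get_object(
    Bucket="mojap-compute-development-dummy",  # dummy bucket from the example manifest
    Key="readonly1/input.csv",  # hypothetical object under a read-only path
)
data = obj["Body"].read()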
Advanced configuration
External IAM roles
If you would like your workflow’s identity to run in an account that is not Analytical Platform Data Production, you can provide the ARN using iam.external_role, for example:
iam:
  external_role: arn:aws:iam::123456789012:role/this-is-not-a-real-role
You must have an IAM Identity Provider using the associated environment’s Amazon EKS OpenID Connect provider URL. Please refer to Amazon’s documentation. We can provide the Amazon EKS OpenID Connect provider URL upon request.
You must also create a role that is enabled for IRSA. We recommend using this Terraform module.
You must use the following when referencing service accounts:
mwaa:${project}-${workflow}
Workflow secrets
To provide your workflow with sensitive information, such as a username, password or API key, you can pass a list of secret identifiers using the secrets key in your workflow manifest, for example:
secrets:
  - username
  - password
  - api-key
This will create an encrypted secret in AWS Secrets Manager under the following path: /airflow/${environment}/${project}/${workflow}/${secret_id} (which you can populate with the secret value via the AWS console). The value will then be injected into your container as an environment variable, for example:
SECRET_USERNAME=xxxxxx
SECRET_PASSWORD=yyyyyy
SECRET_API_KEY=zzzzzz
Secret names with hyphens (-) will be converted to use underscores (_) for the environment variable.
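Reading a secret in your task’s code is then plain environment variable access; for example, in Python:

import os

api_key = os.environ["SECRET_API_KEY"]  # populated from the api-key secret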
Updating a secret value
Secrets are initially created with a placeholder value. To update this, your GitHub username must be listed in the maintainers section of the workflow manifest file; you can then log in to the Analytical Platform Data Production AWS account and update the value.
Workflow notifications
To enable email notifications, add the following to your workflow manifest:
notifications:
  emails:
    - analytical-platform@justice.gov.uk
    - data-platform@justice.gov.uk
Slack
To enable Slack notifications, you need to:
1. Add the following to your workflow manifest:
notifications:
  slack_channel: your-channel-name # e.g. analytical-platform
2. Invite Analytical Platform’s Slack application (@Analytical Platform) to your channel
Workflow logs and metrics
This functionality is coming soon.
Accessing the Airflow console
To access the Airflow console, you can use these links:
Runtime templates
We provide repository templates for the supported runtimes:
These templates include:
- GitHub Actions workflow to build and scan your container for vulnerabilities with Trivy
- GitHub Actions workflow to build and test your container’s structure
- GitHub Actions workflow to perform a dependency review of your repository, if it’s public
- GitHub Actions workflow to build and push your container to the Analytical Platform’s container registry
- Dependabot configuration for updating GitHub Actions, Docker, and dependencies such as Pip
The GitHub Actions workflows call shared workflows we maintain here.
Vulnerability scanning
The GitHub Actions workflow builds and scans your container for vulnerabilities with Trivy, alerting you to any CVEs (Common Vulnerabilities and Exposures) marked as HIGH or CRITICAL that have a fix available. You will need to either update the offending package or skip the CVE by adding it to .trivyignore in the root of your repository.
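For example, a .trivyignore file is just a list of CVE identifiers, one per line (the ID below is hypothetical):

# Accept this vulnerability until a fix is released
CVE-2099-12345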
Configuration testing
To ensure your container is running as the right user, we perform a test using Google’s Container Structure Test tool.
The source for the test can be found here.
Runtime images
We provide container images for the supported runtimes:
These images include:
- AWS CLI
- NVIDIA GPU drivers
Additionally, we create a non-root user (analyticalplatform) and a working directory (/opt/analyticalplatform).
Installing system packages
Our runtime images are set to run as a non-root user (analyticalplatform), which cannot install system packages.
To install system packages, you will need to switch to root, perform any installations, and switch back to analyticalplatform, for example:
FROM ghcr.io/ministryofjustice/analytical-platform-airflow-python-base:1.6.0
# Switch to root
USER root
RUN <<EOF
apt-get update # Refresh APT package lists
apt-get install --yes ${PACKAGE} # Install packages
apt-get clean --yes # Clear APT cache
rm --force --recursive /var/lib/apt/lists/* # Clear APT package lists
EOF
# Switch back to analyticalplatform
USER ${CONTAINER_UID}
Migrating from Data Engineering Airflow
GitHub repository
If you have an existing repository that was created using moj-analytical-services/template-airflow-python or moj-analytical-services/template-airflow-r, you need to perform the following actions:
1. Remove .github/workflows/ecr_push.yml
2. Add the GitHub Actions workflows (.github/workflows) from the equivalent runtime template
3. Add the Dependabot configuration (.github/dependabot.yml) from the equivalent runtime template
4. Refactor your Dockerfile to consume the equivalent runtime image
Refactoring your Dockerfile may cause issues, as the legacy templates contained older versions of Python and R, did not provide a non-root user, and used a different working directory. We maintain a repository that can serve as a reference for how to use our runtime image; you can find it here.
Please note: We require that you use our runtime images as we regularly update the operating system and software.
Airflow configuration
IAM
We do not provide a way of reusing the IAM role from Data Engineering Airflow. You will need to populate iam with the same configuration and update any external references to use the new role format; please refer to workflow identity.
Secrets
We do not provide a way of reusing secrets or parameters from Data Engineering Airflow. You will need to follow workflow secrets and update your application code to consume the injected variables, or retrieve the value from AWS Secrets Manager (AWS documentation).
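If you choose to retrieve values from AWS Secrets Manager directly instead of using the injected variables, a minimal sketch in Python with boto3 (the path segments below are illustrative):

import boto3

client = boto3.client("secretsmanager")
# Path format: /airflow/${environment}/${project}/${workflow}/${secret_id}
response = client.get_secret_value(
    SecretId="/airflow/development/analytical-platform/example/api-key"
)
api_key = response["SecretString"]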
Getting help
If you have any questions about Analytical Platform Airflow, please reach out to us on Slack in the #ask-analytical-platform channel.
For assistance, you can raise a support issue.