
Airflow

This documentation is for the new Analytical Platform Airflow service.

For Data Engineering Airflow, please refer to Data Engineering Airflow.

Overview

Apache Airflow is a workflow management platform for data engineering pipelines.

We recommend using it for long-running or compute-intensive tasks.

Pipelines are executed on the Analytical Platform’s Kubernetes infrastructure and can interact with services such as Amazon Athena, Amazon Bedrock, and Amazon S3.

Our Kubernetes infrastructure is connected to the MoJO Transit Gateway, which connects to:

  • MoJ Cloud Platform
  • MoJ Modernisation Platform
  • HMCTS SDP

If you need additional connectivity, submit a feature request.

Please note: You cannot use Analytical Platform Airflow for pipelines using BashOperator or PythonOperator

Concepts

We organise Airflow pipelines using environments, projects and workflows:

  • Environments are the different stages of infrastructure we provide: development, test and production

Please note: development is not connected to the MoJO Transit Gateway

  • Projects are a unit for grouping workflows dedicated to a distinct business domain, service area, or specific project, for example: BOLD, HMCTS or HMPPS.

  • Workflows are pipelines, also known as DAGs. They consist of a list of tasks organised to reflect the relationships between them. The workflow definition includes additional information, such as your repository name and release tag.

Getting started

Before you can use Airflow, you'll need to request Airflow access, create a GitHub repository, create a release, and create a project and workflow.

Follow the steps below to get started.

Request Airflow access

To access the Airflow components, you'll need to be a member of the ministryofjustice GitHub organisation.

When you have joined the ministryofjustice GitHub organisation, submit a request for Airflow access.

After your request is granted, you will be added to a GitHub team that gives you access to our GitHub repository and AWS environments.

Our team manually approves requests

Once approved, it can take up to three hours to gain access to AWS

Create a GitHub repository

Even if you already have a repository that you've used for Airflow, you should create a new one.

  1. Create a repository using one of the provided runtime templates:

    • Python
    • R (coming soon)

    You can create this repository in either the ministryofjustice or moj-analytical-services GitHub organisation

    Repository standards, such as branch protection, are out of scope for this guidance

    For more information on runtime templates, please refer to runtime templates

  2. Add your code to the repository

  3. Update the Dockerfile instructions to copy your code into the image and install the packages required to run it

For more information on runtime images, please refer to runtime images

Create a release

  1. Follow GitHub’s documentation on creating a release

  2. After you’ve created a release, check if your container image has been successfully built and published by logging in to the Analytical Platform Common Production AWS account

You can also see our example repository.

Create a project and workflow

To initialise a project, create a directory in the relevant environment in our repository, for example, environments/development/analytical-platform.

To create a workflow, you need to provide us with a workflow manifest in your project.

This manifest specifies the desired state for the Airflow DAG, and provides contextual information used to categorise and label the DAG.

For example, create environments/development/analytical-platform/example/workflow.yml, where example is an identifier for your workflow’s name.

The minimum requirements for a workflow manifest look like this:

dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0

maintainers:
  - jacobwoffenden

tags:
  business_unit: Central Digital
  owner: analytical-platform@justice.gov.uk

  • dag.repository is the name of the GitHub repository where your code is stored
  • dag.tag is the tag you used when creating a release in your GitHub repository
  • maintainers is a list of GitHub usernames of individuals responsible for maintaining the workflow
  • tags.business_unit must be one of Central Digital, CICA, HMCTS, HMPPS, HQ, LAA, OPG, Platforms, Technology Services
  • tags.owner must be an email address ending with @justice.gov.uk

Workflow scheduling

There are several options for configuring your workflow’s schedule. If no options are specified, you must trigger the workflow manually in the Airflow console.

The following options are available under dag:

  • catchup: please refer to Airflow’s guidance (defaults to false)
  • depends_on_past: when set to true, a task instance will only run if the same task succeeded in the previous scheduled run (defaults to false)
  • end_date: the timestamp (YYYY-MM-DD) that the scheduler won’t go beyond (defaults to null)
  • is_paused_upon_creation: specifies whether the DAG is paused when it is first created (defaults to false)
  • max_active_runs: maximum number of active workflow runs (defaults to 1)
  • retries: the number of retries that should be performed before failing the task (defaults to 0)
  • retry_delay: delay in seconds between retries (defaults to 300)
  • schedule: cron expression that defines how often the workflow runs (defaults to null)
  • start_date: the timestamp (YYYY-MM-DD) from which the scheduler will attempt to backfill (defaults to 2025-01-01)

The example-schedule workflow below sets some of the scheduling options, retrying failed tasks up to three times with 150 seconds between retries and running daily at 08:00:

dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  retries: 3
  retry_delay: 150
  schedule: "0 8 * * *"

Workflow tasks

Providing the minimum keys under dag will create a main task. This task will execute the entrypoint of your container and provide a set of default environment variables; for example, in development:

AWS_DEFAULT_REGION=eu-west-1
AWS_ATHENA_QUERY_EXTRACT_REGION=eu-west-1
AWS_DEFAULT_EXTRACT_REGION=eu-west-1
AWS_METADATA_SERVICE_TIMEOUT=60
AWS_METADATA_SERVICE_NUM_ATTEMPTS=5
AIRFLOW_ENVIRONMENT=DEVELOPMENT

Environment variables

To pass extra environment variables to your container, define them under env_vars, like this:

dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  env_vars:
    FOO: "bar"

Compute profiles

We provide a mechanism for requesting minimum levels of CPU and memory from our Kubernetes cluster. You can additionally specify whether your workflow must run on on-demand compute or can run on spot compute (which can be disrupted).

This is done using the compute_profile key, and by default (if not specified), your workflow task will use general-spot-1vcpu-4gb, which means:

  • general: the compute fleet
  • spot: the compute type
  • 1vcpu: 1 vCPU is guaranteed
  • 4gb: 4GB of memory is guaranteed

In addition to the general fleet, we also offer gpu, which provides your workflow with an NVIDIA GPU.

The full list of available compute profiles can be found here.
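
For example, to run every task in a workflow on the GPU fleet, you could set compute_profile under dag. This is a minimal sketch; check the published list for valid profile names:

dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  compute_profile: gpu-spot-1vcpu-4gb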

Multi-task

Workflows can also run multiple tasks, with dependencies on other tasks in the same workflow. To enable this, specify the tasks key, for example:

dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  env_vars:
    FOO: "bar"
  tasks:
    init:
      env_vars:
        PHASE: "init"
    phase-one:
      env_vars:
        PHASE: "one"
      dependencies: [init]
    phase-two:
      env_vars:
        PHASE: "two"
      dependencies: [phase-one]
    phase-three:
      env_vars:
        FOO: "baz"
        PHASE: "three"
      compute_profile: gpu-spot-1vcpu-4gb
      dependencies: [phase-one, phase-two]

Tasks take the same keys (env_vars and compute_profile) and can also take dependencies, which can be used to make a task dependent on other tasks completing successfully.

You can define global environment variables under dag.env_vars, making them available in all tasks. You can then override these by specifying the same environment variable key in the task.

compute_profile can either be specified at dag.compute_profile to set it for all tasks, or at dag.tasks.{task_name}.compute_profile to override it for a specific task.

Workflow identity

By default, for each workflow, we create an associated IAM policy and IAM role in the Analytical Platform’s Data Production AWS account.

The name of your workflow’s role is derived from its environment, project, and workflow: airflow-${environment}-${project}-${workflow}.
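
For example, the example workflow created earlier under environments/development/analytical-platform would use the following role name:

airflow-development-analytical-platform-example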

To extend the permissions of your workflow’s IAM policy, you can do so under the top-level iam key in your workflow manifest, for example:

iam:
  athena: write
  bedrock: true
  glue: true
  kms:
    - arn:aws:kms:eu-west-2:123456789012:key/mrk-12345678909876543212345678909876
  s3_deny:
    - mojap-compute-development-dummy/deny1/*
    - mojap-compute-development-dummy/deny2/*
  s3_read_only:
    - mojap-compute-development-dummy/readonly1/*
    - mojap-compute-development-dummy/readonly2/*
  s3_read_write:
    - mojap-compute-development-dummy/readwrite1/*
    - mojap-compute-development-dummy/readwrite2/*
  s3_write_only:
    - mojap-compute-development-dummy/writeonly1/*
    - mojap-compute-development-dummy/writeonly2/*

  • iam.athena: Can be read or write, to provide access to Amazon Athena
  • iam.bedrock: When set to true, enables Amazon Bedrock access
  • iam.glue: When set to true, enables AWS Glue access
  • iam.kms: A list of KMS key ARNs used for encrypt and decrypt operations if objects are KMS encrypted
  • iam.s3_deny: A list of Amazon S3 paths to which access is denied
  • iam.s3_read_only: A list of Amazon S3 paths to which read-only access is granted
  • iam.s3_read_write: A list of Amazon S3 paths to which read-write access is granted
  • iam.s3_write_only: A list of Amazon S3 paths to which write-only access is granted

Advanced configuration

External IAM roles

If you would like your workflow to assume a role in an account other than Analytical Platform Data Production, you can provide the role’s ARN using iam.external_role, for example:

iam:
  external_role: arn:aws:iam::123456789012:role/this-is-not-a-real-role

You must have an IAM Identity Provider using the associated environment’s Amazon EKS OpenID Connect provider URL. Please refer to Amazon’s documentation. We can provide the Amazon EKS OpenID Connect provider URL upon request.

You must also create a role that is enabled for IAM Roles for Service Accounts (IRSA). We recommend using this Terraform module. You must use the following when referencing service accounts:

mwaa:${project}-${workflow}
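
For example, for the example workflow in the analytical-platform project, the service account reference would be:

mwaa:analytical-platform-example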

Workflow secrets

To provide your workflow with sensitive information, such as a username, password or API key, you can pass a list of secret identifiers using the secrets key in your workflow manifest, for example:

secrets:
  - username
  - password
  - api-key

This will create an encrypted secret in AWS Secrets Manager under the path /airflow/${environment}/${project}/${workflow}/${secret_id}, which you can populate with the secret value via the AWS console. The value is then injected into your container as an environment variable, for example:

SECRET_USERNAME=xxxxxx
SECRET_PASSWORD=yyyyyy

Hyphens (-) in secret names are converted to underscores (_) in the environment variable name.
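
For example, the api-key secret from the manifest above would be available as:

SECRET_API_KEY=zzzzzz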

Updating a secret value

Secrets are initially created with a placeholder value. To update this, log in to the Analytical Platform Data Production AWS account and update the value.

Workflow notifications

Email

To enable email notifications, you need to:

  1. Add the following to your workflow manifest:

    notifications:
      emails:
        - analytical-platform@justice.gov.uk
        - data-platform@justice.gov.uk
    

Slack

To enable Slack notifications, you need to:

  1. Add the following to your workflow manifest:

    notifications:
      slack_channel: your-channel-name # e.g. analytical-platform
    
  2. Invite Analytical Platform’s Slack application (@Analytical Platform) to your channel
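
Both notification types sit under the same notifications key, so a single manifest can define email recipients and a Slack channel. This is a sketch, assuming the two keys can be combined and the Slack setup above is complete:

notifications:
  emails:
    - analytical-platform@justice.gov.uk
  slack_channel: analytical-platform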

Workflow logs and metrics

This functionality is coming soon

Accessing the Airflow console

To access the Airflow console, you can use these links:

Runtime templates

We provide repository templates for the supported runtimes:

  • Python
  • R (coming soon)

These templates include:

  • GitHub Actions workflow to build and scan your container for vulnerabilities with Trivy
  • GitHub Actions workflow to build and test your container’s structure
  • GitHub Actions workflow to perform a dependency review of your repository, if it’s public
  • GitHub Actions workflow to build and push your container to the Analytical Platform’s container registry
  • Dependabot configuration for updating GitHub Actions, Docker, and dependencies such as Pip

The GitHub Actions workflows call shared workflows we maintain here.

Vulnerability scanning

The GitHub Actions workflow builds and scans your container for vulnerabilities with Trivy, alerting you to any CVEs (Common Vulnerabilities and Exposures) marked as HIGH or CRITICAL that have a fix available. You will need to either update the offending package or skip the CVE by adding it to .trivyignore in the root of your repository.
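
A .trivyignore file is a plain-text list of CVE IDs that Trivy should skip, one per line, with optional comments. A minimal sketch (the CVE ID below is a hypothetical placeholder):

# hypothetical example: skip a CVE with no fixed package available yet
CVE-2024-12345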

Configuration testing

To ensure your container is running as the right user, we perform a test using Google’s Container Structure Test tool.

The source for the test can be found here.

Runtime images

We provide container images for the supported runtimes:

These images include:

  • Ubuntu base image
  • AWS CLI
  • NVIDIA GPU drivers

Additionally, we create a non-root user (analyticalplatform) and a working directory (/opt/analyticalplatform).

Migration from Data Engineering Airflow

TBC

Getting help

If you have any questions about Analytical Platform Airflow, please reach out to us on Slack in the #ask-analytical-platform channel.

For assistance, you can raise a support issue.
