Airflow
This documentation is for the new Analytical Platform Airflow service.
For Data Engineering Airflow, please refer to Data Engineering Airflow.
Overview
Apache Airflow is a workflow management platform for data engineering pipelines.
We recommend using it for long-running or compute-intensive tasks.
Pipelines are executed on the Analytical Platform’s Kubernetes infrastructure and can interact with services such as Amazon Athena, Amazon Bedrock, and Amazon S3.
Our Kubernetes infrastructure is connected to the MoJO Transit Gateway, which connects to:
- MoJ Cloud Platform
- MoJ Modernisation Platform
- HMCTS SDP
If you need additional connectivity, submit a feature request.
Please note: You cannot use Analytical Platform Airflow for pipelines using BashOperator or PythonOperator.
Concepts
We organise Airflow pipelines using environments, projects and workflows:
- Environments are the different stages of infrastructure we provide: development, test and production. Please note: development is not connected to the MoJO Transit Gateway.
- Projects are a unit for grouping workflows dedicated to a distinct business domain, service area, or specific project, for example: BOLD, HMCTS or HMPPS.
- Workflows are pipelines, also known as DAGs. They consist of a list of tasks organised to reflect the relationships between them. The workflow definition includes additional information, such as your repository name and release tag.
Getting started
Before you can use Airflow, you’ll need to request Airflow access, create a GitHub repository, add your code to it, create a release, and create a project and workflow. Follow the next steps to get started.
Request Airflow access
To access the Airflow components, you’ll need to:
- have a GitHub account (see our Quickstart guide)
- join the ministryofjustice GitHub organisation

When you have joined the ministryofjustice GitHub organisation, submit a request for Airflow access.
After your request is granted, you will be added to a GitHub team that gives you access to our GitHub repository and AWS environments.
Please note: our team manually approves requests. Once approved, it can take up to three hours to gain access to AWS.
Create a GitHub repository
Even if you already have a repository you’ve used for Airflow, you should create a new one.
Create a repository using one of the provided runtime templates:
- Python
- R (coming soon)

You can create this repository in either the ministryofjustice or moj-analytical-services GitHub organisation. Repository standards, such as branch protection, are out of scope for this guidance.

For more information on runtime templates, please refer to runtime templates.
Add your code to the repository
Update the Dockerfile instructions to copy your code into the image and install the packages required to run it.

For more information on runtime images, please refer to runtime images.
Create a release
Follow GitHub’s documentation on creating a release
After you’ve created a release, check if your container image has been successfully built and published by logging in to the Analytical Platform Common Production AWS account
You can also see our example repository.
Create a project and workflow
To initialise a project, create a directory in the relevant environment in our repository, for example, environments/development/analytical-platform.
To create a workflow, you need to provide us with a workflow manifest in your project.
This manifest specifies the desired state for the Airflow DAG, and provides contextual information used to categorise and label the DAG.
For example, create environments/development/analytical-platform/example/workflow.yml, where example is an identifier for your workflow’s name.
The minimum requirements for a workflow manifest look like this:
dag:
repository: moj-analytical-services/analytical-platform-airflow-python-example
tag: 2.0.0
maintainers:
- jacobwoffenden
tags:
business_unit: Central Digital
owner: analytical-platform@justice.gov.uk
- dag.repository is the name of the GitHub repository where your code is stored
- dag.tag is the tag you used when creating a release in your GitHub repository
- maintainers is a list of GitHub usernames of the individuals responsible for maintaining the workflow
- tags.business_unit must be one of Central Digital, CICA, HMCTS, HMPPS, HQ, LAA, OPG, Platforms, or Technology Services
- tags.owner must be an email address ending with @justice.gov.uk
Workflow scheduling
There are several options for configuring your workflow’s schedule. By default, if no options are specified, you must manually trigger it in the Airflow console.
The following options are available under dag:
- catchup: please refer to Airflow’s guidance (defaults to false)
- depends_on_past: when set to true, task instances will run sequentially while relying on the previous task’s schedule to succeed (defaults to false)
- end_date: the timestamp (YYYY-MM-DD) that the scheduler won’t go beyond (defaults to null)
- is_paused_upon_creation: specifies if the DAG is paused when created for the first time (defaults to false)
- max_active_runs: maximum number of active workflow runs (defaults to 1)
- retries: the number of retries that should be performed before failing the task (defaults to 0)
- retry_delay: delay in seconds between retries (defaults to 300)
- schedule: cron expression that defines how often the workflow runs (defaults to null)
- start_date: the timestamp (YYYY-MM-DD) from which the scheduler will attempt to backfill (defaults to 2025-01-01)
The example-schedule workflow shows an example of setting some of the scheduling options:
dag:
repository: moj-analytical-services/analytical-platform-airflow-python-example
tag: 2.0.0
retries: 3
retry_delay: 150
schedule: "0 8 * * *"
Workflow tasks
Providing the minimum keys under dag will create a main task. This task will execute the entrypoint of your container and provide a set of default environment variables; for example, in development:
AWS_DEFAULT_REGION=eu-west-1
AWS_ATHENA_QUERY_EXTRACT_REGION=eu-west-1
AWS_DEFAULT_EXTRACT_REGION=eu-west-1
AWS_METADATA_SERVICE_TIMEOUT=60
AWS_METADATA_SERVICE_NUM_ATTEMPTS=5
AIRFLOW_ENVIRONMENT=DEVELOPMENT
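
Your task code can rely on these defaults rather than hard-coding configuration. As a minimal sketch, assuming boto3 is available in your image (boto3 itself is an assumption, not something documented here), the AWS SDK picks up the region and metadata-service settings from the environment automatically:

import os

import boto3

# AIRFLOW_ENVIRONMENT identifies which Airflow environment the task is running in
environment = os.environ.get("AIRFLOW_ENVIRONMENT", "DEVELOPMENT")

# boto3/botocore read AWS_DEFAULT_REGION, AWS_METADATA_SERVICE_TIMEOUT and
# AWS_METADATA_SERVICE_NUM_ATTEMPTS from the environment, so no region or retry
# settings need to be hard-coded here
s3 = boto3.client("s3")

print(f"Running in {environment} against region {s3.meta.region_name}")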
Environment variables
To pass extra environment variables, you can reference them in env_vars, like this:
dag:
repository: moj-analytical-services/analytical-platform-airflow-python-example
tag: 2.0.0
env_vars:
FOO: "bar"
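
Inside the container these appear as ordinary environment variables alongside the defaults listed earlier, so your task code can read them directly. A minimal Python sketch (FOO is just the illustrative variable from the manifest above):

import os

# Read the extra variable defined under env_vars in the workflow manifest
foo = os.environ["FOO"]  # raises KeyError if the variable is missing
airflow_env = os.environ.get("AIRFLOW_ENVIRONMENT", "unknown")

print(f"FOO={foo} (running in {airflow_env})")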
Compute profiles
We provide a mechanism for requesting minimum levels of CPU and memory from our Kubernetes cluster. You can additionally specify if your workflow should run on on-demand or can run on spot compute (which can be disrupted).
This is done using the compute_profile key, and by default (if not specified), your workflow task will use general-spot-1vcpu-4gb, which means:
- general: the compute fleet
- spot: the compute type
- 1vcpu: 1 vCPU is guaranteed
- 4gb: 4GB of memory is guaranteed
In addition to the general fleet, we also offer gpu, which provides your workflow with an NVIDIA GPU.
The full list of available compute profiles can be found here.
Multi-task
Workflows can also run multiple tasks, with dependencies on other tasks in the same workflow. To enable this, specify the tasks key, for example:
dag:
repository: moj-analytical-services/analytical-platform-airflow-python-example
tag: 2.0.0
env_vars:
FOO: "bar"
tasks:
init:
env_vars:
PHASE: "init"
phase-one:
env_vars:
PHASE: "one"
dependencies: [init]
phase-two:
env_vars:
PHASE: "two"
dependencies: [phase-one]
phase-three:
env_vars:
FOO: "baz"
PHASE: "three"
compute_profile: gpu-spot-1vcpu-4gb
dependencies: [phase-one, phase-two]
Tasks take the same keys (env_vars and compute_profile) and can also take dependencies, which can be used to make a task dependent on other tasks completing successfully.
You can define global environment variables under dag.env_vars, making them available in all tasks. You can then override these by specifying the same environment variable key in the task.
compute_profile can either be specified at dag.compute_profile to set it for all tasks, or at dag.tasks.{task_name}.compute_profile to override it for a specific task.
Workflow identity
By default, for each workflow, we create an associated IAM policy and IAM role in the Analytical Platform’s Data Production AWS account.
The name of your workflow’s role is derived from its environment, project, and workflow: airflow-${environment}-${project}-${workflow}.
To extend the permissions of your workflow’s IAM policy, you can do so under the top-level iam key in your workflow manifest, for example:
iam:
athena: write
bedrock: true
glue: true
kms:
- arn:aws:kms:eu-west-2:123456789012:key/mrk-12345678909876543212345678909876
s3_deny:
- mojap-compute-development-dummy/deny1/*
- mojap-compute-development-dummy/deny2/*
s3_read_only:
- mojap-compute-development-dummy/readonly1/*
- mojap-compute-development-dummy/readonly2/*
s3_read_write:
- mojap-compute-development-dummy/readwrite1/*
- mojap-compute-development-dummy/readwrite2/*
s3_write_only:
- mojap-compute-development-dummy/writeonly1/*
- mojap-compute-development-dummy/writeonly2/*
- iam.athena: can be read or write, to provide access to Amazon Athena
- iam.bedrock: when set to true, enables Amazon Bedrock access
- iam.glue: when set to true, enables AWS Glue access
- iam.kms: a list of KMS key ARNs used for encrypt and decrypt operations if objects are KMS encrypted
- iam.s3_deny: a list of Amazon S3 paths to deny access to
- iam.s3_read_only: a list of Amazon S3 paths to provide read-only access to
- iam.s3_read_write: a list of Amazon S3 paths to provide read-write access to
- iam.s3_write_only: a list of Amazon S3 paths to provide write-only access to
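
As an illustration of how a task might use these permissions, here is a minimal, hypothetical Python sketch. It assumes boto3 is installed in your image and reuses the placeholder bucket from the example manifest above; the buckets, keys and query are yours to substitute:

import boto3

# The task already runs as the workflow's IAM role
# (airflow-${environment}-${project}-${workflow}), so no credentials are
# configured here; access is governed by the iam block in the manifest.
s3 = boto3.client("s3")
athena = boto3.client("athena")

# Read an object from a prefix granted under iam.s3_read_only (placeholder path)
obj = s3.get_object(
    Bucket="mojap-compute-development-dummy",
    Key="readonly1/input.csv",
)
data = obj["Body"].read()

# Submit an Athena query; this relies on the iam.athena permission, and writes
# results to a prefix granted under iam.s3_read_write (placeholder path)
response = athena.start_query_execution(
    QueryString="SELECT 1",
    ResultConfiguration={
        "OutputLocation": "s3://mojap-compute-development-dummy/readwrite1/athena-results/"
    },
)
print(response["QueryExecutionId"])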
Advanced configuration
External IAM roles
If you would like your workflow’s identity to run in an account that is not Analytical Platform Data Production, you can provide the ARN using iam.external_role, for example:
iam:
external_role: arn:aws:iam::123456789012:role/this-is-not-a-real-role
You must have an IAM Identity Provider using the associated environment’s Amazon EKS OpenID Connect provider URL. Please refer to Amazon’s documentation. We can provide the Amazon EKS OpenID Connect provider URL upon request.
You must also create a role that is enabled for IRSA. We recommend using this Terraform module. You must use the following when referencing service accounts:
mwaa:${project}-${workflow}
Workflow secrets
To provide your workflow with sensitive information, such as a username, password or API key, you can pass a list of secret identifiers using the secrets key in your workflow manifest, for example:
secrets:
- username
- password
- api-key
This will create an encrypted secret in AWS Secrets Manager under the path /airflow/${environment}/${project}/${workflow}/${secret_id}, which you can populate with the secret value via the AWS console. The value will then be injected into your container as an environment variable, for example:
SECRET_USERNAME=xxxxxx
SECRET_PASSWORD=yyyyyy
Secret names with hyphens (-) will be converted to use underscores (_) for the environment variable.
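
For example, with the manifest above, your task code could read the injected values like this (a minimal sketch; it assumes api-key is exposed as SECRET_API_KEY, following the upper-casing shown for SECRET_USERNAME and the hyphen-to-underscore conversion):

import os

# Secret identifiers from the manifest are injected as SECRET_* environment variables
username = os.environ["SECRET_USERNAME"]
password = os.environ["SECRET_PASSWORD"]
api_key = os.environ["SECRET_API_KEY"]  # "api-key" with the hyphen converted to an underscore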
Updating a secret value
Secrets are initially created with a placeholder value. To update this, log in to the Analytical Platform Data Production AWS account and update the value.
Workflow notifications
Email

To enable email notifications, add the following to your workflow manifest:

notifications:
  emails:
    - analytical-platform@justice.gov.uk
    - data-platform@justice.gov.uk
Slack
To enable Slack notifications, you need to:

1. Add the following to your workflow manifest:

notifications:
  slack_channel: your-channel-name # e.g. analytical-platform

2. Invite Analytical Platform’s Slack application (@Analytical Platform) to your channel
Workflow logs and metrics
This functionality is coming soon
Accessing the Airflow console
To access the Airflow console, you can use these links:
Runtime templates
We provide repository templates for the supported runtimes:
- Python
- R (coming soon)
These templates include:
- GitHub Actions workflow to build and scan your container for vulnerabilities with Trivy
- GitHub Actions workflow to build and test your container’s structure
- GitHub Actions workflow to perform a dependency review of your repository, if it’s public
- GitHub Actions workflow to build and push your container to the Analytical Platform’s container registry
- Dependabot configuration for updating GitHub Actions, Docker, and dependencies such as Pip
The GitHub Actions workflows call shared workflows we maintain here.
Vulnerability scanning
The GitHub Actions workflow builds and scans your container for vulnerabilities with Trivy, alerting you to any CVEs (Common Vulnerabilities and Exposures) marked as HIGH or CRITICAL that have a fix available. You will need to either update the offending package or skip the CVE by adding it to .trivyignore in the root of your repository.
Configuration testing
To ensure your container is running as the right user, we perform a test using Google’s Container Structure Test tool.
The source for the test can be found here.
Runtime images
We provide container images for the supported runtimes:
- Python
- R (coming soon)
These images include:
- Ubuntu base image
- AWS CLI
- NVIDIA GPU drivers
Additionally, we create a non-root user (analyticalplatform) and a working directory (/opt/analyticalplatform).
Migration from Data Engineering Airflow
TBC
Getting help
If you have any questions about Analytical Platform Airflow, please reach out to us on Slack in the #ask-analytical-platform channel.
For assistance, you can raise a support issue.