Airflow
This documentation is for the new Analytical Platform Airflow service.
For Data Engineering Airflow, please refer to Data Engineering Airflow.
Overview
Apache Airflow is a workflow management platform. Analytical Platform users primarily use it for:
- automating data engineering pipelines
- training machine learning models
- reproducible analytical pipelines (RAP)
We recommend using it for long-running or compute-intensive tasks.
Workflows are executed on the Analytical Platform’s Kubernetes infrastructure and can interact with services such as Amazon Athena, Amazon Bedrock, and Amazon S3.
Our Kubernetes infrastructure is connected to the MoJO Transit Gateway, which connects to:
- MoJ Cloud Platform
- MoJ Modernisation Platform
- HMCTS SDP
If you need additional connectivity, submit a feature request.
Please note: You cannot use Analytical Platform Airflow for workflows using BashOperator or PythonOperator.
Concepts
Analytical Platform Airflow is made up of environments, projects and workflows:
- Environments are the different stages of infrastructure we provide: development, test and production.
Please note: development is not connected to the MoJO Transit Gateway.
- Projects are a unit for grouping workflows dedicated to a distinct business domain, service area, or specific project, for example: BOLD, HMCTS or HMPPS.
- Workflows are pipelines, also known as directed acyclic graphs (DAGs). They consist of a list of tasks organised to reflect the relationships between them. The workflow definition includes additional information, such as your repository name and release tag.
Getting started
Before you can use Airflow, you’ll need to:
- request Airflow access
- create a GitHub repository
- create a GitHub release
- create an Airflow pipeline project and workflow
Follow the steps below to get started.
Request Airflow access
To access the Airflow components, you’ll need to:
- have a GitHub account (see our Quickstart guide)
- join the ministryofjustice GitHub organisation
If you are a member of Data Engineering’s GitHub team (@ministryofjustice/data-engineering), you are automatically granted access and do not need to submit a request.
When you have joined the ministryofjustice GitHub organisation, submit a request for Airflow access.
After your request is granted, you will be added to a GitHub team that gives you access to our GitHub repository and AWS environments.
Our team manually approves requests. Once approved, it can take up to three hours to gain access to AWS.
Create a GitHub repository
If you already have a repository you’ve used for Data Engineering Airflow, please refer to migrating from Data Engineering Airflow.
1. Create a repository using one of the provided runtime templates.
You can create this repository in either the ministryofjustice or moj-analytical-services GitHub organisation. Repository standards, such as branch protection, are out of scope for this guidance.
For more information on runtime templates, please refer to runtime templates.
2. Add your code to the repository, including the script(s) you want Airflow to run and a file for your package management.
3. Update the Dockerfile instructions to copy your code into the image, install the packages required to run, and call the script(s) to run. For example, for Python:
FROM ghcr.io/ministryofjustice/analytical-platform-airflow-python-base:1.7.0@sha256:5de4dfa5a59c219789293f843d832b9939fb0beb65ed456c241b21928b6b8f59
USER root
COPY requirements.txt requirements.txt
COPY src/ .
RUN pip install -r requirements.txt
USER ${CONTAINER_UID}
ENTRYPOINT ["python3", "main.py"]
For more information on runtime images, please refer to runtime images.
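As an illustrative sketch (not part of the template), the main.py called by the ENTRYPOINT above could be as simple as:

import os

def main():
    # AIRFLOW_ENVIRONMENT is one of the default variables injected into every task
    print(f"Running in {os.environ.get('AIRFLOW_ENVIRONMENT', 'unknown')}")

if __name__ == "__main__":
    main()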
Create a GitHub release
Follow GitHub’s documentation on creating a release. Make note of the release tag.
After you’ve created a release, check if your container image has been successfully built and published by logging in to the Analytical Platform Common Production AWS account.
You can also see our example Python repository.
Create an Airflow pipeline project and workflow
To initialise an Airflow pipeline project, create a directory in the relevant environment in the Airflow repository, for example, environments/development/analytical-platform.
To create an Airflow pipeline workflow (a DAG), you need to provide a workflow manifest file (workflow.yml) in your project under a workflow identifier name.
This manifest file specifies the desired state for the workflow, and provides contextual information used to categorise and label the workflow.
For example, create environments/development/analytical-platform/example/workflow.yml, where example is an identifier for your workflow’s name.
The minimum requirements for a workflow manifest look like this:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
maintainers:
  - jacobwoffenden
tags:
  business_unit: Central Digital
  owner: analytical-platform@justice.gov.uk
- dag.repository is the name of the GitHub repository where your code is stored and your release has been created
- dag.tag is the tag you used when creating the release in your GitHub repository
- maintainers are a list of GitHub usernames of individuals responsible for maintaining the workflow and updating any secret values
- tags.business_unit must be one of Central Digital, CICA, HMCTS, HMPPS, HQ, LAA, OPG, Platforms, Technology Services
- tags.owner must be an email address ending with @justice.gov.uk
Providing a tags.business_unit other than Central Digital, CICA, HMCTS, HMPPS, HQ, LAA, OPG, Platforms, or Technology Services will result in an error.
Workflow scheduling
There are several options for configuring your workflow’s schedule. By default, if no options are specified, you must trigger your workflow manually in the Airflow console.
The following options are available under dag:
- catchup: please refer to Airflow’s guidance (defaults to false)
- depends_on_past: when set to true, task instances will run sequentially while relying on the previous task’s schedule to succeed (defaults to false)
- end_date: the timestamp (YYYY-MM-DD) that the scheduler won’t go beyond (defaults to null)
- is_paused_upon_creation: specifies if the dag is paused when created for the first time (defaults to false)
- max_active_runs: maximum number of active workflow runs (defaults to 1)
- retries: the number of retries that should be performed before failing the task (defaults to 0)
- retry_delay: delay in seconds between retries (defaults to 300)
- schedule: cron expression that defines how often the workflow runs (defaults to null)
- start_date: the timestamp (YYYY-MM-DD) from which the scheduler will attempt to backfill (defaults to 2025-01-01)
The example-schedule workflow shows an example of a workflow that runs at 08:00 every day and retries 3 times, with a 150-second delay between each retry:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  retries: 3
  retry_delay: 150
  schedule: "0 8 * * *"
Workflow tasks
Providing the minimum keys under dag will create a main task. This task will execute the entrypoint of your container and provide a set of default environment variables; for example, in development:
AWS_DEFAULT_REGION=eu-west-1
AWS_ATHENA_QUERY_EXTRACT_REGION=eu-west-1
AWS_DEFAULT_EXTRACT_REGION=eu-west-1
AWS_METADATA_SERVICE_TIMEOUT=60
AWS_METADATA_SERVICE_NUM_ATTEMPTS=5
AIRFLOW_ENVIRONMENT=DEVELOPMENT
Environment variables
To pass extra environment variables, you can reference them in env_vars, like this:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  env_vars:
    x: "1"
Compute profiles
We provide a mechanism for requesting minimum levels of CPU and memory from our Kubernetes cluster. You can additionally specify whether your workflow must run on on-demand compute or can run on spot compute (which can be disrupted).
This is done using the dag.compute_profile key. By default (if not specified), your workflow task will use general-spot-1vcpu-4gb, which means:
- general: the compute fleet
- spot: the compute type
- 1vcpu: 1 vCPU is guaranteed
- 4gb: 4GB of memory is guaranteed
In addition to the general fleet, we also offer gpu, which provides your workflow with an NVIDIA GPU pre-installed with CUDA.
The full list of available compute profiles can be found here.
Analytical Platform tooling (such as JupyterLab, RStudio and Visual Studio Code) has access to 1 vCPU and 12GB RAM. The closest compute profile is general-on-demand-4vcpu-16gb.
Multi-task
Workflows can also run multiple tasks, with dependencies on other tasks in the same workflow. To enable this, specify the tasks key, for example:
dag:
  repository: moj-analytical-services/analytical-platform-airflow-python-example
  tag: 2.0.0
  env_vars:
    x: "1"
  tasks:
    init:
      env_vars:
        y: "0"
    phase-one:
      env_vars:
        y: "1"
      compute_profile: general-spot-2vcpu-8gb
      dependencies: [init]
    phase-two:
      env_vars:
        y: "2"
      compute_profile: gpu-spot-1vcpu-4gb
    phase-three:
      env_vars:
        x: "2"
        y: "3"
      dependencies: [phase-one, phase-two]
Tasks take the same keys (env_vars and compute_profile) and can also take dependencies, which can be used to make a task dependent on other tasks completing successfully.
You can define global environment variables under dag.env_vars, making them available in all tasks. You can then override these by specifying the same environment variable key in the task.
compute_profile can either be specified at dag.compute_profile to set it for all tasks, or at dag.tasks.{task_name}.compute_profile to override it for a specific task.
Workflow identity
By default, for each workflow, we create an associated IAM policy and IAM role in the Analytical Platform’s Data Production AWS account.
The name of your workflow’s role is derived from its environment, project, and workflow: airflow-${environment}-${project}-${workflow}.
To extend the permissions of your workflow’s IAM policy to include access to Athena, Bedrock, Glue, KMS ARNs and/or S3 buckets, use the top-level iam key in your workflow manifest, for example:
iam:
  athena: write
  bedrock: true
  glue: true
  kms:
    - arn:aws:kms:eu-west-2:123456789012:key/mrk-12345678909876543212345678909876
  s3_deny:
    - mojap-compute-development-dummy/deny1/*
    - mojap-compute-development-dummy/deny2/*
  s3_read_only:
    - mojap-compute-development-dummy/readonly1/*
    - mojap-compute-development-dummy/readonly2/*
  s3_read_write:
    - mojap-compute-development-dummy/readwrite1/*
    - mojap-compute-development-dummy/readwrite2/*
  s3_write_only:
    - mojap-compute-development-dummy/writeonly1/*
    - mojap-compute-development-dummy/writeonly2/*
- iam.athena: can be read or write, to provide access to Amazon Athena
- iam.bedrock: when set to true, enables Amazon Bedrock access
- iam.glue: when set to true, enables AWS Glue access
- iam.kms: a list of KMS ARNs used for encrypt and decrypt operations if objects are KMS encrypted
- iam.s3_deny: a list of Amazon S3 paths to deny access to
- iam.s3_read_only: a list of Amazon S3 paths to provide read-only access to
- iam.s3_read_write: a list of Amazon S3 paths to provide read-write access to
- iam.s3_write_only: a list of Amazon S3 paths to provide write-only access to
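Your task’s code does not need to assume this role explicitly; the role’s credentials are made available to the task automatically, so AWS SDKs pick them up without extra configuration. A minimal sketch in Python with boto3, reusing the dummy bucket paths from the example above:

import boto3

s3 = boto3.client("s3")  # credentials for the workflow's IAM role are discovered automatically
obj = s3.get_object(
    Bucket="mojap-compute-development-dummy",  # dummy bucket from the example manifest
    Key="readonly1/input.csv",  # hypothetical object under a read-only path
)
data = obj["Body"].read()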
Advanced configuration
External IAM roles
If you would like your workflow’s identity to run in an account that is not Analytical Platform Data Production, you can provide the ARN using iam.external_role, for example:
iam:
  external_role: arn:aws:iam::123456789012:role/this-is-not-a-real-role
You must have an IAM Identity Provider using the associated environment’s Amazon EKS OpenID Connect provider URL. Please refer to Amazon’s documentation. We can provide the Amazon EKS OpenID Connect provider URL upon request.
You must also create a role that is enabled for IRSA. We recommend using this Terraform module.
You must use the following when referencing service accounts:
mwaa:${project}-${workflow}
Workflow secrets
To provide your workflow with sensitive information, such as a username, password or API key, you can pass a list of secret identifiers using the secrets key in your workflow manifest, for example:
secrets:
  - username
  - password
  - api-key
This will create an encrypted secret in AWS Secrets Manager under the following path: /airflow/${environment}/${project}/${workflow}/${secret_id} (which you can populate with the secret value via the AWS console). The value will then be injected into your container as an environment variable, for example:
SECRET_USERNAME=xxxxxx
SECRET_PASSWORD=yyyyyy
SECRET_API_KEY=zzzzzz
Secret names with hyphens (-) will be converted to use underscores (_) for the environment variable.
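Reading a secret in your task’s code is then plain environment variable access; for example, in Python:

import os

api_key = os.environ["SECRET_API_KEY"]  # populated from the api-key secret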
Updating a secret value
Secrets are initially created with a placeholder value. To update this, your GitHub username must be listed in the maintainers section of the workflow manifest file; you can then log in to the Analytical Platform Data Production AWS account and update the value.
Workflow notifications
To enable email notifications, add the following to your workflow manifest:
notifications:
  emails:
    - analytical-platform@justice.gov.uk
    - data-platform@justice.gov.uk
Slack
To enable Slack notifications, you need to:
1. Add the following to your workflow manifest:
notifications:
  slack_channel: your-channel-name # e.g. analytical-platform
2. Invite Analytical Platform’s Slack application (@Analytical Platform) to your channel
Workflow logs and metrics
This functionality is coming soon.
Accessing the Airflow console
To access the Airflow console, you can use these links:
Runtime templates
We provide repository templates for the supported runtimes:
These templates include:
- GitHub Actions workflow to build and scan your container for vulnerabilities with Trivy
- GitHub Actions workflow to build and test your container’s structure
- GitHub Actions workflow to perform a dependency review of your repository, if it’s public
- GitHub Actions workflow to build and push your container to the Analytical Platform’s container registry
- Dependabot configuration for updating GitHub Actions, Docker, and dependencies such as Pip
The GitHub Actions workflows call shared workflows we maintain here.
Vulnerability scanning
The GitHub Actions workflow builds and scans your container for vulnerabilities with Trivy, alerting you to any CVEs (Common Vulnerabilities and Exposures) marked as HIGH or CRITICAL that have a fix available. You will need to either update the offending package or skip the CVE by adding it to .trivyignore in the root of your repository.
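For example, a .trivyignore file is just a list of CVE identifiers, one per line (the ID below is hypothetical):

# Accept this vulnerability until a fix is released
CVE-2099-12345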
Configuration testing
To ensure your container is running as the right user, we perform a test using Google’s Container Structure Test tool.
The source for the test can be found here.
Runtime images
We provide container images for the supported runtimes:
These images include:
- AWS CLI
- NVIDIA GPU drivers
Additionally, we create a non-root user (analyticalplatform) and a working directory (/opt/analyticalplatform).
Installing system packages
Our runtime images are set to run as a non-root user (analyticalplatform), which cannot install system packages.
To install system packages, you will need to switch to root, perform any installations, and switch back to analyticalplatform, for example:
FROM ghcr.io/ministryofjustice/analytical-platform-airflow-python-base:1.6.0
# Switch to root
USER root
RUN <<EOF
apt-get update # Refresh APT package lists
apt-get install --yes ${PACKAGE} # Install packages
apt-get clean --yes # Clear APT cache
rm --force --recursive /var/lib/apt/lists/* # Clear APT package lists
EOF
# Switch back to analyticalplatform
USER ${CONTAINER_UID}
Migrating from Data Engineering Airflow
GitHub repository
If you have an existing repository that was created using moj-analytical-services/template-airflow-python or moj-analytical-services/template-airflow-r, you need to perform the following actions:
1. Remove .github/workflows/ecr_push.yml
2. Add the GitHub Actions workflows (.github/workflows) from the equivalent runtime template
3. Add the Dependabot configuration (.github/dependabot.yml) from the equivalent runtime template
4. Refactor your Dockerfile to consume the equivalent runtime image
Refactoring your Dockerfile may cause issues, as the legacy templates contained older versions of Python and R, did not provide a non-root user, and used a different working directory. We maintain a repository that can serve as a reference for how to use our runtime image; you can find it here.
Please note: We require that you use our runtime images as we regularly update the operating system and software.
Airflow configuration
IAM
We do not provide a way of reusing the IAM role from Data Engineering Airflow. You will need to populate iam with the same configuration and update any external references to use the new role format; please refer to workflow identity.
Secrets
We do not provide a way of reusing secrets or parameters from Data Engineering Airflow. You will need to follow workflow secrets and update your application code to consume the injected variables, or retrieve the value from AWS Secrets Manager (AWS documentation).
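If you choose to retrieve values from AWS Secrets Manager directly instead of using the injected variables, a minimal sketch in Python with boto3 (the path segments below are illustrative):

import boto3

client = boto3.client("secretsmanager")
# Path format: /airflow/${environment}/${project}/${workflow}/${secret_id}
response = client.get_secret_value(
    SecretId="/airflow/development/analytical-platform/example/api-key"
)
api_key = response["SecretString"]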
Getting help
If you have any questions about Analytical Platform Airflow, please reach out to us on Slack in the #ask-analytical-platform channel.
For assistance, you can raise a support issue.