Image Pipeline

These instructions show you how to use the template GitHub repos to build a Python or R image and save it to the Data Engineering ECR.

  1. Create a new GitHub repo using:

    • template-airflow-python if creating a Python codebase
    • template-airflow-R if creating an R codebase

    When creating your repo from the template of your choice, make sure moj-analytical-services is set as the owner of the repo. If your own GitHub account is left as the owner, the GitHub Action used to publish your image will fail when it runs.

  2. The image will have the same name as the repo, so make sure the name is appropriate and reflects the pipeline you intend to run. If you are creating an example pipeline, call it airflow-{username}-example

  3. Review the scripts/run.py or scripts/run.R file. This has some code to write and copy to S3. Leave as-is if you are creating an example pipeline. Otherwise replace with your own logic (see Tips on writing the code)

  4. Review the Dockerfile and the parent image and update as necessary (see Dockerfile). Leave as-is if creating the example pipeline

  5. Review the requirements.txt file and update as necessary. Leave as-is if creating the example pipeline. See venv for more details

  6. (For R images only) Review the renv.lock file and update as necessary. Leave as-is if creating the example pipeline. See Renv for more details

  7. Create a tag and release, ensuring Target is set to the main branch. Set the tag and release to v0.0.1 if you are creating an example pipeline

  8. Go to the Actions tab, where you should see the “Build, tag, push, and make available image to pods” action running. Make sure the action passes, otherwise the image will not be built

  9. If you have permission, log in to ECR and search for your image and tag

Tips on writing the code

You can create scripts in any programming language, including R and Python. You may want to test your scripts in RStudio or JupyterLab on the Analytical Platform before running them as part of a pipeline.

All Python scripts in your Airflow repository should be formatted according to flake8 rules. flake8 is a code linter that analyses your Python code and flags bugs, programming errors and stylistic errors.

You can automatically format your code using tools like black, autopep8 and yapf. These tools are often able to resolve most formatting issues.

Environment Variables

You can use environment variables to pass values in to the Docker container. We tend to write their names in capitals to show that they will be passed in as environment variables.

You can use the same Docker image for multiple tasks by passing in an environment variable. In the use_kubernetes_pod_operator.py example, we pass in “write” and “copy” as environment variables to first write to S3 and then copy the file across, using the same image.
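
As a minimal sketch of this pattern, the script below picks which task to run from an environment variable. The variable name TASK and the functions write_to_s3 and copy_in_s3 are hypothetical placeholders, not names taken from the template repos, and the S3 work is replaced by stubs that only return a description.

```python
import os


def write_to_s3():
    # Placeholder: a real pipeline would write a file to S3 here.
    return "wrote file"


def copy_in_s3():
    # Placeholder: a real pipeline would copy the file within S3 here.
    return "copied file"


# Map each allowed value of the (hypothetical) TASK variable to a function.
TASKS = {"write": write_to_s3, "copy": copy_in_s3}


def run_task(environ=os.environ):
    """Run the task named by the TASK environment variable."""
    task = environ.get("TASK", "")
    if task not in TASKS:
        raise ValueError(f"TASK must be one of {sorted(TASKS)}, got {task!r}")
    return TASKS[task]()
```

With this layout, one image can serve both steps: a container started with TASK=write runs the write step, and a second container started with TASK=copy runs the copy step. Failing loudly on an unknown value makes a misconfigured task easier to spot in the logs.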

Dockerfile

A Dockerfile is a text file that contains the commands used to build a Docker image. It starts with a FROM directive, which specifies the parent image that your image is based on. Each subsequent declaration in the Dockerfile modifies this parent image. We have a range of parent images to choose from; get in touch if the available images do not meet your requirements.

You can use venv, conda, packrat, renv or other package management tools to capture the dependencies required by your pipeline. If using one of these tools, you will need to update the Dockerfile to install required packages correctly.

(For R images only) If you’re not using Python at all, for example if you’re using Rdbtools and Rs3tools rather than dbtools and botor, then replace Dockerfile with Dockerfile.nopython and delete the requirements.txt file. You can do this by running:

mv Dockerfile Dockerfile.backup
cp Dockerfile.nopython Dockerfile
rm requirements.txt

Test Docker image (optional)

If you have a MacBook, you can use Docker locally to build and test your Docker image. You can download Docker Desktop for Mac here.

To build and test your Docker image locally, follow the steps below:

  1. Clone your Airflow repository to a new folder on your MacBook – this guarantees that the Docker image will be built using the same code as on the Analytical Platform. You may need to create a new connection to GitHub with SSH.

  2. Open a terminal session and navigate to the directory containing the Dockerfile using the cd command.

  3. Build the Docker image by running:

    docker build . -t IMAGE:TAG
    

    where IMAGE is a name for the image, for example, my-docker-image, and TAG is the version number, for example, v0.1.

  4. Run a Docker container created from the Docker image by running:

    docker run IMAGE:TAG
    

    This will run the command specified in the CMD line of the Dockerfile. The command will fail if it requires access to resources on the Analytical Platform, such as data stored in Amazon S3, unless the correct environment variables are passed to the Docker container. You need the following environment variables to ensure correct access to all the AP resources:

    docker run \
        --env AWS_REGION=$AWS_REGION \
        --env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION \
        --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
        --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
        --env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
        --env AWS_SECURITY_TOKEN=$AWS_SECURITY_TOKEN \
        IMAGE:TAG
    

    Other environment variables such as PYTHON_SCRIPT_NAME or R_SCRIPT_NAME can be passed in the same way.
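
If you run the container often, the docker run command above can be assembled by a small helper. This is only a sketch: the function name docker_run_command and its parameters are illustrative, and it assumes the AWS variables are already set in your shell. It relies on Docker's behaviour that --env NAME, given without a value, passes through the current value of NAME, so secrets do not need to appear in the command itself.

```python
import os

# Variables listed in the docker run example above.
AWS_VARS = [
    "AWS_REGION",
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_SECURITY_TOKEN",
]


def docker_run_command(image, extra_vars=(), environ=os.environ):
    """Return the docker run argument list, skipping variables that are unset."""
    command = ["docker", "run"]
    for name in [*AWS_VARS, *extra_vars]:
        if name in environ:
            # "--env NAME" with no value makes Docker take the value of
            # NAME from the environment that docker itself is run in.
            command += ["--env", name]
    command.append(image)
    return command
```

For example, docker_run_command("my-docker-image:v0.1", extra_vars=["PYTHON_SCRIPT_NAME"]) returns a list that can be passed to subprocess.run. If you supply your own environ mapping, pass the same mapping as env= to subprocess.run so that Docker sees the same values.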

You can start a bash session in a running Docker container for debugging and troubleshooting purposes by running:

docker run -it IMAGE:TAG bash

This page was set to be reviewed before 7 July 2022 by the page owner #ask-data-engineering. This might mean the content is out of date.