
Security in GitHub

Protecting information in GitHub

GitHub should primarily be used to store code.

Generally, you should not store any data in GitHub, especially sensitive data. Instead, you should use warehouse data sources.

You can use the following approaches to reduce the risk of accidentally publishing sensitive data to GitHub. See also repository visibility and managing access in GitHub.

 
| What | How it’s configured | Reasoning | How to override |
| --- | --- | --- | --- |
| Publishing data files (.csv, .xlsx, etc.) | .gitignore file | You should not store data in GitHub. | Manually add the file using git add -f <filename> |
| Publishing file archives (.zip, .tar, .7z, .gz, .bz, .rar, etc.) | .gitignore file | It’s better to unzip file archives and commit their raw contents so files can be tracked individually. This helps prevent data from being accidentally published within a file archive. | Manually add the file using git add -f <filename> |
| Publishing large files (>5MB) | Pre-commit hook | Large files are likely to be data. | Skip pre-commit hooks using git commit --no-verify |
| Publishing Jupyter notebooks | nbstripout as a pre-commit hook | Jupyter notebook outputs often contain data. | Disable nbstripout using ENABLE_NBSTRIPOUT=false; git commit |
| Pushing to repositories outside the MoJ Analytical Services GitHub organisation | Pre-push hook | You should only store code in the MoJ Analytical Services GitHub organisation. | Force push using git push -f <remote> <branch> |
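
To illustrate the kind of check the file-type and large-file rules perform, here is a minimal sketch in Python. It is not the hook used on the platform. The function names `check_file` and `check_files` are hypothetical, and the 5 MB limit and extension list are taken from the table above on the assumption that a hook would apply them in this way.

```python
from pathlib import Path

# Assumed values for illustration only: the limit and extensions mirror the
# table above, but the real hooks and gitignore rules may differ.
MAX_BYTES = 5 * 1024 * 1024
BLOCKED_SUFFIXES = {".csv", ".xlsx", ".zip", ".tar", ".7z", ".gz", ".bz", ".rar"}


def check_file(path, max_bytes=MAX_BYTES):
    """Return the reasons why `path` should not be committed (empty list if none)."""
    path = Path(path)
    reasons = []
    # Data files and archives are identified by their final extension.
    if path.suffix.lower() in BLOCKED_SUFFIXES:
        reasons.append(f"{path.name}: blocked file type {path.suffix.lower()}")
    # Large files are likely to be data, whatever their extension.
    if path.is_file() and path.stat().st_size > max_bytes:
        reasons.append(f"{path.name}: larger than {max_bytes} bytes")
    return reasons


def check_files(paths, max_bytes=MAX_BYTES):
    """Collect the problems found across every path in `paths`."""
    problems = []
    for p in paths:
        problems.extend(check_file(p, max_bytes))
    return problems
```

A hook script could pass the staged file names (for example, the output of `git diff --cached --name-only`) to `check_files` and exit with a non-zero status if the returned list is not empty.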
 

You should also not store secrets in GitHub, including passwords, credentials and keys. You can use parameters to securely store secrets on the Analytical Platform.
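
As a rough sketch of keeping secrets out of committed code, the example below reads a secret at runtime instead of hard-coding it. It assumes the secret is made available to your code as an environment variable. This page does not describe how Analytical Platform parameters are exposed, so treat that as an assumption. The helper `get_secret` and the name `DB_PASSWORD` are hypothetical.

```python
import os


def get_secret(name, environ=None):
    """Read a secret from the environment rather than from the repository.

    Assumption: the secret is exposed as an environment variable called `name`.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        # Fail loudly instead of falling back to a value stored in code.
        raise RuntimeError(f"Secret {name!r} is not set")
    return value
```

For example, `get_secret("DB_PASSWORD")` returns the value if it is set and raises an error otherwise, so no default password ever needs to appear in the repository.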

Accidentally publishing data to GitHub

If you accidentally publish sensitive data to GitHub, you should:

If you need further support, contact the Data Engineering team in the #ask-data-engineering Slack channel.

This page was last reviewed on 30 January 2023. It needs to be reviewed again on 30 January 2024 by the page owner #analytical-platform-support.