Security in GitHub
Protecting information in GitHub
GitHub should primarily be used to store code.
Generally, you should not store any data in GitHub, especially sensitive data. Instead, you should use warehouse data sources.
| What | How it’s configured | Reasoning | How to override |
| --- | --- | --- | --- |
| Publishing data files (.csv, .xlsx, etc.) | .gitignore file | You should not store data in GitHub. | Manually add the file using |
| Publishing file archives (.zip, .tar, .7z, .gz, .bz, .rar, etc.) | .gitignore file | It’s better to unzip file archives and commit their raw contents so files can be tracked individually. This helps prevent data from being accidentally published within a file archive. | Manually add the file using |
| Publishing large files (>5 MB) | Pre-commit hook | Large files are likely to be data. | Do not run pre-commit hooks using |
| Publishing Jupyter notebooks | nbstripout as a pre-commit hook | Jupyter notebook outputs often contain data. | Disable nbstripout using |
| Pushing to repositories outside the MoJ Analytical Services GitHub organisation | Pre-push hook | You should only store code in the MoJ Analytical Services GitHub organisation. | Force push using |
You should also not store secrets in GitHub, including passwords, credentials and keys. You can use parameters to securely store secrets on the Analytical Platform.
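As a general illustration of keeping secrets out of committed code, the sketch below reads a secret at run time instead of hard-coding it. It assumes the secret is made available to your code as an environment variable, which may not be how Analytical Platform parameters are exposed; the helper `get_secret` and the variable name `DB_PASSWORD` are hypothetical.

```python
import os


def get_secret(name):
    """Read a secret from an environment variable, failing loudly if it is unset."""
    value = os.environ.get(name)
    if not value:
        # Fail early rather than continuing with a missing or empty secret.
        raise RuntimeError(f"Secret {name!r} is not set in the environment")
    return value


# Hypothetical usage: the password never appears in the repository.
# password = get_secret("DB_PASSWORD")
```

Because the value is looked up when the code runs, nothing sensitive is written into a file that Git tracks.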
Accidentally publishing data to GitHub
If you accidentally publish sensitive data to GitHub, you should:
- follow the GitHub guidance on removing sensitive data from a repository
- report a security incident and follow any instructions given by the security team
If you need further support, contact the Data Engineering team in the #ask-data-engineering Slack channel.