Data FAQs

What types of data are on the Platform?

Data on the Analytical Platform can largely be split into four categories:

Raw: data that has been uploaded to the Analytical Platform without any changes made to it
Curated: data that has been validated, deduplicated and versioned by the Data Engineers ready to be used by Analytical Platform users
Derived: data that has been denormalized and turned into a data model to fit the general needs of Analytical Platform users
Processed: data that has been processed by Analytical Platform users to fit their specific needs

Where do I find out what data is already on the Platform?

Check Find MoJ data to find curated and derived databases maintained by the Data Engineering and Modelling Team (DMET).

Users can also create their own S3 buckets which may have processed data useful to other teams. Ask on Slack to see if there’s an existing dataset for your needs.

You can use the data discovery tool to understand how to use maintained databases. For more information about databases, visit #ask-data-engineering for curated and #ask-data-modelling for derived.

How do I gain access to existing data?

Access to databases is granted via the database access repository. Have a read through the guidance on there. If you are finding the process a little tricky, please ask for help in #ask-data-engineering. A data engineer will happily guide you through things.

If you are looking for access to a user-created bucket, then the admin of that bucket should be able to grant you access. If you don’t know who the admin is, or they are not able to grant you access, then raise a ticket via the #analytical-platform-support Slack channel or via GitHub. If the bucket admin is unavailable, the Analytical Platform team will need to receive approval from your line manager before you can be given any access to the bucket.

Where should I store my own data?

⚠️ We can only hold 1,000 S3 buckets on the Analytical Platform. Before you create a new secure data storage folder, please check whether you can re-use an existing one, or create a domain-level S3 bucket and use folders and path-specific access for project level access.

Data should be stored in a Warehouse data source (a folder within an S3 bucket with managed access). You can create a new secure data storage folder through the AP Control Panel ‘Warehouse Data’ page via the ‘Create new warehouse data source’ button. Data can be uploaded manually via the AWS console (which can be accessed through the Control Panel) or you can write to it programmatically using an integrated development environment (IDE) such as RStudio, Visual Studio Code or JupyterLab - see this section for more details.

If your data contains anything that could be considered personal information, you must follow guidance from the data protection team which can be found on the intranet.

How do I read/write data from an s3 bucket?

Python: You can read/write directly from S3 using a variety of tools such as pandas, boto3, aws wrangler. However, to get the best representation of the column types in the resulting Pandas dataframe(s), you may wish to use mojap-arrow-pd-parser.

R: Whilst initially the recommended package was botor, you are also encouraged to try out Rs3tools a community-maintained, R-native version of S3tools that removes some of the complexity around using Python.

How do I query a database on the Platform?

Databases on the AP use Amazon Athena which allow you to query data using SQL. You shouldn’t need to know about Athena in detail to query databases on the AP, but if you are interested you may wish to read more about it. There are three ways you can query data (there is more detail on all three of these in Data section of this guidance):

The Amazon Athena workbench: If you log into the AWS console and click Services -> Athena, you’ll see the Athena workbench. This is good for testing your queries.

Create a Derived Table: You can also use Create a Derived Table to run SQL statements to query Athena.

If you get an error like:

assumed-role/... is not authorized to perform: glue:GetDatabases

you’ll need to request database access.

Python: To run queries and/or read data into a pandas DataFrame, use pydbtools. Remember to install the latest version!

R: There is currently no single recommended package for querying databases in R. There is dbtools which should work on the “old” platform. Rdbtools should work on the “new” platform. This package is maintained by the analytical platform user community.

What should I use to process my data?

Please refer to the tools and services page

I am running into memory issues using Python/R, what should I do?

You should do as much data manipulation in Athena/SQL as you possibly can, before reading into your analytical tool of choice. You should consider filtering out unnecessary rows or columns, or aggregating results if appropriate. The function create_temp_table in pydbtools is particularly useful for helping with this.

If the data is stored in your own S3 bucket, you may wish to create your own Athena database.

How do I create my own Athena database?

Please note: going forward, in order to create your own athena databases, you must specify the naming convention for any databases you create using your project access file in the database access repository, using the key allowed_database_names to either specify an exact name (e.g. finance_dim_dev_dbt) or a naming pattern (e.g. finance_dim_*). An example of this can be seen in this config file, which grants the project members the ability to create a number of databases. Owners of existing projects that rely on this functionality should update their projects accordingly. More info on this can be found in the readme of the database access repository.

Athena workbench/Python/R: You can run CREATE DATABASE and CREATE TABLE AS SELECT (CTAS) queries to create your own database and tables from data you have in S3. There are more details in this guidance or you can use what is provided by AWS. When running CTAS queries a key thing to remember is to specify the location (s3 bucket and path) of the data. here is a nice example of setting up your own database. The tutorial is in python but the SQL can be ran from any tool on the AP.

Create a Derived Table: When you create and run an SQL script with Create a Derived Table it will automatically build your database and table. See the user guidance for more information.

This page was last reviewed on 18 July 2024. It needs to be reviewed again on 18 July 2025 by the page owner #ask-data-engineering .