Data FAQs

What types of data are on the Platform?

Data on the Analytical Platform can largely be split into four categories:

  • Raw: data that has been uploaded to the Analytical Platform without any changes made to it
  • Curated: data that has been validated, deduplicated and versioned by the Data Engineers, ready to be used by Analytical Platform users
  • Derived: data that has been denormalized, aggregated, and turned into a data model to fit specific needs of Analytical Platform users
  • Processed: data that has been processed by Analytical Platform users to fit their own needs

Where do I find out what data is already on the Platform?

🚩 Note: in the future, the Data Catalog will be developed and will provide a useful resource for finding data

The Data Modelling and Engineering Team (DMET) maintains a number of curated and derived databases on the Platform. The best way to find out about these databases is the data discovery tool (access to the tool is now governed via GitHub; Analytical Platform users have access by default). The data discovery tool can be updated by anyone, so if you find out something about the data that isn’t already documented, please do add it. Derived tables can also be found by visiting the create-a-derived-table repository. For more information about curated databases please visit #ask-data-engineering, and for derived databases please visit #ask-data-modelling.

In addition, users can create their own S3 buckets, which may contain processed data useful to other teams; you may have to ask around to see if there is an existing dataset that suits your needs.

How do I gain access to existing data?

Access to databases is granted via the database access repository. Have a read through the guidance there. If you are finding the process a little tricky, please ask for help in #ask-data-engineering. A data engineer will happily guide you through things.

If you are looking for access to a user-created bucket, then the admin of that bucket should be able to grant you access. If you don’t know who the admin is, or they are not able to grant you access, then ask in the #analytical-platform-support Slack channel or via GitHub. If the bucket admin is unavailable, the Analytical Platform team will need to receive approval from your line manager before you can be given any access to the bucket.

Where should I store my own data?

⚠️ We can only hold 1,000 S3 buckets on the Analytical Platform. Before you create a new bucket, please check whether you can re-use an existing one, or create a domain-level S3 bucket and use folders with path-specific permissions to control project-level access.

Data should be stored in an S3 bucket. You can create a new S3 bucket in the AP Control Panel. Data can be uploaded manually via the AWS console (which can be accessed through the control panel) or you can write it from RStudio, Visual Studio Code or JupyterLab.
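
For example, here is a minimal sketch of writing a local file to a bucket from Python with boto3; the bucket name and file paths are hypothetical, so substitute your own:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and keys: replace with your own bucket name and paths
s3.upload_file("results.csv", "alpha-my-project", "outputs/results.csv")

# Downloading works the same way in reverse
s3.download_file("alpha-my-project", "outputs/results.csv", "results_copy.csv")
```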

If your data contains anything that could be considered personal information, you must follow guidance from the data protection team which can be found on the intranet.

How do I read/write data from an S3 bucket?

Python: You can read/write directly from S3 using a variety of tools, such as pandas, boto3 and awswrangler. However, to get the best representation of the column types in the resulting pandas DataFrame(s), you may wish to use mojap-arrow-pd-parser.
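
As a rough sketch, reading and writing Parquet with awswrangler might look like the following; the bucket and paths are hypothetical:

```python
import awswrangler as wr

# Hypothetical S3 paths: substitute your own bucket and prefixes
df = wr.s3.read_parquet("s3://alpha-my-project/curated/table/")

# Example transformation before writing the result back to S3
df["value_doubled"] = df["value"] * 2

wr.s3.to_parquet(df, "s3://alpha-my-project/processed/table/out.parquet")
```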

R: Whilst initially the recommended package was botor, you are also encouraged to try out Rs3tools, a community-maintained, R-native version of S3tools that removes some of the complexity around using Python.

How do I query a database on the Platform?

Databases on the AP use Amazon Athena, which allows you to query data using SQL. You shouldn’t need to know about Athena in detail to query databases on the AP, but if you are interested you may wish to read more about it. There are three ways you can query data (there is more detail on all three of these in the Data section of this guidance):

The Amazon Athena workbench: If you log into the AWS console and click Services -> Athena, you’ll see the Athena workbench. This is good for testing your queries.

Create a Derived Table: You can also use Create a Derived Table to run SQL statements to query Athena.

If you get an error like assumed-role/... is not authorized to perform: glue:GetDatabases, request database access.

Python: To run queries and/or read data into a pandas DataFrame, use pydbtools. More details are here. Remember to install the latest version!
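
For example, a minimal query with pydbtools might look like this (the database and table names are hypothetical; substitute your own):

```python
import pydbtools as pydb

# Hypothetical database and table: replace with names you have access to
df = pydb.read_sql_query("SELECT * FROM my_database.my_table LIMIT 10")
print(df.head())
```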

R: There is currently no single recommended package for querying databases in R. dbtools should work on the “old” platform, and Rdbtools, which is maintained by the Analytical Platform user community, should work on the “new” platform.

What should I use to process my data?

Please refer to the tools for processing my data guidance.

I am running into memory issues using Python/R, what should I do?

You should do as much data manipulation in Athena/SQL as you possibly can before reading the data into your analytical tool of choice. Consider filtering out unnecessary rows or columns, or aggregating results if appropriate. The create_temp_table function in pydbtools is particularly useful for this, as shown in the sketch below.
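
As a sketch, you can aggregate in Athena first and only read the small result into pandas; the database, table and column names here are hypothetical:

```python
import pydbtools as pydb

# Aggregate server-side in Athena rather than in pandas.
# Hypothetical names: substitute your own database, table and columns.
pydb.create_temp_table(
    "SELECT col_a, SUM(col_b) AS total FROM my_database.big_table GROUP BY col_a",
    table_name="summary",
)

# Temporary tables are queried via the special __temp__ database
df = pydb.read_sql_query("SELECT * FROM __temp__.summary")
```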

If the data is stored in your own S3 bucket, you may wish to create your own Athena database.

How do I create my own Athena database?

Athena workbench/Python/R: You can run CREATE DATABASE and CREATE TABLE AS SELECT (CTAS) queries to create your own database and tables from data you have in S3. There are more details in this guidance, or you can use what is provided by AWS. When running CTAS queries, a key thing to remember is to specify the location (S3 bucket and path) of the data. There is a nice example here of setting up your own database. The tutorial is in Python, but the SQL can be run from any tool on the AP.
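
As a rough sketch, running the DDL through pydbtools might look like the following; the database, table and bucket names are hypothetical, and it assumes pydbtools’ start_query_execution_and_wait is available for arbitrary SQL statements:

```python
import pydbtools as pydb

# Hypothetical name: replace with your own database
pydb.start_query_execution_and_wait("CREATE DATABASE IF NOT EXISTS my_project_db")

# CTAS query: note that external_location sets the S3 bucket and path
# where the new table's data will be written (hypothetical names throughout)
pydb.start_query_execution_and_wait(
    """
    CREATE TABLE my_project_db.offences_by_year
    WITH (
        external_location = 's3://alpha-my-project/database/offences_by_year/'
    ) AS
    SELECT year, COUNT(*) AS n_offences
    FROM my_database.offences
    GROUP BY year
    """
)
```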

Create a Derived Table: When you create and run an SQL script with Create a Derived Table it will automatically build your database and table. See the user guidance for more information.

This page was last reviewed on 18 July 2024. It needs to be reviewed again on 18 July 2025 by the page owner #ask-data-engineering.