Table of Contents | ||
---|---|---|
|
Summary
This case study is a brief report on working with large clinical data (52,491,975 records and 127,922 records) for research into child maltreatment and child mortality. It was clear from initial receipt of the data that analysing the data immediately with STATA and R would be impossible because of the number and size of files.
The work was carried out by the research group and the research centre's own epiLab computing service (MRC Centre of Epidemiology for Child Health Website, 2012).
This document includes:
- A profile of the data
- The challenges
- How using MySQL addresses these problems
- The researchers' perspective on why using MySQL was successful
- Technical limitations of using MySQL with clinical data
- Suggested further investigation
Using MySQL provided a fast, secure way of analysing, cleaning and sharing data and is now a permanent component in the researchers' toolset.
This does not cover the secure environment for processing, SQL tutorials or MySQL configuration guidelines.
Rationale
This case study was undertaken due to a research requirement and will hopefully provide some impetus for others to investigate using databases for data processing. The results are from a use case that continues to use MySQL for data management. This report is not a solution to all data management problems and should raise useful questions for readers.
Context
The data
Hospital Episode Statistics (HES)
The HES data supplied by the NHS Information Centre covers all NHS hospital episodes in England over the financial years 1997 to 2010.
Office for National Statistics (ONS)
The ONS data supplied by the NHS Information Centre covers all registered deaths after on or after 1st January 1998 with a hospital episode registered after the start of the financial year 1997 for ages 30 and under.
Dimensions
Source | Number of records (rows) | Number of variables (fields) | Number of data files | Total size of data files |
---|---|---|---|---|
HES | 52,491,975 | 432 | 14 | 60 gigabytes |
ONS | 126,922 | 27 | 1 | 20 megabytes |
The challenge
- Aggregate the data
- Store the data securely
- Access the data with STATA or R
- Append further datasets as they are received
- Size of the data shouldn't be a worry
- Share the data with other approved users
Results
Researchers' persepective
The following headers are based on quotes from researchers.
Able to manage large datasets that wouldn't have been possible before MySQL
The dataset was supplied as 14 separate text files totalling ~60GB. The data are aggregated into one database table so criteria could be done using SQL whereas the the other option was to load as much data as possible into memory and query each file separately.
Data are maintained in a secure environment where only approved users can share and use the data
The data are stored in a secure computing environment with access to limited to specific users. Even after the users are logged in to the environment MySQL's own authentication and authorisation controls access to the data (MySQL Website, 2012). Data extracts can be shared without having to export files - a user simply authorises another user to access a database table. This table can be directly accessed by STATA or R. MySQL becomes a central repository removing the need to manage folders and files spread over different locations.
Processing time has improved when managing large datasets
Correct use of indexes on variables provides fast performance when making selection criteria. Indexes can avoid the need for MySQL to scan every row for records that match the criteria.
Transferring data between the database and statistical software is simple
To analyse data stored in MySQL STATA and R can connect directly to the database and data tables there is no need for any export to CSV phase. The same is true for loading data into MySQL as STATA has been used to directly export data to MySQL.
SQL is easy to learn and use
SQL is a declarative language and the syntax and semantics can be learned quickly. After tutorials totalling less than two days users were able to run their own queries. The tutorials included MySQL database management - not just the SQL language.
"We have been able to extract and analyse larger subsets of data and progress to other areas of study that we could not with limited computational resources"
Conclusion
From the researchers' satisfaction using MySQL for data management has proved useful. Not only were they able to progress with their work where this would have been difficult but another approach has been added to their analysis arsenal. After approximately 2 days training and occasional support sessions researchers were able to construct database queries to extract subsets of data.
Experienced database users would notice that data stored in MySQL for this use case does not comply to first normal form (1NF) and consequently other levels of normalisation. An example is the DIAG variable which is coded as DIAG01, DIAG02... DIAG20 using a total of 20 columns allowing null values instead of a relation to another table which stores only one DIAG value per row. This structure would not be suitable for dimensions of data where the number of variables exceeds MySQL's design specifications (MySQL Website, 2012). Taking a dataset, decoding the variables and normalising is possible but researchers would then have to de-normalise and recode to be able to use the data.
As mentioned some datasets would have a number of variables that could not immediately be stored using MySQL without restructuring the data. Another approach could start with an initial load of all data to a database such as Cassandra that can have 2 billion variables / columns per row (Cassandra Wiki, 2012) exceeding any requirements I've encountered. An example workflow to access the data would be:
- Specify selection criteria for a subset of variables
- Send the request to a queue to be processed
- The data is exported to MySQL
- The user is notified
- The user connects their statistical software to the new data in MySQL
Although this use case has been a valuable learning experience further investigation and work is required:
- Identify a pattern of working and create abstract reusable guidance on data analysis using databases
- Create an extensible library of any scripts or tools created
- Investigate the use of other databases: Hadoop, Cassandra, Caché...
- Hybrid database approaches - using a BigData database and an ODBC compliant database
- Automation of export, transform and load (ETL) - i.e., using Pentaho Kitchen
- Languages with database support for data analysis: Python with SciPy
Update - 13 May 2013
We are starting a new design phase for this project as the size of the research team has increased and they require a new working environment. The team are more familiar with MySQL and SQL so I will be able to normalise the database design, they are also eager to learn and use Pentaho Data Integration.
New features
- Moving the secure virtual computing environment to an ISO27001 (BSI Group Website, 2013) environment
- Normalising the database structure
- Using Pentaho for transformations / jobs such as denormalising the data for statistical analysis
- Adding more virtual desktops
- Adding Mercurial source control for STATA and R files
Staying the same
- Storing data in MySQL
- Using STATA and R
References
BSI Group Website (2013). ISO/IEC 27001 Information Security. Available at http://www.bsigroup.co.uk/en-GB/iso-27001-information-security/ (Accessed: 13 May 2013)
Cassandra Wiki (2012). Limitations. Available at http://wiki.apache.org/cassandra/CassandraLimitations (Accessed: 1st October 2012)
Hadoop Website (2012). Hadoop. Available at http://hadoop.apache.org/ (Accessed: 1st October 2012)
InterSystems Website (2012). Caché. Available at http://www.intersystems.com/cache/ (Accessed: 1st October 2012)
Mercurial Website (2013). Mercurial SCM. Available at http://mercurial.selenic.com/ (Accessed: 13 May 2013)
MRC Centre of Epidemiology for Child Health Website (2012). epiLab. Available at http://www.ucl.ac.uk/ich/research-ich/mrc-cech/data/epiLab (Accessed: 1st October 2012)
MySQL Website (2012). The MySQL Access Privilege System. Available at https://dev.mysql.com/doc/refman/5.5/en/privilege-system.html (Accessed: 1st October 2012)
MySQL Website (2012). Limits in MySQL. Available at http://dev.mysql.com/doc/refman/5.1/en/limits.html (Accessed: 1st October 2012)
Pentaho Wiki (2012). Kitchen Documentation. Available at http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation (Accessed: 1st October 2012)
R Website (2013). The R Project for Statistical Computing. Available at http://www.r-project.org/ (Accessed: 13 May 2013)
SciPy (2012). SciPy. Available at http://www.scipy.org/ (Accessed: 1st October 2012)
STATA (2013). STATA: Data Analysis and Statistical Software. Available at http://www.stata.com/ (Accessed: 13th May 2013)