Data Management
Scientific data are stored in multiple formats, whether as data, text, images, or other formats. There are several aspects to data management, including a proper management plan, metadata set-up, data quality checks, data storage, and data sharing/access. A well-thought out data structure saves cost for later statistical analysis, see for example "Tidy data."
Overview poster: Scientific data management and sharing
If you need help with your data management or thinking through a data management plan, please contact CSTAT:
The National Information Standards Organization (NISO) defines metadata as "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource." A data dictionary is an example of variable-level metadata which consists of information such as variable and value labels, data type, measurement unit, plausibility limits, codes for missing values or permitted jumps. Study or process level information relates to the background of a data collection such as the study protocol (e.g., data sources, data collection methods), or recruitment (e.g. target population, inclusion and exclusion criteria, sampling methods, time of data collection).
Common data standards: Sharing data and reusing in other studies can raise syntactic to semantic problems. If researchers convert their local data to a common schema, then it data can be understood by others or analyses can be organized centrally. Examples are OMOP, HL7 FHIR. But the transformation can be complex. Alternatively, datasets may be a collection of common data elements where a standard format is used. For example, ISO 8601 is an international standard for date-time.
Data management may also include a plan on checking on the data quality. For routinely collected data such as electronic health records or administrative databases, data cleaning steps should be implemented regularly and data screening or initial data analyses can be incorporated throughout.
For data sharing it is necessary to augment the dataset with metadata. This can include information about data properties or code how to access the data for analysis. This helps other researchers or your now future Self to use the data in a responsible manner. Data can be published in a citable format (e.g. Anatomy of a data note), and there are discipline specific repositories.
Resources
- Worksheet Dr. Scout Calvert's Research data management worksheet
- NIH DMP format: Template
- NIH: Examples of Data-Sharing Plans
- MSU's links to resources: Data management plans
- ICPSR: Sample Data Management Plan for Social and Political Science Data
- NEH-ODH: Data Management Plans from Successful Grant Applications
- Anonymization of quantitative and qualitative data (UK): Link
- Metadata: data level and study level documentation
References (with CSTAT co-authors):
Baillie M, le Cessie S, Schmidt CO, Lusa L, Huebner M. Ten simple rules for initial data analysis. PLoS Comput Biol 2022; 18(2): e1009819. https://doi.org/10.1371/journal.pcbi.1009819
Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, Huebner M, Schmidt B, Sauerbrei W, Richter A. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Meth 2021; 21(1): 1-15. Link
Lusa L.; Huebner M. Organizing and Analyzing Data from the SHARE Study with an Application to Age and Sex Differences in Depressive Symptoms. Int. J. Environ. Res. Public Health 2021, 18, 9684. https://doi.org/10.3390/ijerph18189684
Accompanying repository: Lusa L, Huebner M. Repository with R vignettes to support data organization and analysis for the SHARE study. https://doi.org/10.17605/OSF.IO/KGTX6
Huebner M, le Cessie S, Schmidt CO, Vach W . A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192. https://doi.org/10.1353/obs.2018.0014