Existing open source technologies enable researchers to conveniently manage and execute reproducible, shareable workflows in empirical research projects: For example, R Markdown can be used to create reproducible documents due to its ability to combine executable data analysis code and Markdown text in the same file. GitHub is a convenient solution for storing and versioning source code, and the Open Science Framework provides a platform for other Open Science needs, such as pre-registrations or sharing data with colleagues and reviewers.
However, one step that is missing from current workflows is version control of the underlying data. Real-world data, such as logs, comments, and reviews, are usually messy and require substantial processing. In addition, the data may be updated during the project, e.g., when data collection is ongoing (as is the case for many COVID-19 working papers). This can result in many different versions of datasets over the course of a research project (e.g., v2, v3, featureXpresent, featureYabsent) and can make an analysis conducted a few weeks earlier confusing and hard to reproduce. Moreover, the datasets that researchers in the social sciences and other fields work with are becoming increasingly large, often exceeding several gigabytes. Simply re-running the entire pre-processing pipeline when compiling an R Markdown document can take hours. Versioning intermediate datasets with Git is often not feasible because of the storage restrictions of code repositories and the technical difficulty of handling large files efficiently.
Several open source tools promise to alleviate this problem by offering ways to version intermediate datasets and statistical models. One tool that has recently gained attention is Data Version Control (DVC), which creates small metadata files about data versions that can be tracked via Git. Other tools, such as Dolt or Git Large File Storage, promise similar benefits. However, it is unclear whether versioning data in an empirical research project, as opposed to the more complex and dynamic machine learning environments found in companies, benefits transparency and is easily applicable for researchers with less technical know-how. Furthermore, few insights exist as to whether data versioning facilitates the sharing of intermediate results and allows outsiders to gain a better understanding of the research process.
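To illustrate the core idea behind DVC-style data versioning, the following minimal Python sketch hashes a (large) data file and writes a small JSON "pointer" file that can be committed to Git in place of the data itself. This is not DVC's actual implementation; the function name, file names, and metadata layout are invented for illustration, though the use of an MD5 content hash mirrors how DVC's `.dvc` metafiles identify data versions:

```python
import hashlib
import json
from pathlib import Path


def write_pointer(data_path, pointer_suffix=".meta.json"):
    """Hash a data file and write a small JSON 'pointer' file next to it.

    Only the pointer file would be committed to Git; the (large) data
    file itself is kept out of the repository, e.g., via .gitignore,
    and stored in a cache or remote storage keyed by its hash.
    """
    data_path = Path(data_path)
    md5 = hashlib.md5()
    with data_path.open("rb") as f:
        # Read in 1 MB chunks so multi-GB files do not exhaust memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    pointer = {
        "path": data_path.name,
        "md5": md5.hexdigest(),
        "size": data_path.stat().st_size,
    }
    pointer_path = Path(str(data_path) + pointer_suffix)
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer


if __name__ == "__main__":
    # Hypothetical example: version a pre-processed dataset.
    Path("comments_clean.csv").write_text("id,text\n1,hello\n")
    meta = write_pointer("comments_clean.csv")
    print(meta["md5"])
```

Because the pointer file is tiny and changes whenever the data changes, Git history then records exactly which data version each analysis used, without storing the data in the repository.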
My project has three main goals: First, I plan to evaluate suitable tools for data versioning in empirical research projects along several dimensions (e.g., transparency, reproducibility, complexity, flexibility, costs). Second, as a proof of concept, I want to incorporate the most promising tool into my ongoing research project on newcomer integration in online communities. Third, I plan to share my experiences with the process in a blog and to offer an online appendix to the research project that enables other researchers to run analyses on different versions of the data.
An overview and evaluation of tools for data versioning is relevant to many disciplines. The concrete approach is clearly formulated. The contribution to Wikimedia projects could be expanded: one option would be a corresponding Wikipedia article with a tabular overview of the results, where the table is populated from entries in Wikidata. Beyond a pure evaluation, a theoretical framework would also be desirable; here, the activities of the Research Data Alliance (RDA) on the topic of data versioning (Data Versioning WG) could be taken into account, for example.
Version control for large data is an important topic, and it would be helpful to other scientists to have an overview and comparison of the available options. My only reservation about this project is that it seems relatively small in scope, and I am unsure how much it would benefit from attaining a "fellowship" status.
Data management plan (in progress)
- Name: Florian Pethig
- Institution: Universität Mannheim
- Contact: firstname.lastname@example.org