Chapter 1 Introduction to Reproducible Research
“Only results that can be replicated are truly scientific results. If there is no chance to replicate research results, they can be regarded as no more than personal views in the opinion or review section of a daily newspaper.” (Huschka 2013)
1.1 What is reproducible research?
Scientific journal editors and research funders are increasingly promoting transparency in research. To encourage the principles of reproducible research, the related institutions are requesting authors to make their research reproducible. To overcome criticisms regarding the validity and power of empirical tests, this means that data and program code need to be shared upon manuscript acceptance, or at earlier stages of the submission process. The purpose is that third-parties have the possibility to reproduce the content, analysis, and conclusions of a study on their own.
Efforts to increase reproducibility have been expressed by institutions within the social sciences including:
- The best practices statement by the Social Science Data Editors,
- In Germany, by the German Research Foundation (DFG) and the Consortium for the Social, Behavioral, Educational and Economic Sciences RatWSD
- In editorial statement of journals, for example, American Economic Review, Management Science or the Journal of International Business Studies (Meyer, Witteloostuijn, and Beugelsdijk 2017; Orozco et al. 2020)
Generally, reproducibility of research can be defined as “the ability of a researcher to duplicate the results of a prior study using the same materials and procedures as were used by the original investigator. In an attempt to reproduce a published statistical analysis, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis to determine whether they yield the same results.” (Bollen et al. 2015). Study results are considered reproducible if after an article publication another researcher can conduct the analyses using identical data and obtain the same results using the material provided.
As empirical research is based on the application of a code to a dataset to answer a pre-defined research question, ensuring the reproducibility includes sharing the data and code to allow others to re-analyze the data and to reproduce the reported results (Orozco et al. 2020). To achieve this, the data and code need to be properly managed while working on a project.
Another concept closely related to reproducibility is the concept of replicability that “refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.” (Bollen et al. 2015). Thus, for example, if an investigator tries to replicate a scientific finding that documents a relationship between two or more variables by using the same scientific methodology but in a new setting, i.e. with new data, and fails to reach a similar conclusion, i.e. replicate it – a failure to replicate occurs. The opposite is said to be true if the results are replicated. Thus, reproducibility and replicability are considered to be the two main elements of empirical research.
To adapt reproducible practices early on, undergraduate and graduate students are encouraged to perform reproducible research in their term papers, theses, and practical applications as soon as possbile. For this reason, university teachers are increasingly asking to submit reproduction material (source data information, data programming and analysis code) at all stages of a study.
The concept of reproducible research is not new and goes back as far as the late 1800s (Vilhuber 2020). However, reproducibility studies have disclosed that many researchers do not follow the principles of reproducible research. At the same time, the increase in availability and use of public and especially non-public data sources, and the increase in the reliance of research methods based on specific software brings the principles of reproducible research on the agenda of many researchers.
The resources boxes in each chapter provide material for best practices in data and code management to perform a reproducible research project, as well as in cleaning the data and conducting data analysis.
1.2 Why you, as a student and researcher, should care about reproducible research?
1.2.1 Avoid common biases that lead to biased or false research results
There is a number of threats to the reproducible research process that can undermine scientific research or lead to false or biased conclusions and publications. Figure 1.1 illustrates the main threats to reproducible science (Munafò et al. 2017).
One of the most fundamental threats when using secondary data is hidden researcher bias which occurs when decisions about data collection, preparation and analysis influence data selection and other steps in such a way that the identified empirical effects considerably change from these decisions (Huntington-Klein et al. 2021).
1.2.2 Increase productivity of your work and the work of the scientific community by performing reproducible research
Following the reproducible research principles allows you to be a part of a good academic practice that strives to improve the quality, efficiency and reliability of scientific research.
1.3 Main steps to generate reproducible research
To follow the best practices of reproducible research, you will need to consider how to perform your project and the steps a reproducible research project contains.
You may follow three main principles to enhance reproducible research (Orozco et al. 2020):
- Organize your work: consider and plan your steps at the beginning of the project
- Code for others: set up each step of your project such that an outsider could follow your documentation
- Automate as much as you can: avoid processing analyses and results using point-and-click softare (MS Excel), export results directly and create a reproducible project documentation.
In this study guide we follow the main steps of performing a reproducible data science project (Bezjak et al. 2018). To make your empirical research study process reproducible, you need to follow these five steps (Figure 1.2):
1.4 Secondary (health care) data
This study guide concentrates on empirical investigation using secondary data, that means data that are not originally collected or generated by the researcher for the purpose of the study. Secondary data cover any existing data generated by companies, institutions, and individuals.
Examples of secondary data sources are routinely collected health data (administrative claims, electronic medical records), bibliometric data, survey data, regulatory data, data generated in mobile applications.
1.5 Creating an environment for productive research projects
1.5.1 Consider using agile methods to organize your workflow and tasks
To set up your project, consider engaging project management tools. One suitable method is working agile or methods based on SCRUM (Pirro 2019; Santiago-Lopez 2019). Originating in software development, working agile is highly suitable in performing research projects. A planning protocol could include the following steps (as adapted from (Pirro 2019)).
Split the work. Slice a big chunk of work into several layers of activities, in which each layer is characterized by a tangible result to be obtained. Each layer is addressed in a dedicated, limited period of time (for example, 2–12 weeks), called a sprint.
Sprint planning. Meet your supervisor and any other stakeholders in a short meeting (around 30 minutes) with the aim of defining the goal of your sprint (for example, what you want to investigate) and its duration (1-4 weeks, for instance). Everybody has to agree on these two points, so that expectations are aligned and the whole research team is on the same page. On this occasion, the sprint-review meeting (see step five) can be scheduled.
Sprint execution. Work! Maximum focus is required on a specific task for a limited amount of time. You can do it, keep momentum.
Weekly scrum. Meet your supervisor for a maximum of 15 minutes ideally every week (for example, the same time slot every week and outside of conventional working hours to ensure there are no commitments, such as meetings or teaching activities, to get in the way). This meeting has to be short and efficient — try to have a stand-up meeting with no laptops or papers. Only three questions need to be addressed: what was done the previous week to contribute to the goal? (For example, which data programs were developed? Which analyses were performed?) What will be done next week to contribute to the goal? (For example, what analyses will be performed next?) And, are there any impediments? (For example, is the set-up working properly? Are all the materials and data needed available?)
Sprint review, retrospective and planning. At the end of the sprint, meet all of the stakeholders to discuss results and whether those are in line with expectations (review). Take some time to go into detail and do some analytical brainstorming together. Discuss the difficulties encountered, so that the next sprint is better than the previous one (retrospective). This is the phase for ‘impediment removal’, or problem solving.
Honesty and transparency are crucial. Agile is all about adapting to change: plans can change. Go back to step one and restart the planning, addressing the next layer of work in a new sprint.
You can execute the same routine even if you work independently.
1.5.2 Collaborate with others and get feedback
Collaborate: Before you start running the study you should assess your available resources, that means time and own expertise to see whether the study design is feasible. If there is an opportunity to work collaboratively and you think that the project can benefit from the expertise of another researcher, you can approach the person and ask whether they would be interested to work together on the project. Joining resources might relax your resource availability constraints. It may further ensure that you engage in a reproducible research process.
Plan to get feedback - create your personal senior advisory board: Besides your collaborators, set up your personal senior advisory board of the project. This could be anybody with experience related to your research project (other students, PhD candidates). Define when you plan to meet your advisors and present the progress of your work. Define the specific competencies with which seniors can help you with your project. This could include advising on study design, methods, programming or interpretation of the results. If you work on a longer-term project, find suitable workshops and conferences to present your work and get feedback. Do not forget to account for time and resources to proof-read and edit your research findings in a paper.
1.6 Resources Box
An overview article of the principles of reproducible research and practical application: Orozco V, Bontemps C, Maigné E, Piguet V, Hofstetter A, Lacroix A, et al. How to Make a Pie: Reproducible Research for Empirical Economics and Econometrics. Journal of Economic Surveys. 2020;34(5):1134–69. https://doi.org/10.1111/joes.12389
A guideline to minimize the risk of reporting false positives (type I errors), improve the quality of hypothesis-testing research and statistical reporting: Meyer, Klaus E., Arjen van Witteloostuijn, and Sjoerd Beugelsdijk. 2017. “What’s in a p? Reassessing Best Practices for Conducting and Reporting Hypothesis-Testing Research.” Journal of International Business Studies 48 (5): 535–51. https://doi.org/10.1057/s41267-017-0078-8.
1.7 Checklist to get started with your reproducible project
- Define the resources available and needed to perform your reproducible research project.
- Set up a working plan that includes important milestones and minimal goals to achieve.
- Decide for a project management method to organize your work, for example, agile methods or SCRUM.
- Define the tools and software for management of references, notes and editing of text and statistical software you will be using. Obtain access if needed.
- Define relevant collaborators and stakeholders critical in persuing your research project.
- Define important milestones and achievements.