Reproducible computational research
by Guido España
May 16, 2019
Introduction
Researchers who try to replicate findings from other researchers face several difficulties. For instance, the instructions could be incomplete, and the code and data might not be available. eLife’s first computationally reproducible article aims to solve this issue by allowing readers to interact with the code that generates every figure in the paper. This open-source project (Stencila) is a step towards reproducible research. However, there are other obstacles that authors need to address in their own work habits to produce research findings that can be reproduced by others and, more critically, by themselves.
Five obstacles towards reproducible research
Publishing a manuscript is an iterative process with several steps that go from a preliminary idea, to an initial draft, and, hopefully, to publication. This long process involves many changes in the manuscript, code, and figures. These changes are due to numerous revisions from collaborators and journal reviewers, or from the researchers themselves. Many things can get in the way of reproducibility. Personally, I have had problems keeping results up to date because there are often disconnected or manual steps in the process of gathering and analyzing data and creating figures or tables. It is particularly difficult to remember all the steps to create a figure several months after the first draft of the manuscript.
In my opinion, there are five main obstacles in the path of reproducible computational research:
- Keeping results up to date
- Remembering previous versions of the manuscripts
- Collaborating with co-authors
- Responding to reviewers
- Sharing reproducible research with other researchers
File management
The first step towards reproducibility is file management. Files should be named so that they are easy for both humans and computers to understand, and easy to recall. For humans, this means that names should be meaningful; for computers, it means that file names should not contain problematic characters (spaces, commas, periods other than the one before the extension, or symbols like #$%^). Directory structure is also important to keep research files organized. For instance, to differentiate the manuscript files from the scripts and data, one could have a main directory (projectX) and two subdirectories (projectX/manuscript and projectX/analysis). A couple of other files are important to have as well: a README and a Makefile. A README file explains the overall structure of the project and the steps to reproduce the manuscript. The Makefile connects all the files necessary to generate the manuscript, so that the document and its results are always up to date.
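As a rough illustration of these naming and layout conventions, here is a minimal Python sketch. The function names (is_safe_name, create_layout) and the exact naming rule (letters, digits, underscores, and hyphens, plus one optional extension) are assumptions made for this example, not a standard; adjust them to your own project.

```python
import re
from pathlib import Path

# Assumed rule: letters, digits, underscores, and hyphens,
# followed by at most one extension such as ".docx".
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_name(name: str) -> bool:
    """Return True if the file name avoids spaces, commas, extra periods, and symbols."""
    return SAFE_NAME.fullmatch(name) is not None


def create_layout(root: Path) -> None:
    """Create the manuscript and analysis subdirectories plus an empty README and Makefile."""
    for sub in ("manuscript", "analysis"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    for name in ("README", "Makefile"):
        (root / name).touch()


if __name__ == "__main__":
    create_layout(Path("projectX"))
    for candidate in ("20190101_manuscript_demo.docx", "my file#1.docx"):
        print(candidate, is_safe_name(candidate))
```

Running the script creates projectX with the two subdirectories and the two placeholder files, then reports which of the candidate names pass the assumed rule.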
Make files
GNU make is a tool to build files (often executable files) that depend on many other files. It is often used to compile C code. One of its main features is that it does not rebuild every file, only those that have changed or whose dependencies have changed. Makefiles can be used to connect all the files needed for a manuscript, such as those used to collect data, create figures, or even build the manuscript itself in LaTeX. Because only files that depend on recently changed files are rebuilt, a Makefile can save time while keeping the manuscript up to date. There are many tutorials online. This one from Karl Broman is a pretty good introduction.
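To make the "only rebuild what changed" idea concrete, here is a small Python sketch of the timestamp check that make applies to a single target. It is a simplification written for this post, not how GNU make is implemented: the function name needs_rebuild and the file names are hypothetical, and real make also walks the full chain of dependencies.

```python
from pathlib import Path
from typing import Iterable


def needs_rebuild(target: Path, dependencies: Iterable[Path]) -> bool:
    """Return True if the target is missing or older than any of its dependencies."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    # A dependency modified after the target means the target is out of date.
    return any(dep.stat().st_mtime > target_mtime for dep in dependencies)


if __name__ == "__main__":
    # Hypothetical files: a figure produced by an analysis script from a data file.
    figure = Path("analysis/figure1.pdf")
    inputs = [Path("analysis/figure1.R"), Path("analysis/data.csv")]
    if all(p.exists() for p in inputs):
        print("rebuild figure1.pdf:", needs_rebuild(figure, inputs))
```

In a Makefile the same relationship is written declaratively, as a target followed by its dependencies and the command that rebuilds it, and make performs this check for every target.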
Using Git + LaTeX / knitr
Keeping track of document versions is vital for reproducible research because one should be able to retrieve old versions of the manuscript. A common practice seems to be to keep track of versions by naming files like manuscriptFinal.docx, manuscriptFinal2.docx, manuscriptFinalVersion2-1.docx, manuscriptFinalVersionReal.docx, etc. There are more elegant solutions, such as including the date in the file name: 20190101_manuscript_demo.docx. An even better solution is to use Git and LaTeX. Git is a distributed version control tool to manage projects, designed to track changes and to allow many collaborators to contribute to a project. LaTeX is a plain-text typesetting system that is mainly used for technical writing.
For a manuscript, Git can help keep track of changes in the code and in the text. Because LaTeX is written in plain text, Git can be used to detect changes between versions of a manuscript. While MS Word has a handy tool to track changes, this can increase the size of the file to the point of slowing Word down. Moreover, track changes in MS Word isn’t helpful for keeping track of document versions. In contrast, Git provides tools to remember different versions of the manuscript (git tag) and to collaborate with others (git branch, git merge). There are some tutorials online that explore the capabilities of Git + LaTeX. This quickstart guide for LaTeX and Git is a good start.
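The reason plain text matters here is that two versions can be compared line by line. The sketch below uses Python's difflib module as a stand-in to show the idea; Git uses its own diff machinery, and the two manuscript versions and the function name changed_lines are made up for this example.

```python
import difflib
from typing import List


def changed_lines(old: List[str], new: List[str]) -> List[str]:
    """Return the removed ('-') and added ('+') lines between two versions."""
    diff = list(
        difflib.unified_diff(
            old, new, fromfile="v1/manuscript.tex", tofile="v2/manuscript.tex", lineterm=""
        )
    )
    # The first two lines are the file headers; the rest are hunk markers,
    # context lines, removals, and additions.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    # Hypothetical manuscript text, one sentence per line.
    version1 = ["\\section{Results}", "The effect was large.", "See Figure 1."]
    version2 = ["\\section{Results}", "The effect was moderate.", "See Figure 1."]
    for line in changed_lines(version1, version2):
        print(line)
```

Writing one sentence per line in the .tex file keeps such comparisons small and readable, since a change to one sentence then shows up as a single changed line.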
In addition to plain LaTeX files, LaTeX documents can be written with Sweave/knitr to create manuscripts that are fully connected with the code for figures and tables, keeping the manuscript always up to date. Knitr can also be combined with GitHub to keep track of versions. Christopher Gandrud’s book on reproducible research is a very good guide to learn the basics of knitr, LaTeX, and reproducible research with R. For Emacs enthusiasts, org-mode is a powerful alternative. See this tutorial for an introduction to manuscript writing with org-mode.
Pandoc
Using LaTeX (or knitr, or org-mode) would be ideal for anyone collaborating with a team of LaTeX users. However, this isn’t always the case. In many fields, MS Word is the prevalent word processor. If you want to use the capabilities of Git and LaTeX, but you still want to share Word documents with others, pandoc is probably the best option to convert between .tex and .docx documents. A brief introduction for this purpose is Alexander Branham’s guide to pandoc. Using pandoc, LaTeX documents can be converted to MS Word to share with collaborators, and their changes can then be incorporated into the main LaTeX document.
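As a sketch of how this conversion step could be scripted, the Python snippet below wraps the basic pandoc call (pandoc input -o output, where pandoc infers the formats from the file extensions). The function names and file names are assumptions for illustration, and real manuscripts with citations or custom macros will likely need extra pandoc options that are not shown here.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pandoc_command(source: Path, target: Path) -> List[str]:
    """Build the pandoc command line that converts source into target."""
    return ["pandoc", str(source), "-o", str(target)]


def convert(source: Path, target: Path) -> None:
    """Run pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target), check=True)


if __name__ == "__main__":
    # Hypothetical file names inside the project layout described earlier.
    source = Path("manuscript/manuscript.tex")
    if source.exists():
        convert(source, Path("manuscript/manuscript.docx"))
```

Swapping the two arguments converts a collaborator's .docx back to .tex, although the result usually needs manual cleanup before it is merged into the main document.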
GitHub/GitLab
GitHub or GitLab can be used to create a remote repository and share it with collaborators. If the workflow to create the manuscript has been automated, then sharing with others should be straightforward with GitHub or GitLab, where other researchers could replicate the manuscript results and use them in their own research. To share with the scientific community, a manuscript published in a scientific journal could include the URL for the GitHub repository with all the necessary steps to create the manuscript.
Other resources
This post is a brief summary of tools that are useful for reproducible computational research. A lot of the content in this post comes from Gandrud’s book on reproducible research. I also created some slides and some demos of reproducible manuscripts using the tools mentioned here: LaTeX, knitr, and org-mode. You can download the slides here.
Furthermore, there are many more materials online that are helpful:
- I found the book by Christopher Gandrud to be an excellent resource. It is available here, with the source code available here.
- Another resource for GNU make.
- Rob Hyndman’s post about makefiles with LaTeX.
- Pull it together is a very good example of how to automate the process of writing papers with GNU Make.
- Workflow example of git + LaTeX
- Git tutorial
- Another org-mode guide for writing scientific papers