Computational notebooks as a tool for productivity, transparency, and reproducibility
The current biodiversity crisis, which threatens 1 million species with extinction (IPBES, 2019), poses the greatest challenge for ecologists, who need to understand and predict it, and for conservationists and politicians, who need to manage it. Besides their intrinsic eco-evolutionary value, non-human species provide essential services for humans (IPBES, 2019). For example, insects contribute to crop pollination (Bartomeus & Dicks, 2019) and pest control (Martin, Reineking, Seo, & Steffan-Dewenter, 2015), while forests store carbon that would otherwise remain in the atmosphere and contribute to climate change (Sullivan et al., 2020). Thus, understanding and controlling biodiversity loss is of utmost importance for human life.
During my PhD, I have addressed why some species survive longer than others and, more importantly, how managing biological processes and environmental conditions could improve conservation policies and ultimately avoid extinctions. To tackle this, I develop and use computational simulation models that combine theoretical knowledge and empirical data to simulate ecosystems, draw conclusions, and make forecasts. Simulation models can be rather complex, and thus simplifications and theoretical assumptions are necessary to ensure that models are computationally feasible and results interpretable. Such assumptions must be made clear when reporting model results, so that scientific and public audiences can interpret them carefully.
The debates surrounding the occurrence and consequences of climate change and, even more timely, the COVID-19 pandemic, have shown the need to carefully present complex research ideas, especially to the general public. Having produced this type of content before, to a great response from non-scientists, I intend to keep doing so for my research for as long as possible. Most importantly, throughout my studies, I have come to realize that while final results are well communicated through scientific articles, the raw analytical process, so crucial for trusting the results presented, is often inaccessible. Colleagues before me came to the same conclusion, which prompted an increasing call for reproducibility in ecology (Culina, Berg, Evans, & Sánchez-Tójar, 2020; Mislan, Heer, & White, 2016). The increasing complexity and speed of development of project-specific analytical methods require the establishment of quality standards and review processes (Borregaard & Hart, 2016). The practice of using computational notebooks can fill this gap.
Computational notebooks are documents containing descriptive text with relevant code and results, combined in a narrative order that documents all stages of the research process (Rule et al., 2019). They can be produced with open-source software, such as Emacs or RStudio, and are meant to be published alongside the article they relate to. Despite their advantages, producing such a document can seem overwhelming, since it deceptively appears to add work to the already demanding scientific publication process. In my experience, this is not the case. Notebooks actually increase productivity by centralizing writing and analysis, which are then reorganized into traditional publication formats (e.g. main text, figure files, and supplementary material). To popularize this means of open science, I propose to develop a starter kit that facilitates the use and integration of such notebooks in the publication workflow.
Bartomeus, I., & Dicks, L. V. (2019). The need for coordinated transdisciplinary research infrastructures for pollinator conservation and crop pollination resilience. Environmental Research Letters, 14(4), 045017. doi: 10.1088/1748-9326/ab0cb5
Borregaard, M. K., & Hart, E. M. (2016). Towards a more reproducible ecology. Ecography, 39(4), 349-353. doi: 10.1111/ecog.02493
Culina, A., Berg, I. v. d., Evans, S., & Sánchez-Tójar, A. (2020). Low availability of code in ecology: A call for urgent action. PLOS Biology, 18(7), e3000763. doi: 10.1371/journal.pbio.3000763
IPBES. (2019). Chapter 2.2. Status and trends. In E. S. Brondizio, J. Settele, S. Díaz, & H. T. Ngo (Eds.), Global Assessment on Biodiversity and Ecosystem Services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). IPBES Secretariat, Bonn, Germany. Retrieved 2019-06-18, from https://www.ipbes.net/global-assessment-biodiversity-ecosystem-services (Draft chapters)
Martin, E. A., Reineking, B., Seo, B., & Steffan-Dewenter, I. (2015). Pest control of aphids depends on landscape complexity and natural enemy interactions. PeerJ, 3, e1095. doi: 10.7717/peerj.1095
Mislan, K. A. S., Heer, J. M., & White, E. P. (2016). Elevating the status of code in ecology. Trends in Ecology & Evolution, 31(1), 4–7. doi: 10.1016/j.tree.2015.11.006
Rule, A., Birmingham, A., Zuniga, C., Altintas, I., Huang, S.-C., Knight, R., ... Rose, P. W. (2019). Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks. PLOS Computational Biology, 15(7), e1007007. doi: 10.1371/journal.pcbi.1007007
Sullivan, M. J. P., Lewis, S. L., Affum-Baffoe, K., Castilho, C., Costa, F., Sanchez, A. C., ... Phillips, O. L. (2020). Long-term thermal sensitivity of Earth’s tropical forests. Science, 368(6493), 869–874. doi: 10.1126/science.aaw7578
This project addresses an important (and rather common) problem of research transparency and reproducibility in the field of ecology and biodiversity, where mathematical modelling is a go-to research tool. However, the exact computational methodology is often not shared in sufficient detail for reproducibility. Computational notebooks help close this gap by publishing the code, computational outputs and comments. Their use, however, is not as widespread as it could be, due to the perception of extra work on the part of researchers preparing a publication. This project uses a two-pronged approach: 1) sharing the specific notebooks along with a scientific publication and 2) development of a "starter kit" that will help other researchers to incorporate this practice into their routine. I find the project proposal relevant to open science (especially in conjunction with the reproducibility aspect), feasible (the applicant seems to have previous relevant experience that could be built on in a reasonable time) and impactful (thanks to the development and promotion of this practice via the "starter kit" in the research community).
A generally convincing proposal with a clear scope and well-defined milestones. Yet the main product (documentation of computational notebooks) is a "common product" in that many such documentations already exist and are freely available. As a reviewer, I would have liked to read more about what distinguishes the planned documentation of computational notebooks from existing ones.
Project Book
Timeline of work (with justification of delay where given)
- Publication of Figueiredo et al. (under review): rejected and invited for resubmission
- Publication of Figueiredo et al. (in prep.)
- Set up GitHub repository
- Testing basic setup of notebook across different operating systems
- Online workshops to test notebook: end of July 2021
- Writing of publication of starter kit of notebook: start in May 2021, end of June 2021
- Submission of publication of starter kit of notebook: end of July 2021
The notebook will be published as a downloadable file (an R Markdown file and a Jupyter notebook).
The objective is to have a document containing a combination of narrative text (main text of the publication) and code (analysis of results).
While the basic layout of such a notebook is relatively easy to set up, the biggest challenge is accounting for potential issues that might arise from people using them on different operating systems and with different software versions.
For that, I am currently building a minimal setup of both file types (an R Markdown file and a Jupyter notebook) and testing the best tools to build them.
The choice of such tools is more diverse in R, which offers different packages (rmarkdown and bookdown, for example).
In addition, software such as pandoc and LaTeX often generates issues related to the operating system and/or software versions.
Therefore, testing the possible combinations of versions that can successfully produce a notebook is necessary before testing it with different users, which is the next step.
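The version combinations mentioned above can be recorded before each test run. The following is a minimal sketch in Python, not part of the starter kit: it assumes that `pandoc`, `R`, and `pdflatex` are the tools of interest and that each reports its version when called with `--version`. The function names are hypothetical, and the version parsing simply takes the first dotted number in the output, which may not be the most meaningful one for every tool.

```python
import re
import shutil
import subprocess

# Tools assumed relevant for building a notebook; adjust to the actual setup.
TOOLS = ["pandoc", "R", "pdflatex"]


def extract_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


def tool_version(tool):
    """Return the version reported by `tool --version`, or None if unavailable."""
    if shutil.which(tool) is None:
        return None  # tool is not installed or not on PATH
    try:
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Some tools print version information to stderr instead of stdout.
    return extract_version(result.stdout + result.stderr)


def environment_report(tools=TOOLS):
    """Map each tool name to its detected version (None when not found)."""
    return {tool: tool_version(tool) for tool in tools}
```

Calling `environment_report()` on each test machine and saving the result next to the rendered notebook gives a simple record of which combinations built successfully and which did not.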
The development is being conducted in a repository which will be made public and attached to the Freies Wissen main GitHub account as soon as the publication is accepted.
Over the last month, I have re-assessed the use of Jupyter notebooks, due to some technical issues regarding the reproducibility of such notebooks (Pimentel et al. 2019, Wang et al. 2020).
They will still be included in the starter kit, but with a discussion of such issues, workarounds, and alternatives, such as the Pluto package for the Julia language.
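One issue commonly raised about Jupyter notebooks is that cells can be run out of order, so that the saved outputs no longer match a clean top-to-bottom run. As an illustration only (an assumption about the kind of issue meant here, not a summary of the cited studies), the sketch below reads the `execution_count` stored for each code cell in the notebook's JSON and checks whether the counts are consistent with a single top-to-bottom run. The function names are hypothetical.

```python
import json


def load_notebook(path):
    """Read a .ipynb file (which is plain JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def execution_counts(notebook):
    """Return the stored execution counts of all code cells, in document order."""
    return [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]


def ran_top_to_bottom(notebook):
    """True if the counts are 1, 2, ..., n in document order.

    This is what a single top-to-bottom run in a fresh kernel produces.
    Unexecuted cells (count None) or re-run cells make the check fail.
    """
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))
```

For a saved notebook, `ran_top_to_bottom(load_notebook("analysis.ipynb"))` (file name is a placeholder) returning `False` suggests restarting the kernel and running all cells before sharing. The check is deliberately strict, so empty code cells are also flagged.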
For R code, the best alternative is an R Notebook, for which previous work has been done to generate a reproducible workflow, notably in the form of the template package, the drake package, and the Reproducible Research Project Initialization.
All these projects overlap with my concept of what a notebook should do. My main job is therefore to combine these tools in a kit that facilitates their use by scientists (biologists and ecologists in particular) with limited experience of, and limited time to learn, such computational methods.
In parallel, I have also progressed on Wikimedia's online courses on Open Science, to complete my understanding of Open Science.
February to June 2021
Over the last four months, I have reviewed the following list of free, open-source tools aimed at improving the reproducibility of computational research in the R, Python and Julia languages:
All these tools are well-thought-out examples, but they require the user either to learn a new tool (e.g. workflowr) or to do some of the setup themselves. As stated before, for an inexperienced biologist or ecologist, this extra level of work might be enough to deter them from establishing the practice, especially if it constitutes an individual endeavor. With this in mind, I have established a minimal starter kit, relying only on R Notebooks for R and Python and on Pluto notebooks for Julia work. The kits are explained in their repository and detailed in Figueiredo et al. (in prep.), which also suggests how more advanced ones can be built once users gain confidence.
Blischak JD, Carbonetto P, and Stephens M. (2019) Creating and sharing reproducible research code the workflowr way. F1000Research, 8(1749). https://doi.org/10.12688/f1000research.20843.1
Borregaard, M. K., & Hart, E. M. (2016). Towards a more reproducible ecology. Ecography, 39(4), 349–353. https://doi.org/10.1111/ecog.02493
Pimentel, João Felipe, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. 2019. “A Large-Scale Study about Quality and Reproducibility of Jupyter Notebooks.” In Proceedings of the 16th International Conference on Mining Software Repositories, 507–17. MSR ’19. Montreal, Quebec, Canada: IEEE Press. https://doi.org/10.1109/MSR.2019.00077.
Wang, Jiawei, Tzu-yang Kuo, Li Li, and Andreas Zeller. 2020. “Restoring Reproducibility of Jupyter Notebooks.” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings, 288–89. ICSE ’20. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3377812.3390803.
- Name: Ludmilla Figueiredo
- Institution: Universität Würzburg
- Contact: firstname.lastname@example.org