Gabe Becker
Twitter: @groundwalkergmb, GitHub: gmbecker
^ I collaborate with, but am not a member of, the R-core development team.
Formerly Scientist at Genentech Research
Some content in this talk was developed (by me) while at, and is copyright Genentech, Inc. Used with permission.
And none of those claims were true^
^To the best of our current knowledge
Results we can understand and feel confident using (Gavish)
1 Gavish and Donoho, A Universal Identifier for Computational Results, Procedia Computer Science 4, 2011
Source: Gabriel Becker, copyright Genentech Inc.
It confirms^ the original analyst(s)
And ensures you have the artifact itself
^ Technically it does not confirm these in the deductive sense; rather, it establishes them in the "beyond a reasonable doubt" sense.
Very similar (at least) to those gained by manual, strict reproduction,
without requiring us to actually recreate the result at an arbitrary later date.
Reproducibility is great, but by itself it is neither as necessary nor as sufficient as many seem to think.
– Me (and, like, other smart people too probably)
credit: Wickham
^ By how much depends on why the old method fell out of favor
The Currency concern is very real in Bioinformatics, Deep Learning, and other fast-moving settings. It's not really a concern when using a "classical" method (GLMs, Random Forests, etc.).
[The resulting document] should describe results and lessons learned … as well as a means to reproduce all steps, even those not used in a concise reconstruction, which were taken in the analysis.
– Rossini, Literate Statistical Practice (emphasis mine)
the source or origin of an object; its history and pedigree; a record of the ultimate derivation and passage of an item through its various owners
– Oxford English Dictionary
Paraphrase of Freire et al. in Provenance for Computational Tasks: A Survey, Computing in Science & Engineering, 2008
In theory there is no difference between theory and practice; in practice there is.
– Unattributed
We can
thesis_final_revised_final_v2.Rmd
mydata_BRAF_mut_only.dat
mydata_as_of_2018_03_03.dat
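Filenames like these encode provenance informally, and that information degrades fast. A minimal sketch of the alternative, assuming a hypothetical raw file and invented column names: derive the subset in a script, so the provenance *is* the code.

```r
# Hypothetical sketch: instead of encoding provenance in a filename like
# "mydata_BRAF_mut_only.dat", express the derivation in code.
# File name and column names here are invented for illustration.
library(dplyr)

mydata <- read.csv("mydata.csv")  # hypothetical raw data file

# The subsetting criteria live in the script, not the filename
braf_mut <- mydata %>%
  filter(gene == "BRAF", mutation_status == "mut")

# If a snapshot file is still needed, write it from the same script,
# so anyone can see exactly how it was produced
write.csv(braf_mut, "braf_mut.csv", row.names = FALSE)
```

The point is not the particular verbs: any scripted, re-runnable derivation beats a filename as a provenance record.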
This gets trust in reporting taken care of^ right away, and it's super easy.
^absent actual misconduct, i.e., editing output files manually
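For R users this usually means a dynamic document. A toy sketch (built on the standard `mtcars` dataset; the chunk and label names are arbitrary) showing why reported numbers can't silently drift from the computation:

````markdown
```{r model, echo=FALSE}
# Computation happens at render time
fit   <- lm(mpg ~ wt, data = mtcars)
slope <- round(coef(fit)[["wt"]], 2)
```

Each additional 1000 lbs of weight is associated with a change of
`r slope` mpg in this toy model.
````

Because the value in the prose is computed inline when the document is rendered, the only way for the report to disagree with the analysis is to edit the rendered output by hand.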
You must go watch or read Joel Grus' fantastic talk first
Watch: https://www.youtube.com/watch?v=7jiPeIFXb6U
Read: https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/
Authors of only ~44% (36%, 50%) of papers in Science shared both code and data
When to share data is a tricky question.
But answers of "never" and "only once it's obsolete/irrelevant" make you the villain of the piece.
As if you, personally, will need to understand, evaluate, use, and extend your result in 5 years, working only from published materials, having lost all personal materials.
Crucial for both multi-analyst collaboration and for strict reproduction
Remember to think about currency
Check out http://mybinder.org
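Binder can launch an R environment straight from a public repository. A minimal, hypothetical setup, two small files at the repository root (the version, date, and package names below are examples, not requirements):

```text
# runtime.txt -- pins the R version and a date-stamped package snapshot
r-4.1-2022-01-10

# install.R -- run once at image build time to install needed packages
install.packages(c("dplyr", "ggplot2"))
```

With those in place, anyone can open a live session against your code without installing anything locally, which also helps with the currency concern above.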
(Even if imperfectly)
Not a simple issue, but ask yourself: how can a result be useful to people who can't even read about it?
And whether/how you can
But it is most usefully viewed within a larger, more nuanced context.
And don't be assholes to others who do the same. Even when you find problems in it.
You'd be able to trust it if it came out of a lab you don't know.
Gentleman and Temple Lang, Statistical Analyses and Reproducible Research, Bioconductor Working Papers, 2004
Marwick, Boettiger and Mullen, Packaging Data Analytical Work Reproducibly Using R (and Friends), The American Statistician, 2018
Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data, 2016
https://www.force11.org/group/fairgroup/fairprinciples
Dunning, de Smaele and Böhmer, Are the FAIR Data Principles fair? International Journal of Digital Curation, 2017
Marwick, Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation, Journal of Archaeological Method and Theory, 2016
FitzJohn, Pennell, Zanne and Cornwell, Reproducible research is still a challenge, ROpenSci Blog, 2014 https://ropensci.org/blog/2014/06/09/reproducibility/
ROpenSci, Reproducibility In Science, http://ropensci.github.io/reproducibility-guide/
Basically everything Victoria Stodden has ever published.
Seriously, just go read it (at least the abstracts)
Becker, Moore and Lawrence, trackr: A Framework for Enhancing Discoverability and Reproducibility of Data Visualizations and Other Artifacts in R, Journal of Computational and Graphical Statistics, 2019
Biecek and Kosiński, archivist: An R Package for Managing, Recording and Restoring Data Analysis Results, Journal of Statistical Software, 2017
Landau, The drake R package: a pipeline toolkit for reproducibility and high-performance computing. The Journal of Open Source Software, 2018