Enabling discoverability of results in R

Gabriel Becker(@groundwalkergmb), Sara Moore, Michael Lawrence

Created: 2017-07-31 Mon 08:40

1 So you have an awesome result …

1.1 That's great

No, really. Good results are hard to find.

2 Two questions

2.1 Is it reproducible?

2.2 How will someone find it?

  1. Your collaborators now
  2. You in a year

3 Let's talk Discoverability

3.1 Discoverability is

the ability to

  • discover the existence of,
  • locate, and
  • retrieve in a useful form

Research, results, and computational artifacts

4 Super-flashy live demo time!

Note: Side-by-side R session and open RSS feed. plot a plot and see it show up 'magically' in the rss feed. Then search for a previous one and grab its code from the metadata stored in the db

5 How do we discover things now?

  • Pubmed/Google Scholar/the library
  • Figshare/Rpubs/Github + nbviewer
  • Email authors/dig around on your hdd

6 It's all about the (meta)data

6.1 Figshare knows it

  • Q: how discoverable is my research?
  • A: It depends on what metadata you tag it with.

From Figshare FAQ (paraphrase)

6.2 Where does metadata come from?

unclesam.jpg

6.3 That's tedious and you probably won't do it

I know

6.4 Sources of automatically inferable metadata

We can infer useful metadata about an object from the

  1. computing environment
  2. object
  3. script
  4. data being analyzed/transformed

6.5 The computing environment

  • Who made the object
  • When
  • What packages were used/loaded

6.6 The object

R objects are necessarily self-describing

  • names, dim, element classes on a data.frame
  • levels on a factor
  • Generally content of the object
    • think dput

6.7 The code

  • Perfect low-level description of
    • how the object was made
    • what it is
  • Often strong hints of
    • goal of analysis
    • thoughts/assumptions of analyst

6.8 The data

What

  • data was analyzed?
  • aspects of the dataset were used?
  • transformations were applied

7 Plots

7.1 How we make plots

questiontoplot.png

7.2 Plot design/choice tells us

  • What analyst thought was important
  • What relationships s/he was looking for

7.3 Plots

pricedens.png carathist.png

7.4 Plots

pricevcarat.png

7.5 Plots

pricevcaratgroup.png pricevcaratgroupfac.png

7.6 Flipping the script

findingplot.png

7.7 Metadata about plots

  • geom/plotting function used
  • vars plotted
  • conditioning/grouping vars
    • levels
  • Titles, axis labels, legend, other text

8 Enough 'theory'! What does it do?

8.1 Store computational artifacts

In an

  • Annotated
  • Searchable
  • Retrievable

Form

8.2 Automatically generate

Low-level semantic annotations

  • Code
    • Tracked automatically
  • Summaries of object

9 How do can you use it?

9.1 Mostly just go about your business

library(recordr)
##
## Your analysis code here
##
record(myplot)

9.2 Code tracking

  • Code tracked automatically while pkg is loaded
    • All recorded objects have full code-provenance

9.3 Automatically recording objects

  • autorecord function, records
    • all lattice/ggplot2 plots when drawn
    • any object passed to summary()

9.4 finding recorded objects

res = vtSearch("mtcars")
res2 = vtSearch("Awesome")

10 Sure, but why?

10.1 Collaboration/reporting

  • RSS feed of plots/results in real time
  • Searchable web frontend
  • Distribution mechanism

10.2 Automatic annotation

  • Plot.ly and figshare support tags
    • But don't generate them!

10.3 Organizational efficiency

  • Has someone in my org already studied this data?
    • what did they find?
    • Maybe I should collab with them.
  • And vice versa

10.4 Organizational safety

  • I have a result and I'm about to make $100M bet on it
    • Are there 9 historical results that disagree?

10.5 Your own sanity

  • Where the @$#% did I put that model/plot/etc I need for this paper?

11 History tracking

11.1 histry package

  • Start tracking code you run when it is loaded
  • R session or knitting documents

11.2 Demo

  • History in a knitr document

12 Availability

12.1 Open source and on Github

13 Acknowledgements

  • Sara Moore
  • Michael Lawrence
  • Biecek et al's archivist package