The importance of open data science tools in science: a list of references

by Julie Lowndes

Open data science is the combination of open science, data science, and team science. We promote and practice this with the Ocean Health Index by using software tools that make our work transparent, reproducible, and shared publicly online. These concepts are being discussed more and more in the news and in the academic literature. There are so many great articles on these topics that we have started this list, pairing each article with a quote that gives a sense of its content.

We are trying to make this list more comprehensive; please help! You can suggest additions on Twitter at @OHIScience, or through a pull request on GitHub. We will be organizing the list as it grows. Thanks for your help!

From the literature

  • Scientific computing: Code alert (Baker, 2017, Nature)
    • “It can be ‘really intimidating’ to learn a programming language, but the long-term benefits are well worth the effort”
  • 1,500 scientists lift the lid on reproducibility (Baker, 2016, Nature)
    • “Survey sheds light on the ‘crisis’ rocking research.”
  • My digital toolbox: Julia Stewart Lowndes (Perkel, 2017, Nature)
    • “We built the OHI Toolbox with R, RStudio, Git, and GitHub, which enable our work to be more reproducible and our collaboration more streamlined…And what is extremely powerful is that we also use these same tools and workflow for communication, creating static and interactive documents, presentations, and our website, ohi-science.org.”
  • Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators (Barone et al., 2017, bioRxiv)
    • “…the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life.”
  • A manifesto for reproducible science (Munafò et al., 2016, Nature Human Behaviour)
    • “Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery.”
  • Democratic databases: science on GitHub (Perkel, 2016, Nature)
    • “Scientists are turning to a software–development site to share data and code.”
  • Programming tools: Adventures with R (Tippmann, 2015, Nature)
    • “A guide to the popular, free statistics and visualization software that gives scientists control of their own data analysis.”
  • How scientists use Slack (Perkel, 2016, Nature)
    • “Eight ways labs benefit from the popular workplace messaging tool.”
  • ‘Boot camps’ teach scientists computing skills (Van Noorden, 2014, Nature)
    • “The director of the Mozilla Science Lab discusses its course on scientific computing together with researchers who have taken the training….The project is called ‘Software Carpentry’.”
  • My digital toolbox: Nuclear engineer Katy Huff on version-control systems (Tippmann, 2014, Nature)
    • “Git and GitHub are the ‘laboratory notebook of scientific computing’….Rather than inventing a complex naming scheme for new versions of code, the best practice in software development is to allow the version-control system to track changes over time.”
  • Git can facilitate greater reproducibility and increased transparency in science (Ram, 2013, Source Code for Biology and Medicine)
    • “Version control systems (VCS), which have long been used to maintain code repositories in the software industry, are now finding new applications in science.”
  • My digital toolbox: Ecologist Ethan White on interactive notebooks (Van Noorden, 2014, Nature)
    • “I learned about the IPython notebook in early 2012, and was immediately hooked. The first time I opened one up, it was clear that this tool was going to change the way I worked. I have been using it for both teaching and research ever since. The IPython notebook is a free tool that lets you combine formatted text, code, and the figures and tables that code generates, in a single document.”
  • My digital toolbox: Ecologist Christie Bahlai talks data hygiene (Tippmann, 2014, Nature)
    • “In the first of a regular series, Christie Bahlai, an ecologist at Michigan State University in East Lansing, discusses the software and tools she finds most useful in her research.”
  • Don’t Fear the Command Line! (Troyanskaya, 2011, Cell)
    • “Although basic computing skills are routine for most biologists, most of us still struggle with more sophisticated tasks, beyond the ‘out of the box’ solutions.”
  • How open science helps researchers succeed (McKiernan et al., 2016, eLife)
    • “…open research practices bring significant benefits to researchers relative to more traditional closed practices.”
  • Five ways consortia can catalyse open science (Cutcher-Gershenfeld et al., 2017, Nature)
    • “Over the past four years, we have studied more than a dozen scientific consortia involved in data sharing, and we’ve mapped the landscape of these and other such initiatives. When they work well, consortia act as catalysts, to accomplish what members cannot do alone. But scientists are seldom taught effective strategies to design and manage such coalitions. Here we distil the lessons from our fieldwork into five ways to foster open science.”
  • Why scientists must share their research code (Baker, 2016, Nature)
    • “Nature spoke to [data scientist] Stodden about computational reproducibility and the emerging norms of sharing data and code.”
  • Publish your computer code: It is good enough (Barnes, 2010, Nature)
    • “Freely provided working code — whatever its quality — improves programming and enables others to engage with your research”
  • Gene name errors are widespread in the scientific literature (Ziemann et al., 2016, Genome Biology)
    • “The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.”
  • Open data (Schimel, 2017, Frontiers in Ecology and the Environment)
    • “Ecologists are more and more interested in using larger datasets to complement field studies, or even depending on them as their primary research tool.”
  • Skills and Knowledge for Data-Intensive Environmental Research (Hampton et al., 2017, BioScience)
    • “By proactively addressing the training challenge at a time when the field of data science is still young, environmental scientists will not only guide the environmental research questions but also guide the field toward a culture that is collaborative and inclusive.”
  • Running an open experiment: transparency and reproducibility in soil and ecosystem science (Bond-Lamberty et al., 2016, IOP Science)
    • “Here we describe a recent ‘open experiment’, in which we documented every aspect of a soil incubation online, making all raw data, scripts, diagnostics, final analyses, and manuscripts available in real time. We found that using tools such as version control, issue tracking, and open-source statistical software improved data integrity, accelerated our team’s communication and productivity, and ensured transparency.”
  • Computational reproducibility in archaeological research (Marwick, 2017, JAMT)
    • “Four general principles of reproducible research that have emerged in other fields are presented. An archaeological case study is described that shows how each principle can be implemented using freely available software.”

From the news

  • How computers broke science – and what we can do to fix it (Marwick, 2015, The Conversation)
    • “What’s unique about the role of the computer is that we have a solution to the problem. We have clear recommendations for mature tools and well-tested methods borrowed from computer science research to improve the reproducibility of research done by any kind of scientist on a computer. With a small investment of time to learn these tools, we can help restore this cornerstone of science.”
  • For big-data scientists, ‘janitor work’ is key hurdle to insights (Lohr, 2014, The New York Times)
    • “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
  • Want to Make It as a Biologist? Better Learn to Code (Dreyfuss, 2017, WIRED)
    • “When I asked a handful of post-doc biologists eating brunch in Boston last week how many were teaching themselves to code, every hand went up. They all realized that their curriculum was missing a core element, and they’ve set about rectifying the omission—on their own.”
  • Awash in Sea of Data, Ecologists Turn to Open Access Tools (Rennie, 2017, Quanta and WIRED)
    • “Now an essay published this week by Julia S. Stewart Lowndes of NCEAS and her colleagues about how the OHI team quietly overcame its ungainly data problem offers an interesting case study in how macrosystems ecology projects — and even more modestly focused research — can benefit from an open access makeover. Their story also offers a how-to for researchers who might like to follow their example.”
  • For Modern Astronomers, It’s Learn to Code or Get Left Behind (Scoles, 2017, WIRED)
    • “You need the materials upon which the experiment was performed, and you need the tools. Code is the equivalent of our beakers and Bunsen burners…Instead of applying increasingly refined algorithms to their research problems, ill-trained astronomer-coders sometimes spend their time reinventing the wheel….”