This article was published in May 2017 in Nature Ecology & Evolution (DOI: 10.1038/s41559-017-0160). Below is the full text of the article (source repository).
Julia S. Stewart Lowndes1*, Benjamin D. Best2, Courtney Scarborough1, Jamie C. Afflerbach1, Melanie R. Frazier1, Casey C. O’Hara1, Ning Jiang1, Benjamin S. Halpern1,3,4
1 National Center for Ecological Analysis and Synthesis, University of California at Santa Barbara, Santa Barbara, CA, United States
2 EcoQuants.com, Santa Barbara, CA, United States
3 Bren School of Environmental Science & Management, University of California, Santa Barbara, CA, United States
4 Silwood Park Campus, Imperial College London, Ascot, United Kingdom
*corresponding author: lowndes@nceas.ucsb.edu
Reproducibility has long been a tenet of science but has been challenging to achieve — we learned this the hard way when our old approaches proved inadequate to efficiently reproduce our own work. Here we describe how several free software tools have fundamentally upgraded our approach to collaborative research, making our entire workflow more transparent and streamlined. By describing specific tools and how we incrementally began using them for the Ocean Health Index project, we hope to encourage others in the scientific community to do the same — so we can all produce better science in less time.
collaboration, data science, Ocean Health Index, open science, reproducibility, transparency
Science, now more than ever, demands reproducibility, collaboration, and effective communication to strengthen public trust and effectively inform policy. Recent high-profile difficulties in reproducing and repeating scientific studies have put the spotlight on psychology and cancer biology (Baker 2015; Baker and Dolgin 2017; Open Science Collaboration 2015), but it is widely acknowledged that reproducibility challenges persist across scientific disciplines (Baker 2016a; Aschwanden 2015; Buck 2015). Environmental scientists face potentially unique challenges in achieving goals of transparency and reproducibility because they rely on vast amounts of data spanning natural, economic, and social sciences that create semantic and synthesis issues exceeding those for most other disciplines (Frew and Dozier 2012; Jones et al. 2006; Michener and Jones 2012). Furthermore, proposed environmental solutions can be complex, controversial, and resource intensive, increasing the need for scientists to work transparently and efficiently with data to foster understanding and trust.
Environmental scientists are expected to work effectively with ever-increasing quantities of highly heterogeneous data even though they are seldom formally trained to do so (Check Hayden 2013; Boettiger et al. 2015; G. Wilson et al. 2016; G. V. Wilson 2006b; Baker 2017). This was recently highlighted by a survey of 704 US National Science Foundation principal investigators in the biological sciences that found training in data skills to be the largest unmet need (Barone, Williams, and Micklos 2017). Without training, scientists tend to develop their own bespoke workarounds to keep pace, but with this comes wasted time struggling to create their own conventions for managing, wrangling, and versioning data. If done haphazardly or without a clear protocol, these efforts are likely to result in work that is not reproducible — by the scientist’s own ‘future self’ or by anyone else (G. Wilson et al. 2016). As a team of environmental scientists tasked with reproducing our own science annually, we experienced this struggle first-hand. When we began our project, we worked with data in the same way as we always had, taking extra care to make our methods reproducible for planned future re-use. But when we began to reproduce our workflow a second time and repeat our methods with updated data, we found our approaches to reproducibility were insufficient. However, by borrowing philosophies, tools, and workflows primarily created for software development, we have been able to dramatically improve the ability for ourselves and others to reproduce our science, while also reducing the time involved to do so: the result is better science in less time.
Here we share a tangible narrative of our transformation to better science in less time — meaning more transparent, reproducible, collaborative, and openly shared and communicated science — with an aim of inspiring others. Our story is only one potential path because there are many ways to upgrade scientific practices — whether collaborating only with your ‘future self’ or as a team — and they depend on the shared commitment of individuals, institutions, and publishers (Wolkovich, Regetz, and O’Connor 2012; Buck 2015; Nosek et al. 2015). We do not review the important, ongoing work regarding data management architecture and archiving (Reichman, Jones, and Schildhauer 2011; Jones et al. 2006), workflows (Shade and Teal 2015; Goodman et al. 2014; Boettiger et al. 2015; Sandve et al. 2013), sharing and publishing data (White et al. 2013; Kervin, Michener, and Cook 2013; Lewandowsky and Bishop 2016; Michener 2015) and code (Mislan, Heer, and White 2016; Kratz and Strasser 2014; Michener 2015), or how to tackle reproducibility and openness in science (Munafò et al. 2017; Martinez et al. 2014; Tuyl and Whitmire 2016; Baker 2016b; Kidwell et al. 2016). Instead, we focus on our experience, because it required changing the way we had always worked, which was extraordinarily intimidating. We give concrete examples of how we use tools and practices from data science, the discipline of ‘turning raw data into understanding’ (Wickham and Grolemund 2016). It was out of necessity that we began to engage in data science, which we did incrementally by introducing new tools, learning new skills, and creating deliberate workflows — all while maintaining annual deadlines. Through our work with academics, governments, and non-profit groups around the world, we have seen that the need to improve practices is common if not ubiquitous. In this narrative we describe specific software tools, why we use them, how we use them in our workflow, and how we work openly as a collaborative team. In doing so we underscore two key lessons we learned that we hope encourage others to incorporate these practices into their own research. The first is that powerful tools exist and are freely available to use; the barriers to entry seem to be exposure to relevant tools and building confidence using them. The second is that engagement may best be approached as an evolution rather than as a revolution that may never come.
The Ocean Health Index (OHI) operates at the interface of data-intensive marine science, coastal management and policy, and now, data science (Lowndes et al. 2015; Lowndes 2017). It is a scientific framework to quantify ocean-derived benefits to humans and to help inform sustainable ocean management using the best available information (Halpern et al. 2012; Halpern et al. 2015). Assessments using the OHI framework require synthesising heterogeneous data from nearly one hundred different sources, ranging from categorical tabular data to high-resolution rasters. Methods must be reproducible, so that others can produce the same results, and also repeatable, so that newly available data can be incorporated in subsequent assessments. Repeated assessments using the same methods enable quantifiable comparison of changes in ocean health through time, which can be used to inform policy and track progress (Lowndes et al. 2015).
Using the OHI framework, we lead annual global assessments of 220 coastal nations and territories, completing our first assessment in 2012 (Halpern et al. 2015). Despite our best efforts, we struggled to efficiently repeat our own work during the second assessment in 2013 because of our approaches to data preparation (Halpern et al. 2015). Data preparation is a critical aspect of making science reproducible but is seldom explicitly reported in research publications; we thought we had documented our methods sufficiently in 130 pages of published supplemental materials (Halpern et al. 2012), but we had not.
However, by adopting data science principles and freely available tools that we describe below, we began building an OHI ‘Toolbox’ and fundamentally changed our approach to science (Figure 1). The OHI Toolbox provides a file structure, data, code, and instruction, operates across computer operating systems, and is shared online for free so that anyone can begin building directly from previous OHI assessments without reinventing the wheel (Lowndes et al. 2015). While these changes required an investment of our team’s time to learn and develop the necessary skills, the pay-off has been substantial. Most significantly we are now able to share and extend our workflow with a growing community of government, non-profit, and academic collaborations around the world that use the OHI for science-driven marine management. There are currently two dozen OHI assessments underway, most of which are led by independent groups (Lowndes et al. 2015), and the Toolbox has helped lower the barriers to entry. Further, our own team has just released the fifth annual global OHI assessment (Index 2016a) and continues to lead assessments at smaller spatial scales, including the Northeastern United States, where the OHI is included in President Obama’s first Ocean Plan (Goldfuss and Holdren 2016).
For the first global OHI assessment in 2012 we employed an approach to reproducibility that is standard to our field, which focused on scientific methods, not data science methods (Halpern et al. 2012). Data from nearly one hundred sources were prepared manually — i.e., without coding, typically in Microsoft Excel — which included organising, transforming, rescaling, gap-filling, and formatting data. Processing decisions were documented primarily within the Excel files themselves, emails, and Microsoft Word documents. We programmatically coded models and meticulously documented their development (resulting in the 130-page supplemental materials) (Halpern et al. 2012), and upon publication, we also made the model inputs (i.e., prepared data and metadata) freely available to download. This level of documentation and transparency is beyond the norm for environmental science (Wolkovich, Regetz, and O’Connor 2012; Stephanie E. Hampton et al. 2015).
We also worked collaboratively in the same ways we always had. Our team included scientists and analysts with diverse skill sets and disciplines, and we had distinct, domain-specific roles assigned to scientists and to a single analytical programmer. Scientists were responsible for developing the models conceptually, preparing data, and interpreting modeled results, and the programmer was responsible for coding the models. We communicated and shared files frequently, with long, often-forwarded, and vaguely-titled email chains (e.g., `Re: Fwd: data question`) with manually versioned data files (e.g., `data_final_updated2.xls`). All team members were responsible for organising those files with their own conventions on their local computers. Final versions of prepared files were stored on the servers and used in models, but records of the data processing itself were scattered.
Upon beginning the second annual assessment in 2013, we realised that our approach was insufficient since it took too much time and relied heavily on individuals’ data organisation, email chains, and memory — particularly problematic as original team members moved on and new team members joined. We quickly realised we needed a nimble and robust approach to sharing data, methods, and results within and outside our team — we needed to completely upgrade our workflow.
As we began the second global OHI assessment in 2013 we faced challenges across three main fronts: 1) reproducibility, including transparency and repeatability, particularly in data preparation; 2) collaboration, including team record keeping and internal collaboration; and 3) communication with scientific and broader communities. Environmental scientists are increasingly using `R` (Boettiger et al. 2015) because it is free, cross-platform, and open source, and also because of the training and support provided by developers (Wickham and Grolemund 2016) and independent groups (G. Wilson et al. 2016; Mills 2015) alike. We decided to base our work in `R` (T. R. C. Team 2016) and RStudio (Rs. Team 2016b) for coding and visualisation, Git (G. Team 2016) for version control, GitHub (GitHub 2016) for collaboration, and a combination of GitHub and RStudio for organisation, documentation, project management, online publishing, distribution, and communication (Table 1). These tools can help scientists organise, document, version, and easily share data and methods, thus not only increasing reproducibility but also reducing the amount of time involved to do so (G. V. Wilson 2006a; Broman 2016; Baker 2017). Many available tools are free so long as work is shared publicly online, which enables open science, defined by Hampton et al. (Stephanie E. Hampton et al. 2015) as “the concept of transparency at all stages of the research process, coupled with free and open access to data, code, and papers”. When integrated into the scientific process, data science tools that enable open science — let’s call them “open data science” tools — can help realise reproducibility in collaborative scientific research (Wolkovich, Regetz, and O’Connor 2012; Buck 2015; Stephanie E. Hampton et al. 2015; McKiernan et al. 2016; Seltenrich 2016).
Open data science tools helped us upgrade our approach to reproducible, collaborative, and transparent science, but they did require a substantial investment to learn, which we made incrementally over time (Figure 1; Box 1). Before this evolution, most team members with any coding experience — not necessarily in `R` — had learned just enough to accomplish whatever task was before them, using their own unique conventions. Given the complexity of the OHI project, we needed to learn to code collaboratively and incorporate best (G. Wilson et al. 2014; Haddock and Dunn 2011) or good enough practices (G. Wilson et al. 2016; Barnes 2010) into our coding, so that our methods could be co-developed and vetted by multiple team members. Using a version control system not only improved our file and data management, but allowed individuals to feel less inhibited about their coding contributions, since files could always be reverted to previous versions if there were problems. We built confidence using these tools by sharing our imperfect code, discussing our challenges, and learning as a team. These tools quickly became the keystone of how we work, and have overhauled our approach to science, perhaps as much as email did in decades prior. They have changed the way we think about science and about what is possible. The following describes how we have been using open data science practices and tools to overcome the biggest challenges we encountered to reproducibility, collaboration, and communication.
Our first priority was to code all data preparation, create a standard format for final data layers, and do so using a single programmatic language, `R` (T. R. C. Team 2016). Code enables us to reproduce the full process of data preparation, from data download to final model inputs (Halpern et al. 2015; Frazier, Longo, and Halpern 2016), and a single language makes it more practical for our team to learn and contribute collaboratively. We code in `R` and use RStudio (Rs. Team 2016b) to power our workflow because it has a user-friendly interface and built-in tools useful for coders of all skill levels, and, importantly, it can be configured with Git to directly sync with GitHub online. We have succeeded in transitioning to `R` as our primary coding language for data preparation, including for spatial data, although some operations still require additional languages and tools such as ArcGIS, QGIS, and Python (ESRI 2016; T. Q. Team 2016; T. P. Team 2016).
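As a minimal sketch of what a coded data-preparation step can look like (the file paths and column names here are illustrative, not our actual layers), a script reads a raw source file, tidies it, and writes a final data layer in a standard format:

```r
# Illustrative sketch of a coded data-preparation step (file paths and
# column names are hypothetical): read a raw source file, tidy it, and
# save a final data layer in a standard format.
library(readr)
library(dplyr)

raw <- read_csv("raw/source_data.csv")        # data as downloaded

layer <- raw %>%
  rename(rgn_id = country_code) %>%           # harmonise region identifiers
  filter(!is.na(value)) %>%                   # drop missing records
  select(rgn_id, year, value)                 # standard layer format

write_csv(layer, "layers/example_layer.csv")  # final model input
```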
All our code is underpinned by the principles of tidy data, the grammar of data manipulation, and the `tidyverse` `R` packages developed by Wickham (Wickham and Grolemund 2016; Wickham 2014, 2017, 2016). This deliberate philosophy for thinking about data helped bridge our scientific questions with the data processing required to get there, and the readability and conciseness of `tidyverse` operations makes our data analysis read more as a story arc. Operations require less syntax — which can mean fewer potential errors that are easier to identify — and they can be chained together, minimising intermediate steps and data objects that can cause clutter and confusion (Wickham and Grolemund 2016; Fischetti 2014). `tidyverse` tools for wrangling data have expedited our transformation as coders and made `R` less intimidating to learn. We heavily rely on a few packages for data wrangling and visualisation that are bundled in the `tidyverse` package (Wickham 2016, 2017) — particularly `dplyr`, `tidyr`, and `ggplot2` — as well as accompanying books, cheatsheets, and archived webinars (Box 1).
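As a small, self-contained illustration of this chained style (using a built-in example dataset rather than OHI data), a handful of `tidyverse` verbs piped together reads much like the analysis it performs:

```r
library(dplyr)
library(ggplot2)

# Chain verbs with the pipe so the analysis reads as a story arc:
# take the built-in mtcars data, group it by cylinder count, summarise
# fuel efficiency, then plot -- with no intermediate objects as clutter.
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Number of cylinders", y = "Mean miles per gallon")
```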
We keep detailed documentation describing metadata (e.g., source, date of access, links) and data processing decisions — trying to capture not only the processing we decided to do, but what we decided against. We started with small plain-text files accompanying each `R` file, but have transitioned to documenting with R Markdown (Rs. Team 2016a; Allaire et al. 2016) because it combines plain text and executable chunks of `R` code within the same file and serves as a living lab notebook. Every time R Markdown output files are regenerated the `R` code is rerun, so the text and figures will also be regenerated and reflect any updates to the code or underlying data. R Markdown files increase our reproducibility and efficiency by streamlining documentation and eliminating the need to constantly paste updated figures into reports as they are developed.
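A minimal R Markdown file (with hypothetical content) looks like the following; every time it is rendered, the chunk reruns and the figure regenerates:

````markdown
---
title: "Example data-prep notebook (hypothetical)"
output: html_document
---

We accessed the source data on 2016-01-15 and decided against
gap-filling provisional records; the reasoning is documented here
alongside the code that implements it.

```{r example-layer-plot}
# Rerun on every render, so this figure always reflects the
# current code and underlying data.
layer <- readr::read_csv("layers/example_layer.csv")
plot(layer$year, layer$value)
```
````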
**R functions and packages**

Once the data are prepared, we develop assessment-specific models to calculate OHI scores. Models were originally coded in multiple languages to accommodate disparate data types and formatting. By standardising our approach to data preparation and final data layer format, we have been able to translate all models into `R`. In addition to assessment-specific models, the OHI framework includes core analytical operations that are used by all OHI assessments (Lowndes et al. 2015), and thus we created an `R` package called `ohicore` (Index 2016b), which was greatly facilitated by the `devtools` and `roxygen2` packages (Wickham 2015; Wickham and Chang 2016; Wickham, Danenberg, and Eugster 2015). The `ohicore` package is maintained in and installed from a dedicated GitHub repository — using `devtools::install_github("ohi-science/ohicore")` — from any computer with `R` and an internet connection, enabling groups leading independent OHI assessments to use it for their own work (Lowndes et al. 2015).
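In practice that installation is just a few lines of `R`; here is the command named above in context (a sketch, assuming `devtools` is already installed):

```r
# Install ohicore directly from its GitHub repository, then load it;
# this works from any computer with R and an internet connection.
library(devtools)
install_github("ohi-science/ohicore")
library(ohicore)
```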
We use Git (G. Team 2016) as a version control system. Version control systems track changes within files and allow you to examine or rewind to previous versions. This saves time that would otherwise be spent duplicating, renaming, and organising files to preserve past versions. It also makes folders easier to navigate since they are no longer overcrowded with multiple files suffixed with dates or initials (e.g., `final_JL-2012-02-26.csv`) (Ram 2013; Blischak, Davenport, and Wilson 2016; Perez-Riverol et al. 2016). Once Git is configured on each team member’s machine, they work as before but frequently commit, saving a snapshot of their files along with a human-readable “commit message” (Ram 2013; Blischak, Davenport, and Wilson 2016). Any line modified in a file tracked by Git will then be attributed to that user.
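These snapshot operations can even be scripted from `R` with the `git2r` package listed in Table 1; a minimal sketch (the repository path, file name, and message here are hypothetical):

```r
library(git2r)

# Open an existing local repository, stage a changed file, and commit a
# snapshot with a human-readable message (all names are hypothetical).
repo <- repository("~/github/ohi-global")
add(repo, "layers/example_layer.csv")
commit(repo, message = "Recompute example layer with updated source data")

# List recent snapshots, each attributed to its author.
commits(repo)
```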
We interface with Git primarily through RStudio, using the command line for infrequently encountered tasks. Using RStudio to interact with Git was key for our team’s uptake of a version control system, since the command line can be an intimidating hurdle or even a barrier for beginners to get onboard with using version control. We were less resistant because we could use a familiar interface, and as we gained fluency in Git’s operations through RStudio we translated that confidence to the command line.
Our team developed conventions to standardise the structure and names of files to improve consistency and organisation. Along with the GitHub workflow (see Collaboration section), having a structured approach to file organisation and naming has helped those within and outside our team navigate our methods more easily. We organise parts of the project in folders that are both RStudio “projects” and GitHub “repositories”, which has also helped us collaborate using shared conventions rather than each team member spending time duplicating and organising files.
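An illustrative layout (not our exact structure) shows how one repository doubles as an RStudio project with predictable folder names:

```
ohi-global/                  # one GitHub repository = one RStudio project
├── ohi-global.Rproj         # keeps file paths portable across computers
├── README.md                # orientation for collaborators
├── prep/                    # data-preparation scripts and R Markdown files
├── layers/                  # final data layers in a standard format
└── reports/                 # rendered documents and figures
```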
We transitioned from a team of distinct roles (scientists-and-programmer) to becoming a team with overlapping skill sets (scientists-as-programmers, or simply, data scientists). Having both environmental expertise and coding skills in the same person increases project efficiency, enables us to vet code as a team, and reduces the bottleneck of relying on a single programmer. We, like Duhigg (Duhigg 2016), have found that “groups tend to innovate faster, see mistakes more quickly and find better solutions to problems”. Developing these skills and creating the team culture around them requires leadership with the understanding that fostering more efficient and productive scientists is worth the long-term investment. Our team had the freedom to experiment with available tools and their value was recognised with a commitment that we, as a team, would adopt and pursue these methods further. In addition to supportive leadership, having a “champion” with experience of how tools can be introduced over time and interoperate can expedite the process, but is not the only path (Box 2). Taking the time to experiment and invest in learning data science principles, tools, and skills enabled our team to establish a system of best practices for developing, using, and teaching the OHI Toolbox.
GitHub is one of many web-based platforms that enable files tracked with Git to be collaboratively shared online so contributors can keep their work synchronised (GitHub 2016; Blischak, Davenport, and Wilson 2016; Perez-Riverol et al. 2016), and it is increasingly being adopted by scientific communities for project management (J. Perkel 2016). Versioned files are synced online with GitHub similar to the way Dropbox operates, except syncs require a committed, human-readable message and reflect deliberate snapshots of changes that are attributed to the user, line-by-line, through time. Built for large, distributed teams of software developers, GitHub provides many features that we, as a scientific team new to data science, do not immediately need, and thus we mostly ignore features such as branching, forking, and pull requests. Our team uses a simplified GitHub workflow whereby all members have administrative privileges to the repositories within our `ohi-science` organisation. Each team member is able to sync their local work to GitHub.com, making it easier to attribute contributions, as well as to identify to whom to direct questions.
GitHub is now central to many facets of our collaboration as a team and with other communities — we use it along with screensharing to teach and troubleshoot with groups leading OHI assessments, as well as to communicate our ongoing work and final results (see Communication section below). Very few files are now emailed back and forth within our team, since we all have access to all repositories within the `ohi-science` organisation and can navigate to and edit whatever we need. Additionally, these organised files are always found at the same file path, whether on GitHub.com or on someone’s local computer; this, along with RStudio `.Rproj` files, eases the file path problems that can plague collaborative coding and frustrate new coders.
We use a feature of GitHub called ‘Issues’ in place of email for discussions about data preparation and analysis. We use Issues in a separate private repository to keep our conversations private but our work public. All team members can see and contribute to all conversations, which are a record of all our decisions and discussions across the project and are searchable in a single place. Team members can communicate clearly by linking to specific lines of code in current or past versions of specific files, since these are stored on GitHub and thus have a URL, as well as paste images and screenshots, link to other websites, and send an email to specific team members directly by mentioning their GitHub username. In addition to discussing analytical options, we use Issues to track ongoing tasks, tricks we have learned, and future ideas. Issues provide a written reference of institutional memory so new team members can get up to speed more easily. Most importantly, GitHub Issues have helped us move past never-ending forwarded email chains to conversations available to any current or future team member.
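A hypothetical Issue comment shows these pieces working together: a permalink to specific lines of a specific file version, a pasted question, and a username mention that emails a teammate directly (all names and the `<commit>` placeholder are illustrative):

```markdown
@teammate should we filter before rescaling here? These lines look off
for a few regions:
https://github.com/ohi-science/ohi-global/blob/<commit>/prep/example.R#L40-L45
Flagging so we can decide and document the reasoning in this thread.
```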
We are environmental scientists whose impetus for upgrading approaches to collaborative, data-intensive science was driven by our great difficulty reproducing our own methods. Many researchers do not attempt to reproduce their own work (Nosek et al. 2015; Casadevall and Fang 2010) — ourselves included before 2013 — and thus may not realise that there could be reproducibility issues in their own approaches. But they can likely identify inefficiencies. Integrating open data science practices and tools into science can save time, while also improving reproducibility for our most important collaborator: our ‘future selves’. We have found this as individuals and as a team: We could not be as productive (Lowndes et al. 2015; Lowndes 2017) without open data science practices and tools. We would also not be able to efficiently share and communicate our work while it is ongoing rather than only post-publication, which is particularly important for bridging science and policy. As environmental scientists who are still learning, we hope sharing our experiences will empower other scientists to upgrade their own approaches, helping further shift the scientific culture to value transparency and openness as a benefit to all instead of as a vulnerability (Wolkovich, Regetz, and O’Connor 2012; Stephanie E. Hampton et al. 2015; McKiernan et al. 2016).
From our own experience and from teaching other academic, non-profit, and government groups through the Ocean Health Index project (Lowndes et al. 2015), we find that the main barriers to engagement boil down to exposure and confidence: first knowing which tools exist that can be directly useful to one’s research, and then having the confidence to develop the skills to use them. These two points are simple but critical. We are among the many environmental scientists who were never formally trained to work deliberately with data. Thus, we were unaware of how significantly open data science tools could directly benefit our research (Boettiger et al. 2015; G. Wilson 2016), and upon learning about them we were hesitant, or even resistant, to engage. However, we were able to develop confidence in large part because of the open, inclusive, and encouraging online developer community that builds tools and creates tutorials that meet scientists where they are (Box 1, Box 2). It takes motivation, patience, diligence, and time to overcome the conceptual and technical challenges involved in developing computing skills, but resources are available to help scientists get started (G. Wilson 2016; Boettiger et al. 2015; Haddock and Dunn 2011). Coding is “as important to modern scientific research as telescopes and test tubes” (G. Wilson et al. 2014), but it is critical to “dispel the misconception that these skills are intuitive, obvious, or in any way inherent” (Mills 2015).
There is ongoing and important work by the informatics community on the architecture and systems for data management and archiving (Reichman, Jones, and Schildhauer 2011; Frew and Dozier 2012; Jones et al. 2006; Stephanie E. Hampton et al. 2013), as well as efforts to enable scientists to publish the code that they do have (Barnes 2010; Mislan, Heer, and White 2016; Baker 2016b). This work is critical, but comes with the a priori assumption that scientists are already thinking about data and coding in a way that they would seek out further resources. In reality, this is not always the case, and without visible examples of how to use these tools within their scientific fields, common stumbling blocks will be continually combatted with individual workarounds instead of addressed with intention. These workarounds can greatly delay focusing on actual scientific research, particularly when scientific questions that may not yet have answers — e.g., how the behavior of X changes with Y — are conflated with data science questions that have many existing answers — e.g., how to operate on only criteria X and Y.
Scientific advancement comes from building off the past work of others; scientists can also embrace this principle for using software tools to tackle some of the challenges encountered in modern scientific research. In a recent survey in Nature, 90% of the 1,500 respondents across scientific fields agreed that there was a reproducibility crisis in science, and one third of the respondents reported not having their own “established procedures for reproducibility” (Baker 2016a). While reproducibility means distinct things within the protocols of each sub-discipline or specialty, underpinning reproducibility across all disciplines in modern science is working effectively and collaboratively with data, including wrangling, formatting, and other tasks that can take 50–80% of a data scientist’s time (Lohr 2014). While reaching full reproducibility is extremely difficult (FitzJohn et al. 2014; Aschwanden 2015), incrementally incorporating open data science practices and tools into scientific workflows has the potential to alleviate many of the troubles plaguing science, including collaboration and preserving institutional memory (G. Wilson et al. 2016). Further, sharing openly is fundamental to truly expediting scientific progress because others can build directly off previous work if well-documented, reusable code is available (Wolkovich, Regetz, and O’Connor 2012; McKiernan et al. 2016; Broman 2016; Boland, Karczewski, and Tatonetti 2017). Until quite recently, making research open required a great deal of extra work for researchers and was less likely to be done. Now, with available tools, the benefits of openness can be a byproduct of time-saving efficiencies, because tools that reduce data headaches also result in science that is more transparent, reproducible, collaborative, and freely accessible to others.
Ecologists and environmental scientists arguably have a heightened responsibility for transparency and openness, as data products provide important snapshots of systems that may be forever altered due to climate change and other human pressures (Wolkovich, Regetz, and O’Connor 2012; Reichman, Jones, and Schildhauer 2011). There is particular urgency for efficiency and transparency, as well as opportunity to democratise science in fields that operate at the interface of science and policy. Individuals play an important part by promoting good practices and creating supportive communities (Wolkovich, Regetz, and O’Connor 2012; Mills 2015; McKiernan et al. 2016). But it is also critical for the broader science community to build a culture where openness and reproducibility are valued, formally taught, and practiced, where we all agree that they are worth the investment.
The Ocean Health Index is a collaboration between Conservation International and the National Center for Ecological Analysis and Synthesis at the University of California at Santa Barbara. We thank Johanna Polsenberg, Steve Katona, Erich Pacheco, and Lindsay Mosher who are our partners at Conservation International. We thank all past contributors and funders that have supported the Ocean Health Index, including Beau and Heather Wrigley and The Pacific Life Foundation. We also thank all the individuals and groups that openly make their data, tools, and tutorials freely available to others. Finally, we thank Hadley Wickham, Karthik Ram, Kara Woo, and Mark Schildhauer for friendly review of the developing manuscript.
See ohi-science.org/betterscienceinlesstime as an example of a website built with R Markdown and the RStudio–GitHub workflow, and for links and resources referenced in the paper.
The authors declare no competing financial interests.
Figure 1. Better science in less time, illustrated by the Ocean Health Index project. Every year since 2012 we have repeated Ocean Health Index (OHI) methods to track change in global ocean health (Halpern et al. 2012; Lowndes 2017). Increased reproducibility and collaboration have reduced the amount of time required to repeat methods (size of bubbles) with updated data annually, allowing us to focus on improving methods each year (biggest innovations written as text). The original assessment in 2012 focused solely on scientific methods (e.g., obtaining and analyzing data; developing models; calculating and presenting results (dark shading)). In 2013, by necessity we gave more focus to data science (e.g., data organisation and wrangling; coding; versioning; documentation (light shading)), using open data science tools. We established `R` as the main language for all data preparation and modeling (using RStudio), which drastically decreased the time involved to complete the assessment. In 2014, we adopted Git and GitHub for version control, project management, and collaboration. This further decreased the time required to repeat the assessment. We also created the OHI Toolbox, which includes our `R` package `ohicore` for core analytical operations used in all OHI assessments. In subsequent years we have continued (and plan to continue) this trajectory towards better science in less time by improving code with principles of tidy data (Wickham and Grolemund 2016); standardising file and data structure; and focusing more on communication, in part by creating websites with the same open data science tools and workflow. See text and Table 1 for more details.
Table 1. Summary of the primary open data science tools used to upgrade reproducibility, collaboration, and communication, by task. The transition to using open data science tools was incremental (see Figure 1). All tasks are accomplished with the RStudio–GitHub workflow that is underpinned by `R` and Git. This workflow streamlines collaboration by capturing each individual’s contribution to the project – thus taking care of bookkeeping – for tasks from data processing and analysis to creating documents and websites with embedded results that are updatable. Note that collaboration is not only for labs and teams, but also for each individual’s ‘future self’.
| | Task | Then | Now | Primary open data science tools |
|---|---|---|---|---|
| **Reproducibility** | data preparation | manually (i.e., Excel) | coded in `R` | `R` packages: `tidyverse` (`dplyr`, `tidyr`, `ggplot2`). Documentation: R Markdown |
| | modeling | multiple programming languages | `R` functions and `ohicore` package | `R` packages: `tidyverse`, `devtools`, `roxygen2`, `git2r` |
| | version control | file duplication and renaming | Git | Git; interface with Git and GitHub primarily through RStudio |
| | organisation | individual conventions | standardised team convention | RStudio projects, GitHub repositories, file structure protocols |
| **Collaboration** | coding | separate languages and conventions | `R`; standardised team convention | Principles of tidy data; `tidyverse` |
| | workflow and project management | individual conventions | (simplified) GitHub workflow | GitHub, RStudio |
| | internal collaboration | email chains | centralised, archived conversations | GitHub Issues |
| **Communication** | sharing data | ftp download | all versions and Releases available online | ohi-science.org/ohi-global |
| | sharing methods | published manuscript and supplementary material | published on our website | ohi-science.org website, with linked R Markdown outputs (webpages, presentations, etc.) |
Box 1. Resources to learn open data science tools. These are some of the free, online resources that we used to learn and develop a workflow with `R`, RStudio, Git, and GitHub. These resources exposed us to what was possible, and helped us build skills to incorporate concepts and tools into our own workflow. This is by no means an exhaustive list. See also Box 2 for strategies on how to get started.
**Primarily R**

- *R for Data Science*, book by Hadley Wickham and Garrett Grolemund (r4ds.had.co.nz)
- *RStudio’s webinars*, on-demand videos by RStudio (rstudio.com/resources/webinars)
- *RStudio’s cheatsheets*, PDFs by RStudio (rstudio.com/resources/cheatsheets)
- *CRAN Task Views*, to identify useful packages by category of task (cran.r-project.org/web/views)
- *R Packages*, book by Hadley Wickham (r-pkgs.had.co.nz)

**Combination RStudio–GitHub**

- *Happy Git With R*, short course by Jenny Bryan (happygitwithr.com)
- *UBC Stats545: Data Wrangling, Exploration, and Analysis with R*, university course by Jenny Bryan (stat545.com)
- *Software Carpentry*, workshops and teaching and learning communities (software-carpentry.org); example 2-day course: “Reproducible Science with RStudio and GitHub” (jules32.github.io/2016-07-12-Oxford/overview)

**Community discussion**

- *#rstats on Twitter*, online discussions (twitter.com/search?q=%23rstats&src=typd)
- *Not So Standard Deviations*, podcast by Roger Peng and Hilary Parker (soundcloud.com/nssd-podcast)
- *R-Bloggers*, blog (r-bloggers.com)
- *RStudio blog* (blog.rstudio.org)
- *Data Carpentry blog* (datacarpentry.org/blog)
Box 2. Strategies to learn in an intentional way. The resources listed in Box 1 have helped us learn open data science principles and tools in an intentional way: we felt empowered (vs. panicked), we learned to think ahead (vs. quick fixes for single purposes), and we learned with a community (vs. in isolation). There is a whole ecosystem of open data science principles, practices, and tools (including `R`, RStudio, Git, and GitHub) and no single way to begin learning. These are a few strategies you can consider as you get engaged.
**Self-paced learning**

Box 1 lists resources to learn open data science principles and tools that you can use at your own pace. The books and courses provide in-depth philosophies and are good for initial learning as well as for reference later on. Webinars and podcasts are generally under an hour.
**Join and/or create communities**

Learning together and supporting each other peer-to-peer can be more fun and rewarding. You can become a “champion” for others by showing leadership as you learn. Start off by watching a webinar with a friend or group during lunch or a happy hour. Learn enough about a useful `R` package to share in your lab meetings; you learn best by teaching. In traditional journal clubs or lab meetings, discuss an academic article on the importance of reproducibility, collaboration, and coding (Baker 2017; J. M. Perkel 2014; White et al. 2013; Perez-Riverol et al. 2016). Search whether your institution or city has local Meetup.com groups, or create your own.
Additionally, join or keep tabs on communities online. Mozilla Study Groups are a network of ‘journal-clubs’ where scientists teach scientists computing skills (science.mozilla.org/programs/studygroups/join). rOpenSci is a developer collective building R-based tools to facilitate open science (ropensci.org). Also look on Twitter for #rstats discussions and then follow individuals from those conversations.
**Ask for help**

Local and online communities are a great resource when you need help. Expecting that someone has already asked your question can help you both articulate the problem clearly and identify useful answers. Often, pasting error messages directly into Google will get you to the best answers quickly. Many answers come from online forums, including StackOverflow.com (Baker 2017), or even Twitter itself (e.g., ‘How Twitter Improved My Ecological Model’) (Rbloggers 2015).
**Attend in-person workshops and conferences**

In-person workshops can be extremely valuable and give you an opportunity to get direct help from instructors and helpers. Software Carpentry and Data Carpentry run 2-day bootcamps that teach skills for research computing; you can attend a scheduled workshop or request your own (software-carpentry.org; datacarpentry.org). Attend conferences like useR (example: user2017.brussels) both for skill-building and to learn how others are using these tools.
**Watch presentations from past conferences**

More and more, slide decks and videos of presentations are appearing online. For example, you can see presentations from the 2016 useR conference (user2016.org) and the 2017 RStudio conference (rstudio.com/conference).
**Read blogs**

Many individuals blog about open data science concepts, `R` packages, workflows, and more. Try Googling a package you’re using, or going to the website of someone you follow on Twitter.