Chapter 7 Preparing data
The purpose of Chapter 7 is to introduce you to the basic workflow for preparing data for OHI. This is a a 2-hour hands-on training: you will be following along on your own computer and working with a copy of the demonstration repository (toolbox-demo) that is used throughout the this chapter.
Preparing data takes the biggest chunk of time when you’re using the OHI Toolbox, even more than the final scores calculation itself. That’s when you explore raw data, see whether it fits with your ideal spatial boundaries and whether it makes sense to include it in your final calculations. If it does, you can format it further and save it as data layers, or Toolbox inputs for scores calculations.
The Starter Repo 6 aims to help you wade through these important first steps of the assessment. Treat the preparatory, or “prep”, files as your notebook, calculator, and presentation of your work. Here, the process of data exploration is recorded and can be easily shared with anyone.
In this chapter, we will go through the basic requirements and steps of preparing a data layer:
- understand the formatting requirements
- try a hands-on tutorial in which you can use sample data to format and save
- register a data layer
Before the training starts, please make sure you have done the following:
Have up-to-date versions of
Rand RStudio and have RStudio configured with Git/GitHub
Fork the toolbox-demo repository into your own GitHub account by going to https://github.com/OHI-Science/toolbox-demo, clicking “Fork” in the upper right corner, and selecting your account. Note: if you already have an OHI+ repository of your own, you can use it instead. All of the file names and architecture is the same.
Clone the toolbox-demo repo from your GitHub account into RStudio into a folder called “github” in your home directory (filepath “~/github”). Note: If you’ve already forked and cloned the demo, you can instead pull from your computer to make sure you have the most recent versions of all the files.
Get comfortable: be set up with two screens if possible. You will be following along in RStudio on your own computer while also watching an instructor’s screen or following this tutorial.
toolbox-demo repository is set up, open the
prep folder. It should look like this:
Each goal and sub-goal has its own sub-folder, so you can store raw and intermediate data as you work in R Markdown (or
R). It is highly recommended that data preparation occurs in the
prep folder as much as possible, as it will also be archived by GitHub for future reference.
But before we start preparing a data layer, let’s first go over the data formatting requirements.
7.2 Data Formatting requirements
A data layer is a data file used to calculate scores. The global analysis included over 100 data layer files, and there will probably be as many in your own assessments. Each data layer (data input) has its own .csv file. These data layers are combined to calculate goal scores, meaning that they are inputs for status, trend, pressures, and resilience.
The OHI Toolbox expects each .csv file to have:
- data for every region within the study area
- a unique region identifier (rgn_id) associated with a single score or value.
- data organized in ‘long’ format - as few columns as possible. (You can read more about long formatting here)
OHI goal scores are calculated at the scale of the reporting unit, which is called a ‘region’ and then combined to produce the score for the overall area assessed, called a ‘study area’. For example, the U.S. is a study area, and each coastal state is a region.
In addition, to calculate trend, input data should be available as a time series for at least 5 recent years - and the longer the better, as this can be used in setting temporal reference points.
Finalized data layers have at least two columns: the
rgn_idcolumn and a column with data identified by its units (eg. km2 or score). There often may be a
yearcolumn or a
categorycolumn (for natural product categories or habitat types).
Below are examples of two different data layer files: tourism count (
tr_total.csv) and natural products harvested (
np_harvest_tonnes.csv). They show information for a study area with 4 regions. Each region has multiple years of data. And the second data layer has an additional ‘categories’ column for the different types of natural products that were harvested.
In this example, the two data layers are appropriate for status calculations with the Toolbox because:
- At least five years of data are available
- There are no data gaps
- Data are presented in ‘long’ or ‘narrow’ format
When data gaps, temporal or spatial, are inevitable, various gap-filling techniques can be used. We won’t go through the details here. For more information and examples, visit OHI Manual
7.3 Data Wrangling
So how do we get from raw data to a data layer that looks like the ones shown above? When we first access a data set, we often don’t know whether it is suitable for our purpose, or if it provides adequate information. Raw data can be in a different format than the desired long format and have extra or incomplete information. The main task of data preparation is to comb through the raw data, combine different sources, and sometimes fill in the information gap. We call this process “data wrangling.”
For reproducibility and transparency, it is also good practice to record and share the decision-making process - trials and errors, why you decide to include, or more importantly, exclude a certain data source. Fortunately, we can take notes on the data exploration process and code in one place!
Here is an example of a data prep document. Let’s give it a try to make one just like this.
Now let’s switch to the
demo repo. Today we will use the Clean Water (CW) goal as an hands-on example. We will get sample secchi depth data, as an indicator of water clarity.
Click on the
CW folder and open
CW_data_prep.Rmd. We will follow the tutorial from there.
7.4 Save and register data layers
Note: if you are preparing data from your Starter/Prep repository, you won’t have the layers folder and
layers.csv. You won’t be able to do the following steps until you have the Full repository.
After you have successfully processed a data layer, and it can be used by the Toolbox for calculations, it needs to be saved and registered here:
layers folder is where all data layers live, and where the Toolbox pulls a layer out when needed.
layers.csv is a giant spreadsheet that contains information about each layer - which goal it is used for, filename, column names, etc., and it will direct ohicore to appropriate data layers during calculations.
Let’s first save the layer. You can do that in the prep script. Let’s switch back to the Demo repository, and follow the last section.
7.5 Register data layers in
After we have saved a data layer in
layers folder, we will catalogue the layer in
If a layer simply has a new filename, only the filename column needs to be updated:
However, if a new layer has been added (for example when a new goal model is developed), you will open
layers.csv in a spreadsheet software (i.e. Microsoft Excel), add a new row in the registry for the new data layer and fill in the first eight columns (columns A-H):
- targets: Add the goal/dimension that the new data layer relates to. Goals are indicated with two-letter codes and sub-goals are indicated with three-letter codes, with pressures, resilience, and spatial layers indicated separately.
- layer: Add an identifying name for the new data layer, which will be referenced in R scripts like
functions.Rand .csv files like
- name: Add a longer title for the data layer.
- description: Add a longer description of the new data layer.
- fld_value: Add the appropriate units for the new data layer. It is the same as the column name in the data file, which will be referenced in R scripts in subsequent calculations. (example: area_km2)
- units: Add a description about the units chosen in the fld_value column above. Think about what units you would like to be displayed online when filling out “units.” (example: km^2)
- filename: Add a filename for the new data layer that matches the name of the .csv file that was created previously in the
- fld_id_num: Area designation that applies to the newly created data layer, such as: rgn_id and fao_id.
It is important to check that you have filled you the fields correctly, for instance, if “fld_value” does not match the header of the source data layer, you will see an error message when you try to calculate scores. Other columns are generated later by the Toolbox as it confirms data formatting and content.
Let’s open your layers.csv from your Finder, and we will fill it out together from there:
This is what the new line should look like:
7.6 Organizing metadata
Keeping track of metadata is important as you create these layers. Metadata is information about the data, most importantly, its source. It’s best to provide not only a reference like Feely et al. 2009, but also a url link to where other people could access the data. Other information that can be really useful is the years of data you use, and which goal models the data are used in.
You can keep track of metadata in whatever way makes most sense to you. Since
layers.csv is a file that you will keep updated with every data layer, it can make sense to add information there. You can add as many columns to layers.csv as you’d like. However, this will create a really huge spreadsheet, so maybe you want to keep this information in another place.
If you’ve been using the Data Planner from Chapter ??, that can be a good place to continue keeping track of all the metadata. Or, you could start another spreadsheet.
It doesn’t matter where or how you keep this information, but be sure that you keep it somewhere, and keep it organized and consistent. You’ll need information for every single layer you include, and it’s easier to do as you go along than all at the end. We recommend copying the structure of
layers.csv (if you’re not using
layers.csv for metadata) and having at a minimum columns that are for the data name, Toolbox data layer name, description, reference, and url.
Later on in your assessment, we’ll be able to take information from the columns of your csv and use them for reporting and on your website. An example of this is from the Global assessment: see http://ohi-science.org/ohi-global/layers.
7.7 Data prep vs. functions.r
Where does one end and the other begin?
7.8 Chapter Recap:
Hooray! You have just learned: how to process data, save, and register a data layer for the OHI Toolbox!
- OHI data layer formatting requirements
- data for each region
- long format
- best to have at least five years of continuous data
- how to process a messy and large raw data layer (secchi depth data for CW goal)
- how to saved and register the prepared data layer
Now the layer is ready to be used by the toolbox to calculate status and trend scores. We’ll see you at Chapter 6 8 to learn that!