1 Summary

This analysis converts FAO commodities data into data layers used to calculate OHI 2018 global natural products scores.

2 Updates from previous assessment

New year of FAO data (1976-2016). To stabilize the model, we include only commodities with >= 4 non-zero/non-NA values within the most recent 10 years (vs. the entire time frame we previously used).

For next year, consider only including commodities with USD value > 1. This may further improve the stability of the model. Also want to narrow the products to just fish oil and seaweed to mitigate model instability.

3 Data Source

Reference:
http://www.fao.org/fishery/statistics/software/fishstatj/en#downlApp App release date: May 2019 FAO raw commodities quantity 1976_2016 FAO raw commodities value 1976_2016 FAO metadata found here

Downloaded: July 22 2019

Description: Quantity (tonnes) and value (USD) of raw commodities (Exports only) for each country, taxa, year. The FAO data is subset to include commodities in these categories: shells, corals, ornamental fish, fish oil, seaweed and plants, sponges (see: raw/commodities2products.csv for details).

Time range: 1976-2016

4 Methods

knitr::opts_chunk$set(eval=FALSE)

## load libraries, set directories
library(ohicore)  #devtools::install_github('ohi-science/ohicore@dev')
library(dplyr)
library(stringr)
library(tidyr)
library(zoo)  
library(ggplot2)
library(here)
library(tidyverse)
library(plotly)


## Load FAO-specific user-defined functions
source(here('workflow/R/fao_fxn.R')) # function for cleaning FAO files
source(here('workflow/R/common.R')) # directory locations

## This file makes it easier to process data for the OHI global assessment
##  by creating the following objects:
## 
##  * dir_M = identifies correct file path to Mazu (internal server) based on your operating system
##  * mollCRS = the crs code for the mollweide coordinate reference system we use in the global assessment
##  * regions_shape() = function to load global shapefile for land/eez/high seas/antarctica regions
##  * ohi_rasters() = function to load two rasters: global eez regions and ocean region
##  * region_data() = function to load 2 dataframes describing global regions 
##  * rgn_syns() = function to load dataframe of region synonyms (used to convert country names to OHI regions)
##  * low_pop() = function to load dataframe of regions with low and no human population
##  * UNgeorgn = function to load dataframe of UN geopolitical designations used to gapfill missing data

source(here('globalprep/np/v2019/R/np_fxn.R'))

5 Import Raw Data: FAO Commodities

Simultaneously read and process FAO commodities value and quantity data.

## NOTE: This can be run as a loop, but the "value" and "quant" datasets need to be run individually to make sure
## there are no problems (after this check, they can be looped for efficiency)

dir_fao_data <- file.path(dir_M, 'git-annex/globalprep/_raw_data/FAO_commodities/d2019')

files <- list.files(dir_fao_data, pattern=glob2rx('*.csv'), full.names=TRUE)

for (f in files){ # f = files[2]
  cat(sprintf('\n\n\n====\nfile: %s\n', basename(f)))
  
  
  d <- read.csv(f, check.names=FALSE, strip.white=TRUE, stringsAsFactors = FALSE) #          stringsAsFactors=T
# checks names syntactically, strips leading and trailing whitespace, prevents conversion of characters to factors 

  
  ## Specifies that units are tonnes if we are reading in the Commodities Quantity data csv, and usd if we are reading in the Commodities Value data csv
  units <- c('tonnes','usd')[str_detect(f, c('quant','value'))] # detect unit name using lowercase American English

  ## gather into long format and clean up FAO-specific data foibles
  ## warning: attributes are not identical across measure variables; they will be dropped:   this is fine
  m <- d %>% 
    rename(country   = `Country (Country)`,
           commodity = `Commodity (Commodity)`,
           trade     = `Trade flow (Trade flow)`) %>%
    gather(year, value, -country, -commodity, -trade, -Unit)
  
  ## Include only the "Exports" data:
  m <- m %>%
    filter(trade == "Exports")

  m <- m %>%
    fao_clean_data() %>%  # swaps out FAO-specific codes. NOTE: optional parameter 'sub_0_0' can be passed to control how a '0 0' code is interpreted.
    select(-trade, -Unit) %>% # eliminate 'trade' column
  arrange(country, commodity, is.na(value), year)

  ## Products join: attach product categories from com2prod, and
  ##   filter out all entries that do not match a product category.
  ## Note: commodity_lookup is user-defined function to compare 
  ##   commodities in data vs commodities in lookup table
  
  ## load lookup for converting commodities to products
  com2prod <- read.csv(here('globalprep/np/v2019/raw/commodities2products.csv'), na.strings='')

  ## Check the current commodity-to-product lookup table.  If necessary, make changes to     "raw/commodities2products.csv"
  np_commodity_lookup(m, com2prod)
    
  ## inner_join will attach product names to matching commodities according to
  ## lookup table 'com2prod', and eliminate all commodities that do not appear in the lookup table.
  m <- m %>%
      inner_join(com2prod, by='commodity')
    
    
  ## Special case: user-defined function deals with 
  ##   breaking up Antilles into separate reported rgns
  m <- np_split_antilles(m)
    
  ## Some changes to region names that aren't working in name_2_rgn()
  m <- m %>%
    mutate(country = ifelse(country == "Côte d'Ivoire", "Ivory Coast", country)) %>%
    mutate(country = ifelse(country == "C<f4>te d'Ivoire    ", "Ivory Coast", country)) %>%
    mutate(country = ifelse(country == "C\xf4te d'Ivoire", "Ivory Coast", country)) %>%
    mutate(country = ifelse(country == "Cura<e7>ao","Curacao", country)) %>%
    mutate(country = ifelse(country == "Curaçao","Curacao", country)) %>%
    mutate(country = ifelse(country == "Cura\xe7ao","Curacao", country)) %>%
    mutate(country = ifelse(country == "R\xe9union", "Reunion", country)) %>% 
    mutate(country = ifelse(country == "Réunion", "Reunion", country)) %>% 
    mutate(country = ifelse(country == "R<e9>union", "Reunion", country)) %>% 
    filter(country != "Azerbaijan") # landlocked, but not being removed by name_2_rgn?
                   
    
  m_rgn <- name_2_rgn(df_in = m,
                      fld_name='country', 
                      flds_unique=c('commodity', 'product', 'year'))
    
# v2019 unmatched: eswatini, palestine, yugoslavia (all of these are landlocked)  
# v2019 duplicates: china, HK SAR, Macao SAR, Guadeloupe, Serbia and Montenegro, etc - these are addressed below in the group_by/summarize pipe
  
  ## combine composite regions
  ## When summarizing the dataset, this function provides a modified way to sum the value column while maintaining NA values when both variables are NA (rather than turning to zero values). The function will sum non-NA values normally.
  sum_function <- function(x) {
    if (sum(is.na(x)) == length(x)) 
      return(NA)
    return(sum(x, na.rm = T))}
  
  m_rgn <- m_rgn %>%
    group_by(rgn_id, rgn_name, commodity, product, year) %>%
    summarize(value = sum_function(value)) %>%
    ungroup()

# number of lines of data go down here^

## units: rename value field to units based on filename
  names(m_rgn)[names(m_rgn) == 'value'] <- units  
  
  ## output to .csv - should create two csvs (tonnes.csv and usd.csv) 
  harvest_out <- sprintf(here('globalprep/np/v2019/int/%s.csv'), units)
  write.csv(m_rgn, harvest_out, row.names = FALSE, na = '')
  
}

6 Data Wrangle

Combining the quantity and value data and a bit of cleaning to remove data prior to first reporting year for each commodity and region.

h_tonnes <- read.csv(here('globalprep/np/v2019/int/tonnes.csv'))

h_usd <- read.csv(here('globalprep/np/v2019/int/usd.csv'))

## concatenates h_tonnes and h_usd data
## h includes rgn_name, rgn_id, commodity, product, year, tonnes, usd.
commodities <- h_usd %>%
    full_join(h_tonnes, by=c('rgn_name', 'rgn_id', 'commodity', 'product', 'year')) %>%
    mutate(commodity = as.character(commodity)) %>%
    arrange(rgn_id, product, commodity, year)

filter(commodities, rgn_id==123 & year>2000)


## clips out years prior to first reporting year, for each commodity per region
commodities <- commodities %>% np_harvest_preclip

# Correct a seeming error in FAO data, when one value is >0 and the other is 0
# eg. rgn_id 153 Cook Islands, Ornamental saltwater fish, 2016, 12 USD and 0 tonnes

commodities <- commodities %>%
  mutate(usd = ifelse(tonnes>0 & usd==0, NA, usd)) %>%
  mutate(tonnes = ifelse(usd>0 & tonnes == 0, NA, tonnes))

7 Gapfilling

See issue #397 for details and debate and pretty graphs. Summary of gapfilling that is performed:

Zero-fill: for observations with NAs for both values (tonnes & usd), fill both as zero. Also cross-fills zeros where one value is zero, other is NA.
Regression fill, first pass: Where enough non-zero paired observations exist at the country level, use country-level data to create regression models (tonnes ~ usd and vice versa) for gapfilling. About 25% success.
Regression fill, second pass: Where pass 1 failed, and enough non-zero paired observations exist at georegional level, use georegional-level data to create regression models (tonnes ~ usd + year, and vice versa) for gapfilling. About 90% success.
Regression fill third pass: Where passes 1 and 2 failed, use global-scale data to create regression models (tonnes ~ usd + year, and vice versa) for gapfilling. 100% success.
End-fill: For years where NAs still exist in final year, carry forward data from prior year (after other gapfilling techniques).

commodities <- commodities %>% np_harvest_gapflag  
## Adds flag for required gap-filling, based upon NAs in data. 
## NOTE: Does not perform any gap-filling.
## At this point, h includes: 
## rgn_name   rgn_id   commodity   product   year   tonnes   usd   gapfill
## 'gapfill' will be in (zerofill, endfill, tbd, none)

data_check <- commodities %>% np_datacheck()
## for each commodity within each region, creates (but doesn't save...) summary info:
##   num_years:        the length of the data series for this commodity in this region
##   usd_unique_nz:    (or 'tns') number of unique non-zero values for usd or tonnes 
##   usd_na & tns_na:  number of NA occurrences
##   paired_obs:       number of non-zero paired observations
##   usd_unique_pairs: (or 'tns') within set of paired observations, count of unique usd and tonnes
##   unique_pairs:     lesser of usd_unique_pairs and tns_unique_pairs
##   count_no_data:    number of paired NAs - years with no value reported

commodities <- commodities %>% np_zerofill
## for post-reporting years with NA for both tonnes and USD, fill zero - 
## assumes that non-reporting indicates zero harvest to report.
## Also cross-fills zeros where one side is 0, other is NA (not flagged as gapfill)

commodities <- commodities %>% np_lowdata_filter()
## Exclude commodities (within a region) that have few non-zero data points during past 10 years.
## Optional parameter with default: nonzero_h_yr_min = 4
## NOTE: This filter has consequences for the regression, but also has meaning in terms of 
## not inflicting a penalty on regions trying, and then stopping, an experimental harvest.

UNgeorgn()

commodities <- commodities %>% 
  add_georegion_id() %>%
  select(-territory, -admin_rgn_id, -admin_country_name, -Notes) %>% 
    rename(georgn_id = r2_label)
## Melanie's script to add a georegional ID tag based on country keys and IDs. Used to gap-fill based on decreasing granularity. 

commodities <- commodities %>% np_regr_fill(years_back = 10, vars = 'td', scope = 'rgn_id')
commodities <- commodities %>% np_regr_fill(vars = 'tdy', scope = 'georgn_id')
commodities <- commodities %>% np_regr_fill(vars = 'tdy', scope = 'global')

## np_regr_fill() is a generalized regression gapfill function. Parameters (with defaults):
## * years_back=50 (int):     This determines how far back in the time series to include within the regression.
## * min_paired_obs=4 (int):  This determines how many paired observations are required to attempt a regression.
## * scope = 'rgn_id' (str):  ('rgn_id', 'georgn_id', 'global') Determines grouping scale for regression.
## * vars = 'tdy' (str):      ('td', 'tdy') Determines model: (tonnes ~ usd) or (tonnes ~ usd + year) [and vice versa]

commodities <- commodities %>% np_end_fill()
## For final year of data, if both usd and tonnes originally reported as NA, pull forward
## values for usd and tonnes from the previous year.  This should happen after regression fill.

h_comm <- commodities
## Store commodity-level data, before moving on to the product-level smoothing.

8 Final Data Wranglng

8.1 Summarize values

Summarize each product per country per year, e.g., all corals in Albania in 2011. And, do some error checking.

h_prod <- h_comm %>%
  group_by(rgn_name, rgn_id, product, year) %>%
  summarize(tonnes = sum(tonnes, na.rm = TRUE), 
            usd = sum(usd, na.rm = TRUE)) %>% 
  ungroup()

# v2019: Added ungroup() above

## Error-checking and table exports to see if there are duplicates
stopifnot(sum(duplicated(h_prod[ , c('rgn_id', 'product', 'year')])) == 0)

8.2 Quick Data Check

Look at wide format with all commmodities and product subtotal (where commodity column value is “Z_TOTAL”), compared with the input data prior to summing.

h_x_tonnes <- h_comm %>% 
  bind_rows(mutate(h_prod, commodity='Z_TOTAL')) %>%
  select(rgn_name, rgn_id, commodity, product, year, tonnes) %>%
  arrange(rgn_name, product, commodity, year) %>%
  spread(year, tonnes)

h_x_usd <- h_comm %>% 
  bind_rows(mutate(h_prod, commodity='Z_TOTAL')) %>%
  select(rgn_name, rgn_id, commodity, product, year, usd) %>%
  arrange(rgn_name, product, commodity, year) %>%
  spread(year, usd)

## Check a random country and commodity
australia <- h_x_usd %>% 
  filter(product == "shells", rgn_name == "Australia") 


## Compare old and new (v2019)
h_x_usd_old <- read_csv(here("globalprep/np/v2018/int/np_harvest_usd_wide.csv"))

aus_old <- h_x_usd_old %>% 
  filter(product == "shells", rgn_name == "Australia", commodity == "Z_TOTAL")

australia <- australia %>%
  filter(commodity == "Z_TOTAL") %>% 
  tidyr::gather(key = "year", value = "value_v2019", -(rgn_name), -(commodity), -(rgn_id), -(product))


aus_compare <- aus_old %>% 
  tidyr::gather(key = "year", value = "value_v2018", -(rgn_name), -(commodity), -(rgn_id), -(product)) %>% 
  full_join(australia, by = c("rgn_id","rgn_name","commodity","product", "year"))

## Can open up in Excel to compare subtotals per country-product-year
write.csv(h_x_tonnes, here('globalprep/np/v2019/int/np_harvest_tonnes_wide.csv'), row.names = FALSE, na = 'NA')
write.csv(h_x_usd,    here('globalprep/np/v2019/int/np_harvest_usd_wide.csv'),    row.names = FALSE, na = 'NA')

8.3 Calculate Rolling Averages

Determine rolling averages for tonnes and USD in order to determine peak values. This is based upon total harvests by product group, not individual commodity.

# Find max year in the summarized data table
year_max <- max(h_prod$year)

roll_prod <- h_prod %>%
  arrange(rgn_id, product, year) %>%
  group_by(rgn_id, product) %>%
  mutate(
      tonnes_rollmean = rollapply(tonnes, width=4, FUN=mean, align='right', partial=TRUE, na.rm=FALSE),
      usd_rollmean    = rollapply(   usd, width=4, FUN=mean, align='right', partial=TRUE, na.rm=FALSE)) %>%
  rename(
      tonnes_orig = tonnes, # prevent overwriting of reported and gapfilled values
      usd_orig    = usd) %>% # prevent overwriting of reported and gapfilled values
  mutate(
      tonnes = ifelse(!is.na(tonnes_rollmean), tonnes_rollmean, tonnes_orig),
      usd    = ifelse(!is.na(usd_rollmean),    usd_rollmean,    usd_orig)) %>%
  select(rgn_id, rgn_name, product, year, tonnes, usd, tonnes_orig, usd_orig)

8.4 Score Harvest Relative to Peaks

Score harvest (tonnes and usd) relative to peaks. Output values as .csvs. Perform this for all given scenarios, using a for loop.

buffer  <-  0.35 # 35% buffer (from OHI Methods)
recent_years  <-  10

## Find peak harvest per region-product and apply conservative buffer (scale down)
## Find max USD value over the last 10 years 
peak_prod <- roll_prod %>%
    group_by(rgn_id, product) %>%
    mutate(tonnes_peak = max(tonnes, na.rm=T)  * (1 - buffer)) %>%
    mutate(usd_peak = max(usd[year >= (year_max - recent_years)], na.rm=T)) %>%
    ungroup() 

## for each product, all years (within a region) have the same usd_peak values, but some years don't have all the products. Use the most recent year as this is considered the most current product list. 
prod_weights <- peak_prod %>%
    filter(year==year_max) %>% 
    group_by(rgn_id) %>%
    mutate(
      usd_peak_allproducts = sum(usd_peak, na.rm=T),
      prod_weight = usd_peak / usd_peak_allproducts) %>%
    ungroup() %>%
  mutate(year = year_max) %>% 
  select(rgn_id, year, product, weight = prod_weight)

## Determine relative status:
  smooth_prod <- peak_prod %>% 
    mutate(tonnes_rel = ifelse(tonnes >= tonnes_peak, 1, tonnes / tonnes_peak))

8.5 Save data layer

## Write entire data frame to .csv:
write.csv(smooth_prod, here('globalprep/np/v2019/int/np_harvest_smoothed_data.csv'), row.names = FALSE, na = '')

## Write individual data layers:
## Write NP weights layer also used to calculate pressures and resilience:
write.csv(prod_weights, here('globalprep/np/v2019/output/np_harvest_weights_from_usd.csv'), row.names = FALSE, na = '')

## Save tonnes data
tonnes <- smooth_prod %>%
  select(rgn_id, product, year, tonnes) 
write.csv(tonnes, here('globalprep/np/v2019/output/np_harvest_tonnes.csv'), row.names = FALSE, na = '')

## Save relative tonnes data
tonnes_rel <- smooth_prod %>%
  select(rgn_id, product, year, tonnes_rel) 
write.csv(tonnes_rel, here('globalprep/np/v2019/output/np_harvest_tonnes_rel.csv'), row.names = FALSE, na = '')

8.6 Final data check

Checking out some outliers with large changes in status values (found after uploading to global):

region_id	region name	change in status	explanation
67	Libya	+84	Looks like there may have been a difference in calculation or mistake made somewhere - the v2018 final output file np_harvest_tonnes.csv had NA for shells in 2013-2015, whereas the same data file for v2019 had 3-5 tonnes for this product over the same span (tonnes values were 0 - 1 for previous years in both files, so big jump). Odd because the raw data have matching numbers for tonnes, so maybe something got messed up in either this year or last year’s calculation (neither year gapfilled these data).
153	Cook Islands	+27	No differences in raw data, but values for ornamentals are different between v2018 and v2019- minorly for 2010-11, and more significantly between 2012-2015. This is odd because the raw data for both v2018 and v2019 has 0 for ornamental fish nei and and NA for ornamental saltwater fish (the only ornamental items for this region) for 2012-2015. Note: ornamental saltwater fish values were gapfilled using UN georegions (in v2018 and v2019).
78	Lebanon	+24	Big changes to ornamentals from 2012-2015 between v2018 and v2019 (raw data had some fish oil values that were zero in v2018 that are now NA, otherwise all raw values matched up). Ornamental fish nei values were zerofilled from 2011-2013 , ornamental saltwater fish zerofilled in 2014
43	Kenya	+ 18	Again, big jump in ornamental tonnes from 2013-2015, with no changes in the raw data between v2018 and v2019. None of these data were gapfilled.
191	Iran	-30	In raw data, ornamental fish nei had zeroes for 2014-2015 in v2019 and NAs in v2019; no other changes, including in output tonnes data between v2018 and v2019 (must have gapfilled based on previous years’ data).

OHI 2019: Natural Products

Compiled on Thu Aug 22 10:43:39 2019 by mwilliams