
[REFERENCE RMD FILE: https://cdn.rawgit.com/OHI-Science/ohiprep/master/globalprep/tr/v2017/tr_dataprep.html]

1 Summary

This document describes the steps for obtaining the data used to calculate the tourism and recreation goal for the 2018 global assessment.

The general calculation is: tr = Ep * Sr * Tw, where Ep is the proportion of the workforce directly employed in tourism, Sr is tourism sustainability, and Tw is the travel warning multiplier. The status score is Xtr = tr / (90th quantile of tr across regions).

The following data are used:

* Tourism sustainability: Travel and Tourism Competitiveness Index (TTCI) from the World Economic Forum (WEF)
* Proportion of jobs in tourism: World Travel & Tourism Council (WTTC) employment data
* Travel warnings: U.S. State Department and Government of Canada travel advisories
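To make the model concrete, here is a minimal sketch with made-up numbers (the data frame tr_example and all of its values are hypothetical; capping the status at 1 follows the usual OHI convention of not scoring above the reference point):

library(dplyr)

tr_example <- data.frame(rgn_id = 1:4,
                         Ep = c(0.05, 0.20, 0.10, 0.15),  # proportion of jobs in tourism
                         Sr = c(0.70, 0.52, 0.83, 0.80),  # tourism sustainability (rescaled)
                         Tw = c(1, 0.75, 1, 0.25)) %>%    # travel warning multiplier
  mutate(tr = Ep * Sr * Tw) %>%                           # tourism & recreation value
  mutate(Xtr = pmin(1, tr / quantile(tr, 0.90)))          # status: rescale by 90th quantile, cap at 1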

2 Updates from previous assessment

In 2017 we discovered that the WEF-Economics Global Competitiveness Index data used to estimate sustainability are not compatible across years. New methods are used each year, and previous years of data are not recalculated using the updated methods. Consequently, we use the most recent data for all scenario years.

The U.S. State Department has updated the way travel warning data are reported. This year every country has a travel warning that ranges from level 1 (exercise normal precautions) to level 4 (do not travel). Within each country's general warning there are regional warnings, obtained by clicking on each country. In past assessments we identified subregional warnings and modified the penalty based on this information. This year, because of the new reporting format, we no longer apply this method. We eliminated the "subregion" information/penalties and modified past data to be consistent with our new approach. Future data should be collected without including this information. More information on how to obtain the data is in the travel warning section below.

The TTCI data are available to download as an Excel file directly from the source. The file was downloaded on 07/12/2018 and saved on Mazu as a csv. No changes in the data.

We were able to update the following data:

* Tourism Sustainability: TTCI from WEF - no updates
* Proportion of jobs in tourism: WTTC data reported until 2017; it is unclear whether 2018 is actual data or a prediction, so we used 2017 as the data_year (downloaded 07/12/2018)
* Travel warnings for 2018 (downloaded: U.S. State Department 06/28/2018; Government of Canada 07/06/2018)

3 Some code to set everything up

#setwd("~/github/ohiprep_v2018/globalprep/tr/v2018") #comment out when knitting
#setwd("globalprep/tr/v2018")

#library(devtools)
#devtools::install_github("ohi-science/ohicore@dev")
library(ohicore)
library(tidyverse)
library(stringr)
library(WDI)

source('https://raw.githubusercontent.com/OHI-Science/ohiprep_v2018/master/src/R/common.R')

## maximum year of wttc data:
year_max <- 2017

source('R/tr_fxns.R')

4 Ep: Proportion of workforce directly employed in tourism

These data are from the World Travel & Tourism Council (http://www.wttc.org/). We use "direct" employment data (see mazu: globalprep/_raw_data/WTTC/d2018/README.md for instructions on obtaining data). The data extend to 2028, but these later values are projections; reported data go to 2018, although it is unclear whether 2018 is actual data or a prediction, so we use 2017 as the maximum data year (year_max above).

These data are cleaned and formatted using the R/process_WTTC.R script. Missing values are gapfilled using the UN georegion information.

## describe where the raw data are located:
dir_wttc <- file.path(dir_M, 'git-annex/globalprep/_raw_data/WTTC/d2018/raw')

## processing script that formats the WTTC for OHI, saves the following: intermediate/wttc_empd_rgn
source('R/process_WTTC.R', local = TRUE)

## read in the dataset created by above function:
tr_jobs_pct_tour <- read.csv('intermediate/wttc_empd_rgn.csv', stringsAsFactors = FALSE) %>%
 dplyr::select(rgn_id, year, jobs_pct)

## format data to have complete years/regions and convert percentage of jobs to proportion of jobs
rgn_names <- read.csv('https://raw.githubusercontent.com/OHI-Science/ohi-global/draft/eez/spatial/regions_list.csv', stringsAsFactors = FALSE) %>%
    dplyr::select(rgn_id)

# create data frame with a row for each combination of region id and year
rgn_names <- expand.grid(rgn_id = rgn_names$rgn_id, 
                             year= min(tr_jobs_pct_tour$year):max(tr_jobs_pct_tour$year)) 
      
tr_data_raw <- rgn_names %>%
    full_join(tr_jobs_pct_tour %>%
                rename(Ep = jobs_pct) %>%
                mutate(Ep = Ep/100) %>%
                mutate(Ep = ifelse(Ep > 1, NA, Ep)),
              by = c('rgn_id', 'year')) # Some regions (rgn_id 32, 52, and 140 in the 2018 assessment) have apparent errors in some years, with Ep values > 100%; this line sets those values to NA


## gapfill missing data using UN georegion data (loaded from ohicore):
georegions       <- ohicore::georegions
georegion_labels <- ohicore::georegion_labels

tr_data_raw <- tr_data_raw %>%
  left_join(georegions, by = 'rgn_id') %>%
  left_join(georegion_labels, by = 'rgn_id') %>%
  select(-r0) %>%
  filter(!(rgn_id %in% c(255, 213))) # ditch disputed regions (255) and Antarctica (213)

# Calculate two different gapfill columns using r2 and r1
tr_data_raw_gf <- tr_data_raw %>%
  group_by(year, r2) %>%
  mutate(Ep_pred_r2 = mean(Ep, na.rm=TRUE)) %>%
  ungroup() %>%
  group_by(year, r1) %>%
  mutate(Ep_pred_r1 = mean(Ep, na.rm=TRUE)) %>%
  ungroup()

# first gapfill with r2, if no value available use r1; create column indicating whether value was gapfilled and if so, by what method.
tr_data_raw_gf <- tr_data_raw_gf %>%
  mutate(Ep_all = ifelse(is.na(Ep), Ep_pred_r2, Ep)) %>%
  mutate(Ep_all = ifelse(is.na(Ep_all), Ep_pred_r1, Ep_all)) %>% 
  mutate(gapfilled = ifelse(is.na(Ep) & !is.na(Ep_all), "gapfilled", NA)) %>%
  mutate(method = ifelse(is.na(Ep) & !is.na(Ep_pred_r2), "UN georegion (r2)", NA)) %>%
  mutate(method = ifelse(is.na(Ep) & is.na(Ep_pred_r2) & !is.na(Ep_pred_r1), "UN georegion (r1)", method)) 

# save the data
tr_data_gf <- tr_data_raw_gf %>%
  select(rgn_id, year, gapfilled, method) 

write.csv(tr_data_gf, "output/tr_jobs_pct_tourism_gf.csv", row.names=FALSE)

tr_data <- tr_data_raw_gf %>%
  select(rgn_id, year, Ep=Ep_all) 

write.csv(tr_data, "output/tr_jobs_pct_tourism.csv", row.names=FALSE)


## A quick check to make sure last year's values aren't too crazy different
## (NOTE: the source data has been updated, so there are some changes, but they shouldn't be super different)

old <- read.csv('../v2017/output/tr_jobs_pct_tourism.csv') %>%
  select(rgn_id, year, ep_old = Ep)
new <- read.csv('output/tr_jobs_pct_tourism.csv') %>%
  left_join(old, by = c("rgn_id", "year")) %>%
  filter(year == 2017) %>%
  arrange(ep_old)

plot(new$Ep, new$ep_old)
abline(0,1, col="red")
identify(new$Ep, new$ep_old, labels = new$rgn_id)

## NOTE: This looks reasonable.

5 Tw: Travel warnings

The primary source of information is the U.S. State Department (https://travel.state.gov/content/passports/en/alertswarnings.html)

The secondary source of information is the Government of Canada's Travel Advice and Advisories (https://travel.gc.ca/travelling/advisories)

5.0.0.1 2018 Assessment

The U.S. State Department updated the way they report the data. Warnings are now reported for every country (not only those above a certain risk level, as was previously the case), and a numeric scale describes the level of warning rather than keywords.

Data were downloaded on 06/28/2018.

We also looked into the Government of Canada's travel warnings to make sure we capture as many OHI regions as possible. The Canadian government reports 18 additional OHI regions, including the U.S. and U.S. territories.

Data were downloaded on 07/06/2018.

Previously, we identified subregional warnings and modified the penalty based on this information. However, given the State Department's new approach, we will no longer apply this method. We eliminated the "subregion" information/penalties and modified past data to be consistent with our new approach. Future data should be collected without including this information.

5.0.0.2 Notes about getting data:

For future assessments it would be worthwhile to see whether the data can be scraped directly from the website into R. This seems possible given the new format of the State Department travel warning data.
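A possible starting point is sketched below (untested; rvest is not used elsewhere in this script, and the CSS selectors are placeholders that would need to be checked against the page's actual HTML structure):

library(rvest)

warn_url  <- "https://travel.state.gov/content/passports/en/alertswarnings.html"
warn_page <- read_html(warn_url)

## NOTE: "td.country" and "td.advisory-level" are assumed selectors, not the real ones
warn_scraped <- data.frame(
  country = html_text(html_nodes(warn_page, "td.country"), trim = TRUE),
  level   = html_text(html_nodes(warn_page, "td.advisory-level"), trim = TRUE),
  stringsAsFactors = FALSE)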

5.0.0.3 Getting data for 2018 assessment

The following code is used to transform the warnings into a multiplier that is used to calculate tourism and recreation scores:

Step 1: Copy and paste the data for each country from https://travel.state.gov/content/passports/en/alertswarnings.html (for 2019, evaluate scraping the data directly from the website). Paste into an Excel file, convert to .csv, and upload to the raw folder.

Step 2: Wrangle and clean the new data

##Reading and wrangling 2018 data

warn_raw <- read.csv('raw/tr_travelwarning_2018_raw.csv', na.strings = " ") %>% 
  mutate(country = as.character(country)) %>% 
  mutate(level = as.numeric(str_extract(level, '[1,2,3,4]'))) %>% 
  filter(!(regional %in% 1)) %>% # no longer relevant because we are not adjusting penalty based on subregional information
  select(assess_year, level, country) %>% 
  rename(year= assess_year)
 
##Correct regions that are reported together

french_indies <- data.frame(country="French West Indies", 
                            country_new =c("Northern Saint-Martin", "Guadeloupe and Martinique")) %>%
  left_join(filter(warn_raw, country=="French West Indies")) %>%
  select(country=country_new, year, level)

BES <- data.frame(country="Bonaire, St. Eustatius, and Saba (BES)", 
                            country_new =c("Bonaire", "Saba", "Sint Eustatius")) %>%
  left_join(filter(warn_raw, country=="Bonaire, St. Eustatius, and Saba (BES)")) %>%
    select(country=country_new, year, level)

FandM <- data.frame(country="France *Monaco", 
                            country_new =c("France", "Monaco")) %>%
  left_join(filter(warn_raw, country=="France *Monaco")) %>%
    select(country=country_new, year, level)

line <- data.frame(country="Line Islands (Kiribati)", 
                            country_new =c("Line Group", "Phoenix Group")) %>%
  left_join(filter(warn_raw, country=="Line Islands (Kiribati)")) %>%
    select(country=country_new, year, level)

warn_improved <- filter(warn_raw, country != "French West Indies") %>%
  bind_rows(french_indies) 

warn_improved <- filter(warn_improved, country != "Bonaire, St. Eustatius, and Saba (BES)") %>%
  bind_rows(BES) 

warn_improved <- filter(warn_improved, country != "France *Monaco") %>%
  bind_rows(FandM) 

warn_improved <- filter(warn_improved, country != "Line Islands (Kiribati)") %>%
  bind_rows(line) 


##Correct names for regions not identified by name_2_rgn
warn_improved[str_detect(warn_improved$country,"union"), ] # use to check names

warn_improved <- warn_improved %>% 
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Israel"), "Israel", country)) %>%
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Bonaire"), "Bonaire", country)) %>%
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"ROC"), "Republique du Congo", country)) %>% 
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Kinshasha"), "Democratic Republic of the Congo", country)) %>% 
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Domincan"), "Dominican Republic", country)) %>% 
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Honduras"), "Honduras", country)) %>%
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Sao Tome"), "Sao Tome and Principe", country)) %>%
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"Turks and Caicos"), "Turks and Caicos Islands", country)) %>%
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"United Arab Emirates"), "United Arab Emirates", country)) %>% 
  dplyr::mutate(country = ifelse(stringr::str_detect(country,"U.S. Virgin Islands"), "Puerto Rico and Virgin Islands of the United States", country)) %>%
    dplyr::mutate(country = ifelse(stringr::str_detect(country,"Puerto Rico"), "Puerto Rico and Virgin Islands of the United States", country))

##NOTE: Puerto Rico and the Virgin Islands are reported by the Canadian Government Travel Advisory (SEE BELOW before running the script)

Step 3: Join the new data with previous years' assessment data (adjusting previous data to be consistent with this year's data)

In 2018 we adjusted the data for previous years and saved the complete warning list through 2018 in the intermediate folder (warning_complete.csv). This file should be combined with the data downloaded in 2019 (if the data are reported in the same format).

We should be able to skip this section in the 2019 OHI assessment.

apply_level = data.frame(warning = c("inform", "risk", "avoid_nonessential", "avoid_all", "gtfo"), level = c(1, 2, 3, 4, 4)) 

## deleted a couple regions that were duplicated (2016 Haiti and 2017 Kenya)
warn_old <- read.csv('../v2018/raw/tr_travelwarnings_2017.csv') %>% 
  select(year = assess_year, country=rgn_name, inform, risk, avoid_nonessential, avoid_all, gtfo, regional) %>%
  gather("warning", "n", 3:7)  %>%
  filter(!is.na(n)) %>%
  select(-n) %>% 
left_join(apply_level, by="warning") %>%
  select(-warning) %>% 
  arrange(desc(year), country) 

## The old data does not always have data for the larger region when there was a "regional" warning.
## Adding that data here:

warn_old_add_rgns <- warn_old %>%
  rowwise() %>%
  mutate(country_year = paste(country, year, sep="_")) %>%
  arrange(year, country) %>%
  group_by(year, country) %>%
  mutate(region_data = sum(is.na(regional)))

# filter for country-years whose warnings are all subregional (no country-wide warning) so the country as a whole can be given a level 1
warn_old_add_rgns <- filter(warn_old_add_rgns, region_data==0) %>% 
  mutate(regional = 0) %>%
  mutate(level = 1) %>%
  select(year, country, level)

## exclude subregion data from old data
warn_old_no_subs <- filter(warn_old, !(regional %in% 1)) %>%
  select(year, country, level)

#binding old and new data
warn_complete <- warn_improved %>% #2018 data
  bind_rows(warn_old_no_subs) %>% #old with no regional
  bind_rows(warn_old_add_rgns)  #old with regional warning transformed into a level 1 for the whole country


##Save warn_complete. This file will be the one that needs to be combined with the new warnings in 2019.

write.csv(warn_complete, "intermediate/warning_complete.csv", row.names=FALSE)
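For reference, in 2019 the combining step should then reduce to something like the following (a hypothetical sketch, not run; warn_2019 is a placeholder for next year's cleaned warnings in the same year/country/level format):

# warn_complete_2019 <- read.csv("intermediate/warning_complete.csv") %>%
#   bind_rows(warn_2019)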

Step 4: Transform the warnings into a multiplier that is used to calculate tourism and recreation scores.

Travel warning | Multiplier       | Description
Level 1        | 1 (no penalty)   | Exercise Normal Precautions: This is the lowest advisory level for safety and security risk. There is some risk in any international travel.
Level 2        | 0.75             | Exercise Increased Caution: Be aware of heightened risks to safety and security.
Level 3        | 0.25             | Reconsider Travel: Avoid travel due to serious risks to safety and security.
Level 4        | 0 (full penalty) | Do Not Travel: This is the highest advisory level due to greater likelihood of life-threatening risks.
scores <-  data.frame(level = c(1, 2, 3, 4), multiplier = c(1, 0.75, 0.25, 0)) 

warn_multiplier <-  warn_complete %>%  
  left_join(scores, by="level") %>% 
  group_by(year, country) %>%
  mutate(warning_count = n()) %>%
  ungroup()

# In general no region should have more than one warning,
# but Puerto Rico and the Virgin Islands are fine because they were reported separately;
# these will be averaged
filter(warn_multiplier, warning_count>1)

warn_multiplier <- warn_multiplier %>%
  group_by(year, country) %>%
  summarize(multiplier = mean(multiplier))

#Save file with multiplier
write.csv(warn_multiplier, "intermediate/warning.csv", row.names=FALSE)

Step 5: Convert names to OHI regions and clean.

warn <- read.csv("intermediate/warning.csv")

#Add rgn_id
warn_rgn <- name_2_rgn(df_in = warn, 
                       fld_name='country', 
                       flds_unique=c('country','year'))

# Check to see if any regions are duplicated:
sort(table(paste(warn_rgn$year, warn_rgn$rgn_id)))
# China has multiple warnings (rgn_id 209)

# Average China warnings
warn_rgn <- warn_rgn %>%
  group_by(rgn_id, rgn_name, year) %>%
  summarize(multiplier = mean(multiplier)) %>%
  ungroup()

sort(table(paste(warn_rgn$year, warn_rgn$rgn_id)))

Step 6: Identify inhabited regions with no U.S. travel warning data, check whether this information is available from the Canadian Government website, and add those warnings to the raw data csv. Make sure to label the column indicating the data source.

Next year, consider gapfilling missing territorial regions with administrative country data (see the sketch at the end of this step).

NOTE: Canada's warnings use the same levels as the U.S. warnings.

## load georegion labels (from ohicore) into the local environment:
georegion_labels <- ohicore::georegion_labels

##uninhabited regions (no need to worry about these)
uninhabited <- read.csv('https://raw.githubusercontent.com/OHI-Science/ohiprep_v2018/master/src/LookupTables/rgn_uninhabited_islands.csv') %>% 
  filter(est_population < 100) %>% # for this purpose: consider regions with <100 people to be uninhabited
  select(rgn_id)

##Filter for information in the current assessment year
filter_no_us_warn <- warn_rgn %>% 
  filter(year == 2018)

##All inhabited regions that do not have a travel warning
no_us_warn <- georegion_labels %>%
  dplyr::left_join(filter_no_us_warn, by = 'rgn_id')%>% 
  filter(is.na(multiplier)) %>% 
  filter(!(rgn_id %in% uninhabited$rgn_id)) %>% 
  select(rgn_id, rgn_name=rgn_label)

##Save regions with no data in intermediate folder
write.csv(no_us_warn, 'intermediate/v2018_no_data.csv', row.names=FALSE)
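If missing territorial regions are gapfilled with administrative country data next year (as suggested above), the logic might look something like this (a hypothetical sketch, not run; territory_lookup is a placeholder table mapping each territory's rgn_id to its administrative country's rgn_id, and no such table exists in this script):

# territory_gf <- no_us_warn %>%
#   left_join(territory_lookup, by = "rgn_id") %>%   # adds an admin_rgn_id column
#   left_join(select(warn_rgn, admin_rgn_id = rgn_id, year, multiplier),
#             by = "admin_rgn_id")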

The final step is to compare with the previous year's data.

Many European regions now have a travel warning due to increased terrorism (e.g., United Kingdom, Italy, Spain, Germany), although this doesn't show up in the following figure because these regions previously had no travel warning (and were thus NA).

The change to not penalizing subregional warnings tended to reduce the penalty (i.e., increase the multiplier value).

# The following indicates changes over time as well as changes to the State Department's approach to quantifying risk
tmp <- warn_rgn %>%
  spread(year, multiplier) %>%
  data.frame()

plot(jitter(tmp$X2017), jitter(tmp$X2018))
abline(0,1, col="red")


### compare against last year's data
### these changes will reflect changes due to removing subregional warnings:
old <- read.csv("../v2017/output/tr_travelwarnings.csv") %>%
  filter(year==2017) %>%
  left_join(tmp, by="rgn_id")


plot(jitter(old$multiplier), jitter(old$X2017))
abline(0,1, col="red")

5.0.1 Save the travel warning data in the output folder

georegions <- georegion_labels %>%
  select(rgn_id)
  

warn_rgn_spread <- warn_rgn %>%
  spread(year, multiplier) %>%
  full_join(georegions, by=c("rgn_id")) %>%
  data.frame() %>%
  gather(year, multiplier, starts_with("X")) %>%
  mutate(year = gsub("X", "", year)) %>%
  mutate(multiplier = ifelse(is.na(multiplier), 1, multiplier)) %>% # assign a multiplier of 1 to all regions with no warnings
  filter(rgn_id <= 250) %>%
  filter(rgn_id != 213) 
  
#should be 220 regions for each year
table(warn_rgn_spread$year)

warn_rgn_all_rgns <- warn_rgn_spread %>%
  select(rgn_id, year, multiplier) %>%
  arrange(year, rgn_id)

write.csv(warn_rgn_all_rgns, 'output/tr_travelwarnings.csv', row.names=FALSE)

## Create gapfill file. No gapfilling in this case, so gapfilled = 0

travelwarning_gf <- read.csv("output/tr_travelwarnings.csv") %>% 
  mutate(multiplier = 0) %>% 
  rename(gapfilled = multiplier)

write.csv(travelwarning_gf, 'output/tr_travelwarnings_gf.csv', row.names = FALSE)

6 Ts: Tourism sustainability

These data are from the World Economic Forum’s “Travel and Tourism Competitiveness Report” (http://reports.weforum.org/travel-and-tourism-competitiveness-report-2017/downloads/) See mazu: _raw_data/WEF-Economics/ for more details and the raw data.

These data are not compatible across years, so only the most recent year of data is used across all scenario years.

These data are gapfilled using gdppcppp and UN georegion information (see next section for obtaining and preparing these data).

# read in files
ttci_raw <- read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WEF-Economics/d2018/WEF_TTCR17_data_for_download.csv")) %>%
  dplyr::filter(row_number() %in% c(2,598)) %>% # identified rows and columns of interest
  dplyr::select(11:156) %>% 
  t() %>%
  data.frame() %>% 
  remove_rownames() %>%
  rename(country = "X1", score= "X2") %>% 
  mutate(score = as.numeric(as.character(score))) %>% 
  mutate(country = as.character(country))
  
# Checking that the imported data are the same as in 2017!
ttci_v2017 <-read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WEF-Economics/d2017/wef_ttci.csv")) %>% 
  rename(score_old= score) %>%
  arrange(country) %>% 
  left_join(ttci_raw)

plot(ttci_v2017$score, ttci_v2017$score_old)
abline(0,1, col="red") # Yes! The data have not been updated; the only difference is that we now import the raw data from an Excel file instead of a PDF


#Changing names that are not recognized by ohicore
ttci <- ttci_raw %>%
    mutate(country = ifelse(country == "Congo, Democratic Rep.", "Democratic Republic of the Congo", country)) %>%
    mutate(country = ifelse(str_detect(country, "Ivoire"), "Ivory Coast", country))
  
  
ttci_rgn <- name_2_rgn(df_in = ttci, 
                       fld_name='country')

## Duplicated regions: take a population-weighted mean
weight_data <- data.frame(country = c("China", "Hong Kong SAR"),
                          population = c(1379000000, 7347000))

ttci_rgn <- ttci_rgn %>%
  arrange(country) %>%
  left_join(weight_data, by = "country") %>%
  mutate(population = ifelse(is.na(population), 1, population)) %>%
  group_by(rgn_id, rgn_name) %>%
  summarize(score = weighted.mean(score, population)) %>%
  select(rgn_id, rgn_name, score)

head(ttci_rgn, 10)

### Save TTCI data file
write_csv(ttci_rgn, 'intermediate/wef_ttci.csv')

6.1 Preparing the gdppcppp data:

These data are used to gapfill missing values in tourism sustainability. Most of the data are from the World Bank, but CIA data fill some gaps (CIA data are available only for the most recent year).

The Artisanal Opportunities goal uses gdppcppp data, so we will get the data that was processed for that goal. (NOTE: Update the Artisanal Opportunities goal prior to preparing these data)

wb <- read.csv("../../ao/v2018/intermediate/gdppcppp_ohi.csv") %>%
  dplyr::select(rgn_id, year, value)

CIA data are used to fill gaps in the gdppcppp data (https://www.cia.gov/library/publications/the-world-factbook/rankorder/2004rank.html)

Downloaded: 7/16/2018

See the README in the raw folder for instructions on how to download these data. Note: for next year's assessment, evaluate the option of scraping the data directly from the source website.

The following code is used to prepare these data for OHI:

cia_gdp <- read.csv('raw/cia_gdp_pc_ppp.csv', stringsAsFactors = FALSE, header = FALSE) %>% 
  separate(V1, c("number", "other"), sep = " ", extra = "merge") %>% 
  separate(other, c("country", "gdppcppp"), sep = "\\$") %>%
  mutate(country = str_trim(country, side = "both")) %>% 
  mutate(country = noquote(country)) %>% 
  mutate(country = as.character(country)) %>% 
  mutate(gdppcppp = as.numeric(gsub(",", "", gdppcppp))) %>% 
  select(-number) %>% 
  data.frame()

##Data reported at a coarser resolution than OHI regions
splits <- data.frame(country = "Saint Helena, Ascension, and Tristan da Cunha", 
                     country2 = c("Saint Helena", "Ascension", "Tristan da Cunha")) %>%
  mutate(country = as.character(country),
         country2 = as.character(country2))

cia_gdp <- cia_gdp %>%
  left_join(splits, by='country') %>%
  mutate(country2 = ifelse(is.na(country2), country, country2)) %>%
  select(country=country2, pcgdp_cia = gdppcppp)


cia_gdp_rgn <- name_2_rgn(df_in = cia_gdp, 
                       fld_name='country')

### Duplicated regions: collapse after weighting by population (regions we include as a single region)

population_weights <- data.frame(country = c("Virgin Islands", "Puerto Rico",
                                             "China", "Hong Kong", "Macau",
                                             "Guam", "Northern Mariana Islands"),
                                 population = c(106405, 3725789,
                                         1339724852, 7071576, 636200,
                                         162896, 55023))

cia_gdp_rgn <- cia_gdp_rgn %>%
  left_join(population_weights, by="country") %>%
  mutate(population = ifelse(is.na(population), 1, population)) %>%
  group_by(rgn_id) %>%
  summarize(pcgdp_cia = weighted.mean(pcgdp_cia, population)) %>%
  ungroup() %>%
  filter(rgn_id <= 250) %>%
  select(rgn_id, pcgdp_cia)

write.csv(cia_gdp_rgn, "intermediate/wb_rgn_cia_GDPPCPPP.csv", row.names=FALSE)

The following code combines the two gdp datasets and gapfills missing regions using UN georegions.

If there is no World Bank gdppcppp data (pcgdp), the CIA data are used (pcgdp_cia). The pcgdp2 variable includes both the World Bank and CIA data (with CIA data used only when World Bank data are missing). The remaining data are estimated using UN geopolitical regions. Ideally, the mean gdppcppp value is calculated at the r2 scale (gdp_pred_r2) using regions within each class with gdppcppp data. If there are not enough regions with data at the r2 scale, the average at the r1 scale is used (gdp_pred_r1). The gdp_all variable combines all estimates using the following hierarchy: World Bank -> CIA -> estimated using the mean of the r2 UN geopolitical region -> estimated using the mean of the r1 UN geopolitical region.

### world bank gdp data
gdppcppp <- wb %>%
  select(rgn_id, year, pcgdp = value)

### cia gdp data
gdppcppp2 <- read.csv('intermediate/wb_rgn_cia_GDPPCPPP.csv')


### Use WB data, but if missing, use pcgdp_cia.
### combine with UN georegion data
years <- data.frame(year = min(gdppcppp$year):max(gdppcppp$year))

georegions <- ohicore::georegions

regions <- georegions %>%
  left_join(georegion_labels, by = 'rgn_id')

gdp_raw <- merge(years, regions, by=NULL) %>%
   left_join(gdppcppp, by = c('rgn_id', 'year')) %>%
  left_join(gdppcppp2, by = c("rgn_id")) 

## quick compare to make sure the CIA and World Bank data are compatible
plot(gdp_raw$pcgdp[gdp_raw$year==2017], gdp_raw$pcgdp_cia[gdp_raw$year==2017])
abline(0,1, col="red")
# a bit of scatter for a few regions, but overall looks good

gdp_raw <- gdp_raw %>%
  mutate(pcgdp2 = ifelse(is.na(pcgdp), pcgdp_cia, pcgdp))

##Calculating the means across different geopolitical levels (e.g. r2, r1)
gdp_raw <- gdp_raw %>%
  group_by(r2, year) %>%
  mutate(gdp_pred_r2 = mean(pcgdp2, na.rm=TRUE)) %>%
  ungroup() %>%
  group_by(r1, year) %>%
  mutate(gdp_pred_r1 = mean(pcgdp2, na.rm=TRUE)) %>%
  ungroup() 

gdp_raw_gf <- gdp_raw %>%
  mutate(gdp_all = ifelse(is.na(pcgdp2), gdp_pred_r2, pcgdp2)) %>%
  mutate(gdp_all = ifelse(is.na(gdp_all), gdp_pred_r1, gdp_all)) %>%
  mutate(gapfilled = ifelse(is.na(pcgdp2) & !is.na(gdp_all), "gapfilled", NA)) %>%
  mutate(method = ifelse(is.na(pcgdp2) & !is.na(gdp_pred_r2), "UN georegion (r2)", NA)) %>%
  mutate(method = ifelse(is.na(pcgdp2) & is.na(gdp_pred_r2) & !is.na(gdp_pred_r1), "UN georegion (r1)", method)) 

write_csv(gdp_raw_gf, "intermediate/gdp_raw_gf.csv")

gdp_data_gf <- gdp_raw_gf %>%
  select(rgn_id, year, gapfilled, method) 

write_csv(gdp_data_gf, "intermediate/gdp_gf.csv")

gdp_data <- gdp_raw_gf %>%
  select(rgn_id, year, pcgdp = gdp_all)

write_csv(gdp_data, "intermediate/gdp.csv")

The final step is gapfilling the Sustainability data using a linear model with gdppcppp and UN geopolitical regions as predictor variables.

sust  <- read.csv('intermediate/wef_ttci.csv', stringsAsFactors = FALSE)

### don't need to gapfill regions without tourism data:
## most recent tourism data year used is 2017 (see year_max above)

ep_gf <- read.csv("output/tr_jobs_pct_tourism.csv") %>%
  filter(year == 2017) %>%
  select(rgn_id, Ep) %>%
  filter(!is.na(Ep))

# gdp dataframe prepared above (World Bank, CIA, and gapfilled gdp data)
gdp_raw_gf <- read.csv("intermediate/gdp_raw_gf.csv", stringsAsFactors = FALSE) %>% 
  filter(year == 2017) %>%
  select(rgn_id, r0_label, r1_label, r2_label, rgn_label, pcgdp, pcgdp_cia, pcgdp2, gdp_all) 

tr_sust <- gdp_raw_gf %>%
           left_join(sust, by = c("rgn_id")) %>%
          left_join(ep_gf, by = c("rgn_id")) %>%  
          rename(S_score = score) %>%
          filter(rgn_id != 213)

### Add gapfill flag variable 
##Reminder:
## pcgdp2: includes both the World Bank and CIA data (with CIA data only used if there is not World Bank data)
## Ep: Proportion of workforce directly employed in tourism
##S_score: tourism sustainability score

tr_sust_gf <- tr_sust %>%
  mutate(gapfilled = ifelse(is.na(S_score) & !is.na(Ep), "gapfilled", NA)) %>%
  mutate(method = ifelse(is.na(S_score) & !is.na(Ep) & is.na(pcgdp2), "lm georegion + gdppcppp, with est. gdppcppp", NA)) %>%
  mutate(method = ifelse(is.na(S_score) & !is.na(Ep) & !is.na(pcgdp2), "lm georegion + gdppcppp", method)) %>%
  select(rgn_id, gapfilled, method)

write.csv(tr_sust_gf, "output/tr_sustainability_gf.csv", row.names=FALSE)

6.1.1 Gapfilling

Linear models using gdppcppp and UN geopolitical regions as predictor variables. However, if there is no gdppcppp data for a region, we first estimate gdppcppp using the UN georegions (above) and then use that estimate in the linear model to gapfill the sustainability score.

### Gapfill S using r1 and/or r2 regional data and PPP-adjusted per-capita GDP
### Looked at models with a year variable, but wasn't significant and decided to exclude

mod3 <- lm(S_score ~ as.factor(r2_label) + gdp_all, data=tr_sust, na.action = na.exclude)
summary(mod3)
anova(mod3)

mod4 <- lm(S_score ~ as.factor(r1_label) + gdp_all, data=tr_sust, na.action = na.exclude)
summary(mod4)
anova(mod4)

plot(predict(mod3), tr_sust$S_score)
abline(0,1)
plot(predict(mod4), tr_sust$S_score)
abline(0,1)


## Estimate missing data and gapfill
# Some r2/r1 levels have no S_score data, which would cause predict() to fail.
# The following code sets those levels to NA so that predict() returns NA instead.

# Keep the r2 label and gdp columns for prediction
new_data <- tr_sust %>% 
  dplyr::select(r2_label, gdp_all)

unique(tr_sust$r2_label)

r2_w_data <- unique(tr_sust$r2_label[!is.na(tr_sust$S_score)])
  
new_data_r2 <- new_data %>%
  mutate(r2_label = ifelse(r2_label %in% r2_w_data, r2_label, NA))

# Predict sustainability scores using linear model 3 (using r2 data)
tr_sust <- tr_sust %>% 
  dplyr::mutate(S_score_pred_r2 = predict(mod3, newdata = new_data_r2))


# Keep the r1 label and gdp columns for prediction
new_data <- tr_sust %>% 
  dplyr::select(r1_label, gdp_all)

unique(tr_sust$r1_label)

r1_w_data <- unique(tr_sust$r1_label[!is.na(tr_sust$S_score)])

new_data_r1 <- new_data %>%
  mutate(r1_label = ifelse(r1_label %in% r1_w_data, r1_label, NA))

# Predict sustainability scores using linear model 4 (using r1 data)
tr_sust <- tr_sust %>% 
  dplyr::mutate(S_score_pred_r1 = predict(mod4, newdata = new_data_r1))



## some are missing the r1 predictions, but none of these have Ep scores, so not relevant
filter(tr_sust, is.na(S_score_pred_r1))

tr_sust <- tr_sust %>%
  mutate(S_score_2 = ifelse(is.na(S_score), S_score_pred_r2, S_score)) %>%
  mutate(S_score_2 = ifelse(is.na(S_score_2), S_score_pred_r1, S_score_2)) %>%
  mutate(year = '2018') %>%
#  filter(!is.na(Ep)) %>%
  select(rgn_id, year, S_score=S_score_2)

summary(tr_sust)

write_csv(tr_sust, "output/tr_sustainability.csv")

6.2 Compare with previous year of data

compare <- read.csv("../v2017/output/tr_sustainability.csv") %>% # change year
  select(rgn_id, old_S_score = S_score) %>%
  left_join(tr_sust, by = "rgn_id") %>%
  mutate(dif = old_S_score - S_score)

plot(compare$S_score, compare$old_S_score)
abline(0, 1, col="red")
#looks reasonable