

1 Summary

This document describes the steps for obtaining the data used to calculate the tourism and recreation goal for the 2023 global assessment.

The general calculation is: tr = Ap * Sr, and Xtr = tr / (90th quantile of tr across regions)

  • Ap = Proportion of arrivals to area of coastline to population (formerly Ep when it represented employment)
  • Sr = (S-1)/5; Sustainability of tourism
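
For orientation, here is a minimal sketch of that calculation in R; the scores data frame and its Ap and Sr columns are hypothetical placeholders for the layers prepared below, and the 90th-quantile reference point follows the formula above.

library(dplyr)

# hypothetical per-region layer values (Ap and Sr as defined above)
scores <- data.frame(rgn_id = 1:3,
                     Ap = c(0.2, 0.8, 0.5),
                     Sr = c(0.6, 0.7, 0.9))

tr_status <- scores %>%
  mutate(tr = Ap * Sr, # tr = Ap * Sr
         Xtr = pmin(tr / quantile(tr, probs = 0.90, na.rm = TRUE), 1)) # rescale by the 90th quantile across regions, capping at 1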

1.1 The following data are used:

  • Numbers of tourist arrivals, used in the calculation of the proportion of arrivals to area of coastline to population: obtained through the UNWTO (in the form of thousands of people). More info on tourism terms here. Range: 1995-2021 (only 2008-2021 is used)
  • Area of coastline, used in the calculation of the proportion of arrivals to area of coastline to population: calculated in the LSP goal. Range: Static
  • Population, used in the calculation of the proportion of arrivals to area of coastline to population: primarily World Bank data obtained through the WDI() function, combined with Our World in Data and Statista data (1, 2, and 3) (obtained from their respective websites) to get population data for all OHI regions with arrivals and area of coastline data. Range: World Bank - 1960-2022 (2008-2021 is selected and used); Our World in Data - 10,000 BCE-2021 (2008-2021 is used); Statista - 2011-2023 (2011-2021 is used; as of v2023, 3 OHI regions use this data).
  • Tourism sustainability: World Economic Forum, The Travel & Tourism Development Index 2021 dataset (version 24 May 2022), referred to as the TTDI
  • Per capita GDP: World Bank data, with gaps filled using CIA data; used to gapfill missing values in tourism sustainability

2 Updates from previous assessment

2.1 Tourism sustainability

None in v2023. Copied data from v2022.

2.2 Tourist arrivals (formerly tourism employment)

The data previously used for this layer are now paywalled, so we have replaced the WTTC data with UNWTO data. This change puts us a year behind the old data source in terms of the most recent available year, but it is unavoidable. We also decided to use UNWTO tourist arrivals data instead of UNWTO tourism employment data because the employment data contain a substantial amount of missing values and there are no sufficient gapfilling methods for them.

We were able to update the following data:

  • Proportion of tourist arrivals to area of coastline to population: UNWTO data on thousands of tourist arrivals, reported through 2021 (downloaded here on 8/10/2023; dataset: “Total arrivals” under Inbound Tourism, using the “Overnights visitors (tourists)” categorization as arrivals where possible). This is used in combination with area of coastline calculated in the LSP goal (update completed 6/26/2023) and population data, primarily from the World Bank via the WDI() function (downloaded each time the function is run; most recently pulled 9/12/2023), gapfilled with population data from Our World in Data and Statista (1, 2, and 3), downloaded 9/7/2023.

2.3 Initial set-up code

# library(devtools)
# devtools::install_github("ohi-science/ohicore@dev") # don't worry about devtools
library(ohicore)
library(tidyverse)
library(stringr)
library(WDI)
library(here)
library(janitor)
library(plotly)
library(readxl)
library(naniar)

version_year <- "2023"
prev_ver_yr <- as.character(as.numeric(version_year) - 1)

source(paste0("http://ohi-science.org/ohiprep_v", version_year, "/workflow/R/common.R"))
region_data()
regions_shape()
#source(here(paste0("globalprep/tr/v", version_year, "/R/tr_fxns.R"))) # not used presently

3 Ap: Proportion of tourist arrivals to area of coastline to population

We use international arrivals data from the United Nations World Tourism Organization (UNWTO). Until the current assessment, we accessed data from the World Travel & Tourism Council (WTTC), but this is no longer a viable option. The UNWTO data are provided as thousands of people arriving, which we divide by each country's area of coastline and then by its population. Missing values in arrivals (for our purposes, “Overnights visitors (tourists)”) are gapfilled using “Total arrivals” minus “Same-day visitors (excursionists)” where available or, secondarily, by downfilling or upfilling (prioritizing downfilling) from years that have data. Missing values in the primary population data source are gapfilled using two other data sources, and one data source's years are gapfilled by upfilling. We were not able to gapfill further with other methods at this time.
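
As a rough illustration of that arrivals gapfilling order (the real logic lives in R/process_UNWTO_arrivals.R, which is sourced below; the example data frame and its column names are hypothetical):

library(dplyr)
library(tidyr)

# hypothetical UNWTO extract for one region (values in thousands):
# tourists = "Overnights visitors (tourists)", total = "Total arrivals",
# same_day = "Same-day visitors (excursionists)"
unwto_example <- data.frame(rgn_id   = 1,
                            year     = 2018:2021,
                            tourists = c(100, NA, NA, 120),
                            total    = c(150, 160, NA, 170),
                            same_day = c(50, 55, NA, 50))

unwto_example_gf <- unwto_example %>%
  mutate(tourism_arrivals_ct = coalesce(tourists, total - same_day)) %>% # prefer tourists; otherwise total minus same-day visitors
  group_by(rgn_id) %>%
  arrange(year, .by_group = TRUE) %>%
  fill(tourism_arrivals_ct, .direction = "downup") %>% # then fill down first, then up, within each region
  ungroup()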

3.0.1 Calculating the proportion of tourist arrivals to area of coastline to population

The number of arrivals relative to the area of coastline is used as a proxy for the value that people place on experiencing and enjoying coastal areas. We gather tourist arrival counts and convert them to a proportion by dividing by the area of coastline and by the population of the associated OHI region.

3.0.2 Source cleaned and gapfilled arrivals, area of coastline, and population data sources

# source in cleaned UNWTO data for current version year (make sure to download from website and put on Mazu in the UNWTO folder first)
source(here(paste0("globalprep/tr/v", version_year, "/R/process_UNWTO_arrivals.R"))) # outputs unwto_dupe_fix_downup_gf

# source in prepared area of coastline data: uses current version year for these files by default (acquire up-to-date LSP data if applicable)
source(here(paste0("globalprep/tr/v", version_year, "/R/process_area_of_coastline.R"))) # outputs inland_offshore

# source in prepared populations data by ohi region: uses current version year for these files by default (download up-to-date data if applicable)
source(here(paste0("globalprep/tr/v", version_year, "/R/process_populations.R"))) # outputs combined_pops_filled

# check outputs of everything!

3.0.3 Divide tourist arrival count data by area of coastline and divide again by population

# divide the number of tourist arrivals by area of coastline and population to get the proportion
tourism_props <- unwto_dupe_fix_downup_gf %>%
  left_join(inland_offshore, by = c("rgn_id", "year")) %>%
  left_join(combined_pops_filled, by = c("rgn_id", "year")) %>%
  mutate(Ap = (tourism_arrivals_ct/total_inland_offshore_area/population)) %>%
  filter(year >= 2008) # filter to the years we are interested in for any data that isn't filtered yet

# check out things so far
summary(tourism_props) # should be 0 NAs for Ap if gapfilling worked

# remove unnecessary columns after checking summary
tourism_props <- tourism_props %>%
  select(-tourism_arrivals_ct, -total_inland_offshore_area, -population)

3.0.4 Rescale the proportions to be between 0 and 1

# rescale so that the 90th quantile of Ap becomes 1 (values above it are capped at 1 below)
quantile_marker <- quantile(tourism_props$Ap, probs = 0.90, na.rm = TRUE) # 90th quantile = 1
tourism_props$Ap_rescaled <- tourism_props$Ap / quantile_marker # divide all values by this value
tourism_props$Ap_rescaled <- pmin(tourism_props$Ap_rescaled, 1) # any above 1 still rescale to 1

# exploratory checks
hist(tourism_props$Ap_rescaled, main = "Distribution of Rescaled Ap Values") # check out the distribution
perfect_scores <- tourism_props %>% filter(Ap_rescaled == 1) # get the times the Ap = 1
unique(perfect_scores$rgn_id) # which countries have at least one Ap value = 1?

# make rescaled column the actual value column
tourism_props_rescaled <- tourism_props %>%
  select(-Ap) %>%
  rename(Ap = Ap_rescaled)

3.0.5 Removing low population / uninhabited regions

### after gap-filling, make sure low/uninhabited regions are NA
# create df for unpopulated/low populated regions
low_pop()
low_pop <- low_pop %>%
  filter(est_population < 3000 | is.na(est_population)) %>%  # filter out regions that have populations > or equal to 3000 and keep NA values 
  rename(rgn_label = rgn_nam)

summary(tourism_props_rescaled)
# v2020 371 NAs
# v2022 114 NAs
# v2023 0 NAs (because of gapfilling)

# make sure all the NAs are uninhabited regions
tourism_props_nas <- tourism_props_rescaled %>% 
  filter(is.na(Ap)) %>% 
  select(rgn_id, year) %>% # v2023: was rgn_id, year, r1_label, r2_label, rgn_label but we did not gapfill by georegion so did not have the last 3
  left_join(low_pop, by = "rgn_id") # v2023: was by = c("rgn_id", "rgn_label") but we did not gapfill by georegion so did not have rgn_label

tourism_props_nas %>% 
  filter(Inhabited == 0 & !is.na(est_population)) %>% 
  nrow() # 0 ✓

max(tourism_props_nas$est_population, na.rm=TRUE) < 3000 # should be true

# make sure all the uninhabited regions are NA and then drop them (all NA regions get added back as a last step just in case it's relevant)
tourism_props_rescaled <- tourism_props_rescaled %>% 
  mutate(Ap = ifelse(rgn_id %in% low_pop$rgn_id, NA, Ap)) 


# check NAs once more 
summary(tourism_props_rescaled)
# v2019: Adding the low pop df identifies 13 additional regions that should be NA instead of gapfilled, taking the total number of NAs in the data set from 245 to 700
# v2020: Adding the low pop df takes the total number of NAs in the data set from 371 to 832
# v2022: Adding the low pop df takes the total number of NAs in the data set from 14 to 40
# v2023: Adding the low pop df takes the total number of NAs in the data set from 0 to 14

# after checking NAs, get rid of them
tourism_props_rescaled <- tourism_props_rescaled %>% 
  drop_na(Ap)

3.0.6 Write output files

# we want to make sure all OHI regions are present in the data, even if we could not calculate Ap (this will add back in what we removed above as well as any region not present in the data)
year_range <- unique(tourism_props_rescaled$year) # get the year range of Ap
year_range_df <- data.frame(year = year_range) # make it a dataframe

eez_to_combine <- rgns_eez %>%
  select(rgn_id) %>% # get all OHI regions
  crossing(., year_range_df) # expand OHI regions so there's one entry for each applicable year

# we did not maintain info on regions with population data but no arrivals data since we joined population to regions with arrivals data, so let's get that back just to have complete info on our data
countries_w_pop_data_no_arrivals <- setdiff(combined_pops_filled$rgn_id, unwto_dupe_fix_downup_gf$rgn_id) # find regions this is the case for
countries_w_pop_data_no_arrivals_df <- combined_pops_filled %>%
  filter(rgn_id %in% countries_w_pop_data_no_arrivals) %>%
  select(rgn_id, year, population) # get population data for the regions with no arrivals data but yes population data, that way later these will be the only regions with that data and we can use that to set population_method
  
  
tourism_props_rescaled_eez <- eez_to_combine %>%
  left_join(countries_w_pop_data_no_arrivals_df, by = c("rgn_id", "year")) %>% # add in the population data to later create new column
  left_join(tourism_props_rescaled, by = c("rgn_id", "year")) %>% # add this to all regions, so we get Ap values for regions that have them and NAs for ones that don't
  mutate(population_method = ifelse(!is.na(population), "WDI-WB", population_method)) %>% # set method based on having population data (this will be the regions with no arrivals data but yes population data)
  select(-population) %>% # don't need this anymore
  mutate(coastline_method = "LSP",
         coastline_gapfilled = NA) # adding gf columns for coastline data for consistency but we have data for all OHI regions, so it's all the same
  

# save gapfill info
tourism_props_gf_to_write <- tourism_props_rescaled_eez %>%
  select(-Ap) # don't need actual values for the gapfill information

write_csv(tourism_props_gf_to_write, here(paste0("globalprep/tr/v", version_year, "/output/tr_arrivals_props_tourism_gf.csv")))

# save gap-filled data
tourism_props_to_write <- tourism_props_rescaled_eez %>%
  select(rgn_id, year, Ap) # don't need gf info here, just the values

write_csv(tourism_props_to_write, here(paste0("globalprep/tr/v", version_year, "/output/tr_arrivals_props_tourism.csv")))

3.0.7 Look at changes in recent years

We would expect tourist arrivals to decrease across the board from 2019 to 2020 given the pandemic, and likely to rebound to some extent between 2020 and 2021; let's make sure that's reflected in our results.

tourism_props_compare <- tourism_props_to_write %>%
  mutate(year = as.numeric(as.character(year))) %>%
  filter(year >= 2019) %>%
  pivot_wider(names_from = year, values_from = Ap)

# compare 2019 and 2020
plot(tourism_props_compare$"2019", tourism_props_compare$"2020",
     xlab = "v2023 2019 Arrivals Proportion", ylab = "v2023 2020 Arrivals Proportion")
abline(0, 1) # more data below the line

# compare 2020 and 2021
plot(tourism_props_compare$"2020", tourism_props_compare$"2021",
     xlab = "v2023 2020 Arrivals Proportion", ylab = "v2023 2021 Arrivals Proportion")
abline(0, 1) # more data above the line

Everything looks reasonable.

3.0.8 Look at changes vs. previous data source (v2023)

new_data <- read_csv(paste0("globalprep/tr/v", version_year, "/output/tr_arrivals_props_tourism.csv"))
old_data <- read_csv(paste0("globalprep/tr/v", prev_ver_yr, "/output/tr_jobs_pct_tourism.csv"))

compare_common_data <- new_data %>%
  left_join(old_data, by = c("rgn_id", "year")) %>%
  drop_na()

plot(compare_common_data$Ep, compare_common_data$Ap,
     xlab = "v2022 Employment Proportion", ylab = "v2023 Arrivals Proportion")
abline(0, 1)




compare_common_data_2021 <- new_data %>%
  left_join(old_data, by = c("rgn_id", "year")) %>%
  drop_na() %>%
  filter(year == 2021)

plot(compare_common_data_2021$Ap, compare_common_data_2021$Ep,
     xlab = "v2023 Arrivals Proportion", ylab = "v2022 Employment Proportion")
abline(0, 1)

compare_common_data_2020 <- new_data %>%
  left_join(old_data, by = c("rgn_id", "year")) %>%
  drop_na() %>%
  filter(year == 2020)

plot(compare_common_data_2020$Ap, compare_common_data_2020$Ep,
     xlab = "v2023 Arrivals Proportion", ylab = "v2022 Employment Proportion")
abline(0, 1)

compare_common_data_2019 <- new_data %>%
  left_join(old_data, by = c("rgn_id", "year")) %>%
  drop_na() %>%
  filter(year == 2019)

plot(compare_common_data_2019$Ap, compare_common_data_2019$Ep,
     xlab = "v2023 Arrivals Proportion", ylab = "v2022 Employment Proportion")
abline(0, 1)

compare_common_data_2015 <- new_data %>%
  left_join(old_data, by = c("rgn_id", "year")) %>%
  drop_na() %>%
  filter(year == 2015)

plot(compare_common_data_2015$Ap, compare_common_data_2015$Ep,
     xlab = "v2023 Arrivals Proportion", ylab = "v2022 Employment Proportion")
abline(0, 1)

3.0.9 Check out some specific countries (v2023) – this was for exploring changes in methodology; skip or adapt parts of it in future years

# check some countries that changed a lot in v2023's first push to global
check_countries_graph <- tourism_props %>% 
  filter(rgn_id == 24 | rgn_id == 51 | rgn_id == 189 | rgn_id == 118 | rgn_id == 31) %>%
  mutate(rgn_id_plot = as.factor(rgn_id),
         year_plot = as.Date(year, "%Y"))

Ap_graph <- ggplot(check_countries_graph, aes(x = year_plot, y = Ap, color = rgn_id_plot)) +
  geom_line() +
  theme_minimal() +
  labs(x = "",
       color = "Region ID")

Ap_rescaled_graph <- ggplot(check_countries_graph, aes(x = year_plot, y = Ap_rescaled, color = rgn_id_plot)) +
  geom_line() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "Year",
       y = "Rescaled Ap")

library(patchwork)
Ap_graph / Ap_rescaled_graph

countries_in_2021 <- check_countries_graph %>%
  filter(year == "2021") %>%
  left_join(rgns_eez, by = "rgn_id") %>%
  select(-year_plot, -admin_rgn_id, -admin_country_name, -Notes, -rgn_id_plot, -eez_iso3, -territory)

library(kableExtra)
kable(countries_in_2021) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# check what countries have really large Ap before rescaling and check the inland + offshore area of those (to see if a small area is skewing things)
top_Ap_values <- tourism_props %>%
  left_join(inland_offshore, by = c("rgn_id", "year")) %>%
  group_by(rgn_id) %>%
  summarize(max_Ap = max(Ap),
            total_inland_offshore_area = mean(total_inland_offshore_area)) %>%
  arrange(desc(max_Ap)) %>%
  left_join(rgns_eez, by = "rgn_id") %>%
  select(rgn_name, max_Ap, total_inland_offshore_area)

kable(top_Ap_values, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# prepare total area (land + 3 nm offshore) of each ohi region
library(sf)
regions_no_geom <- regions %>%
  st_drop_geometry() %>%
  filter(rgn_type == "land") %>%
  select(rgn_id, area_km2) %>%
  left_join(offshore_data, by = "rgn_id", relationship = "many-to-many") %>%
  select(-a_prot_km2, -rgn_name) %>%
  mutate(area_tot = area_km2 + a_tot_km2,
         year = as.character(year)) %>%
  select(-area_km2, -a_tot_km2)

# combine inland + offshore data, population, total area, arrivals, original Ap (arrivals/coastline), and some tests of other Ap calculations
top_Ap_values_2021 <- tourism_props %>% # can adapt this code to find countries to manually add population data for
  ungroup() %>%
  left_join(combined_pops_filled, by = c("rgn_id", "year")) %>%
  select(rgn_id, year, population, Ap) %>%
  left_join(unwto_dupe_fix_downup_gf, by = c("rgn_id", "year")) %>%
  select(-arrivals_method) %>%
  left_join(rgns_eez, by = c("rgn_id")) %>%
  select(rgn_id, rgn_name, tourism_arrivals_ct, Ap, year) %>%
  left_join(inland_offshore, by = c("rgn_id", "year")) %>%
  left_join(combined_pops_filled, by = c("rgn_id", "year")) %>%
  left_join(regions_no_geom, by = c("rgn_id", "year"), relationship = "many-to-many") %>%
  select(rgn_id, rgn_name, Ap, year, area_tot, population, total_inland_offshore_area, tourism_arrivals_ct) %>%
  group_by(rgn_id) %>%
  arrange(rgn_id, year) %>%
  mutate(tourism_arrivals_ct = ifelse(year == 2021 & is.na(tourism_arrivals_ct), lag(tourism_arrivals_ct, n = 2), tourism_arrivals_ct)) %>%
  fill(tourism_arrivals_ct, .direction = "downup") %>%
  ungroup() %>%
  filter(year == 2021) %>%
  group_by(rgn_id) %>%
  mutate(Ap_w_pop_and_tot_area = (tourism_arrivals_ct/total_inland_offshore_area/population/area_tot),
         Ap_w_pop = (tourism_arrivals_ct/total_inland_offshore_area/population), # will be slightly different than the Ap value because this code gapfills arrivals (before dividing anything), while the Ap values gapfilled Ap values after dividing everything; can alter the code if you want a different comparison
         Ap_w_tot_area = (tourism_arrivals_ct/total_inland_offshore_area/area_tot)) %>%
  ungroup() %>%
  select(rgn_name, Ap, Ap_w_pop, Ap_w_tot_area, Ap_w_pop_and_tot_area, area_tot, population, total_inland_offshore_area, tourism_arrivals_ct) %>%
  arrange(desc(Ap)) # decide what to arrange by highest to lowest

# create table that has everything
kable(top_Ap_values_2021, digits = 5) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# create histogram of Ap
hist(top_Ap_values_2021$Ap, breaks = 100, main = "Distribution of Ap", xlab = "Ap")

# check where Ap values fall for these countries (should hopefully not be too low but may be unavoidable)
which(top_Ap_values_2021$rgn_name == "United States")
which(top_Ap_values_2021$rgn_name == "China")
which(top_Ap_values_2021$rgn_name == "India")
which(top_Ap_values_2021$rgn_name == "Russia")
which(top_Ap_values_2021$rgn_name == "Philippines")

# table of final (rescaled) Ap values for current version year
new_data_for_table_2021 <- new_data %>%
  left_join(rgns_eez, by = "rgn_id") %>%
  select(rgn_id, rgn_name, year, Ap) %>%
  arrange(desc(Ap)) %>%
  rename(Ap_rescaled = Ap) %>%
  filter(year == 2021) %>% 
  select(-year)

kable(new_data_for_table_2021, digits = 5) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# create histogram of Ap (rescaled)
hist(new_data_for_table_2021$Ap_rescaled, breaks = 100, main = "Distribution of Ap (rescaled)", xlab = "Ap (rescaled)")

# check where Ap (rescaled) values fall for these countries (should hopefully not be too low but may be unavoidable)
which(new_data_for_table_2021$rgn_name == "United States")
which(new_data_for_table_2021$rgn_name == "China")
which(new_data_for_table_2021$rgn_name == "India")
which(new_data_for_table_2021$rgn_name == "Russia")
which(new_data_for_table_2021$rgn_name == "Philippines")

# check some more countries (rescaled)
new_data_for_table_2021 %>%
  filter(rgn_id == 80 | rgn_id == 110 | rgn_id == 18 | rgn_id == 162)

which(new_data_for_table_2021$rgn_name == "Greece")
which(new_data_for_table_2021$rgn_name == "Bahamas")
which(new_data_for_table_2021$rgn_name == "Fiji")
which(new_data_for_table_2021$rgn_name == "New Zealand")

4 Ts: Tourism sustainability

These data are from the World Economic Forum’s “Travel and Tourism Development Index” (https://www.weforum.org/reports/travel-and-tourism-development-index-2021/downloads-510eb47e12). See mazu: _raw_data/WEF-Economics/ for more details and the raw data.

The TTDI replaced the similar TTCI, but unfortunately the two indices are not comparable. The TTDI only extends back to 2019.

These data are gapfilled using gdppcppp and UN georegion information (see next section for obtaining and preparing these data).

# update to latest file name
ttdi_file <- "WEF_TTDI_2021_data_for_download.xlsx"

ttdi_raw <- read_excel(paste0(dir_M, "/git-annex/globalprep/_raw_data/WEF-Economics/d", version_year, "/", ttdi_file),
                 skip = 2) 

# move up column names from first row while keeping the full country names as columns too
names(ttdi_raw)[1:9] <- as.character(ttdi_raw[1, 1:9])

# filtering for sustainability scores, selecting needed columns, and pivoting to tidy format
ttdi <- ttdi_raw %>%
  filter(Title == "T&T Sustainability subindex, 1-7 (best)",
         Attribute == "Score") %>% 
  select(year = Edition, Albania:Zambia) %>% 
  # currently Zambia is the last country column
  pivot_longer(cols = Albania:Zambia, names_to = "country",
               values_to = "score") %>% 
  mutate(score = as.numeric(score))


# Changing names that are not recognized by ohicore
ttdi <- ttdi %>%
    mutate(country = ifelse(str_detect(country, "Ivoire"), "Ivory Coast", country))
  
  
ttdi_rgn <- name_2_rgn(df_in = ttdi, 
                       fld_name='country')

## Duplicated regions weighted mean
weight_data <- data.frame(country = c("China", "Hong Kong SAR"),
                          # pop values from World Bank 2021 estimates - updated v2022
                          population = c(1412360000, 7413100))


ttdi_rgn <- ttdi_rgn %>%
  arrange(country) %>%
  left_join(weight_data, by = "country") %>%
  mutate(population = ifelse(is.na(population), 1, population)) %>%
  group_by(rgn_id, rgn_name, year) %>%
  summarize(score = weighted.mean(score, population)) %>%
  select(year, rgn_id, rgn_name, score)

# compare with old dataframe to make sure only the duplicated region scores changed 

head(ttdi_rgn, 10)

### Save TTDI data file
write.csv(ttdi_rgn, here(paste0("globalprep/tr/v", version_year, "/intermediate/wef_ttdi.csv")), row.names = FALSE)

4.1 Preparing the gdppcppp data:

These data are used to gapfill missing values in tourism sustainability. Most of the data are from the World Bank, but CIA data fill some gaps (CIA data is available for only the most recent year).

The Artisanal Opportunities goal uses gdppcppp data, so we will get the data that was processed for that goal.

wb <- read.csv(here(paste0("globalprep/ao/v", version_year, "/intermediate/gdppcppp_ohi.csv"))) %>%
  dplyr::select(rgn_id, year, value)

CIA data are used to fill gaps in the gdppcppp data (https://www.cia.gov/the-world-factbook/field/real-gdp-per-capita/country-comparison).

Downloaded: 07/05/2022

See README on the raw folder for instructions on how to download this data.

The following code is used to prepare these data for OHI:

cia_gdp <- read.csv(here(paste0("globalprep/tr/v", version_year, "/raw/cia_gdp_pc_ppp.csv"))) %>% 
  # remove dollar signs and commas and convert to numeric
  mutate(value = as.numeric(gsub("[$,]", "", value))) %>% 
  select(name, value) %>% 
  rename(country = name, pcgdp_cia = value)

 ## Data reported in a lower resolution than OHI regions
splits <- data.frame(country = "Saint Helena, Ascension, and Tristan da Cunha", 
                     country2 = c("Saint Helena", "Ascension","Tristan da Cunha"))

cia_gdp <- cia_gdp %>%
  left_join(splits, by='country') %>%
  mutate(country2 = ifelse(is.na(country2), country, country2)) %>%
  select(country = country2, pcgdp_cia)

cia_gdp_rgn <- name_2_rgn(df_in = cia_gdp, 
                       fld_name='country')

### Duplicated regions: Collapse regions after weighting by population (regions we include as a single region) - 

population_weights <- data.frame(country = c("Virgin Islands", "Puerto Rico",
                                             "China", "Hong Kong", "Macau",
                                             "Guam", "Northern Mariana Islands"),
                                 # from world bank - updated v2022
                                 population = c(105870, 3263584, 1412360000,
                                                7413100, 658391, 170184, 57910))

cia_gdp_rgn <- cia_gdp_rgn %>%
  left_join(population_weights, by="country") %>%
  mutate(population = ifelse(is.na(population), 1, population)) %>%
  group_by(rgn_id) %>%
  summarize(pcgdp_cia = weighted.mean(pcgdp_cia, population)) %>%
  ungroup() %>%
  filter(rgn_id <= 250) %>%
  select(rgn_id, pcgdp_cia)

write.csv(cia_gdp_rgn, here(paste0("globalprep/tr/v", version_year, "/intermediate/wb_rgn_cia_GDPPCPPP.csv")), row.names=FALSE)

The following code combines the two gdp datasets and gapfills missing regions using UN georegions.

If there is no World Bank gdppcppp data (pcgdp), the CIA data are used (pcgdp_cia). The pcgdp2 variable includes both the World Bank and CIA data (with CIA data only used when there is no World Bank data). The remaining data are estimated using UN geopolitical regions. Ideally, the mean gdppcppp value is calculated at the r2 scale (gdp_pred_r2) using regions within each class with gdppcppp data. If there are not enough regions with data at the r2 scale, the average at the r1 scale is used (gdp_pred_r1). The gdp_all variable combines all estimates using the following hierarchy: World Bank -> CIA -> estimated using mean from r2 UN geopolitical regions -> estimated using mean from r1 UN geopolitical regions.

### world bank gdp data
gdppcppp <- wb %>%
  select(rgn_id, year, pcgdp = value)

### cia gdp data
gdppcppp2 <- read.csv(here(paste0("globalprep/tr/v", version_year, "/intermediate/wb_rgn_cia_GDPPCPPP.csv")))


### Use WB data, but if missing, use pcgdp_cia.
### combine with UN georegion data
years <- data.frame(year = min(gdppcppp$year):max(gdppcppp$year))

georegions <- ohicore::georegions

regions <- georegions %>%
  left_join(georegion_labels, by = 'rgn_id')

gdp_raw <- merge(years, regions, by=NULL) %>%
   left_join(gdppcppp, by = c('rgn_id', 'year')) %>%
  left_join(gdppcppp2, by = c("rgn_id")) 

## quick compare to make sure the CIA and World Bank data are compatible
plot(gdp_raw$pcgdp[gdp_raw$year==2021], gdp_raw$pcgdp_cia[gdp_raw$year==2021])
abline(0,1, col="red")
# a few minor outliers but overall looks good

gdp_raw <- gdp_raw %>%
  mutate(pcgdp2 = ifelse(is.na(pcgdp), pcgdp_cia, pcgdp))

## Calculating the means across different geopolitical levels (e.g. r2, r1)
gdp_raw <- gdp_raw %>%
  group_by(r2, year) %>%
  mutate(gdp_pred_r2 = mean(pcgdp2, na.rm=TRUE)) %>%
  ungroup() %>%
  group_by(r1, year) %>%
  mutate(gdp_pred_r1 = mean(pcgdp2, na.rm=TRUE)) %>%
  ungroup() 

gdp_raw_gf <- gdp_raw %>%
  mutate(gdp_all = ifelse(is.na(pcgdp2), gdp_pred_r2, pcgdp2)) %>%
  mutate(gdp_all = ifelse(is.na(gdp_all), gdp_pred_r1, gdp_all)) %>%
  mutate(gapfilled = ifelse(is.na(pcgdp2) & !is.na(gdp_all), "gapfilled", NA)) %>%
  mutate(method = ifelse(is.na(pcgdp2) & !is.na(gdp_pred_r2), "UN georegion (r2)", NA)) %>%
  mutate(method = ifelse(is.na(pcgdp2) & is.na(gdp_pred_r2) & !is.na(gdp_pred_r1), "UN georegion (r1)", method)) 

write_csv(gdp_raw_gf, here(paste0("globalprep/tr/v", version_year, "/intermediate/gdp_raw_gf.csv")))

gdp_data_gf <- gdp_raw_gf %>%
  select(rgn_id, year, gapfilled, method) 

write_csv(gdp_data_gf, here(paste0("globalprep/tr/v", version_year, "/intermediate/gdp_gf.csv")))

gdp_data <- gdp_raw_gf %>%
  select(rgn_id, year, pcgdp = gdp_all)

write_csv(gdp_data, here(paste0("globalprep/tr/v", version_year, "/intermediate/gdp.csv")))

The final step is gapfilling the Sustainability data using a linear model with gdppcppp and UN geopolitical regions as predictor variables.

sust <- read.csv(here(paste0("globalprep/tr/v", version_year, "/intermediate/wef_ttdi.csv")), stringsAsFactors = FALSE)

### don't need to gapfill data without tourism data:
## Most recent arrivals (tourism) data is 2021.

ap_gf <- read.csv(here(paste0("globalprep/tr/v", version_year, "/output/tr_arrivals_props_tourism.csv"))) %>%
  # filter(year == 2021) %>%
  select(rgn_id, Ap, year) %>%
  filter(!is.na(Ap))

# gdp dataframe prepared above (World Bank, CIA, and gapfilled gdp data)
gdp_raw_gf <- read.csv(here(paste0("globalprep/tr/v", version_year, "/intermediate/gdp_raw_gf.csv")), stringsAsFactors = FALSE) %>% 
  # filter(year == 2021) %>%
  select(rgn_id, r0_label, r1_label, r2_label, rgn_label,
         pcgdp, pcgdp_cia, pcgdp2, gdp_all, year) 

tr_sust <- gdp_raw_gf %>%
           left_join(sust, by = c("rgn_id", "year")) %>%
          left_join(ap_gf, by = c("rgn_id", "year")) %>%
          rename(S_score = score) %>%
          filter(rgn_id != 213)

### Add gapfill flag variable 
## Reminder:
## pcgdp2: includes both the World Bank and CIA data (with CIA data only used if there is not World Bank data)
## Ap: Proportion of tourist arrivals to area of coastline to population
## S_score: tourism sustainability score

tr_sust_gf <- tr_sust %>%
  mutate(gapfilled = ifelse(is.na(S_score) & !is.na(Ap), "gapfilled", NA)) %>%
  mutate(method = ifelse(is.na(S_score) & !is.na(Ap) & is.na(pcgdp2), "lm georegion + gdppcppp, with est. gdppcppp", NA)) %>%
  mutate(method = ifelse(is.na(S_score) & !is.na(Ap) & !is.na(pcgdp2), "lm georegion + gdppcppp", method)) %>%
  select(rgn_id, gapfilled, method, year)

write.csv(tr_sust_gf, here(paste0("globalprep/tr/v", version_year, "/output/tr_sustainability_gf.csv")), row.names=FALSE)

4.1.1 Gapfilling

We gapfill the sustainability score with linear models that use gdppcppp and UN geopolitical regions as predictor variables. If a region has no gdppcppp data, we first estimate gdppcppp using the UN georegions and then use that estimate in the linear model to gapfill the sustainability score.

### Gapfill S using r1 and/or r2 regional data and PPP-adjusted per-capita GDP
### Looked at models with a year variable, but wasn't significant and decided to exclude

mod3 <- lm(S_score ~ as.factor(r2_label) + gdp_all, data=tr_sust, na.action = na.exclude)
summary(mod3)
anova(mod3)

mod4 <- lm(S_score ~ as.factor(r1_label) + gdp_all, data=tr_sust, na.action = na.exclude)
summary(mod4)
anova(mod4)

plot(predict(mod3), tr_sust$S_score)
abline(0,1)
plot(predict(mod4), tr_sust$S_score)
abline(0,1)


## Estimate missing data and gapfill
# Some of the r1 levels do not have data and consequently causes a fail. This chunk of code drops these levels so an NA is returned

# Select only r2 column
new_data <- tr_sust %>% 
  dplyr::select(r2_label, gdp_all)

unique(tr_sust$r2_label)

r2_w_data <- unique(tr_sust$r2_label[!is.na(tr_sust$S_score)])
  
new_data_r2 <- new_data %>%
  mutate(r2_label = ifelse(r2_label %in% r2_w_data, r2_label, NA))

# Predict sustainability scores using linear model 3 (using r2 data)
tr_sust <- tr_sust %>% 
  dplyr::mutate(S_score_pred_r2 = predict(mod3, newdata = new_data_r2))


# Select only r1 column
new_data <- tr_sust %>% 
  dplyr::select(r1_label, gdp_all)

unique(tr_sust$r1_label)

r1_w_data <- unique(tr_sust$r1_label[!is.na(tr_sust$S_score)])

new_data_r1 <- new_data %>%
  mutate(r1_label = ifelse(r1_label %in% r1_w_data, r1_label, NA))

# Predict sustainability scores using linear model 4 (using r1 data)
tr_sust <- tr_sust %>% 
  dplyr::mutate(S_score_pred_r1 = predict(mod4, newdata = new_data_r1))



## some are missing the r1 predictions, but none of these have Ap scores, so not relevant
View(filter(tr_sust, is.na(S_score_pred_r1)))

tr_sust <- tr_sust %>%
  mutate(S_score_2 = ifelse(is.na(S_score), S_score_pred_r2, S_score)) %>%
  mutate(S_score_2 = ifelse(is.na(S_score_2), S_score_pred_r1, S_score_2)) %>%
  filter(year %in% c(2019, 2021)) %>%
  select(rgn_id, year, S_score=S_score_2)

summary(tr_sust)

write_csv(tr_sust, here(paste0("globalprep/tr/v", version_year, "/output/tr_sustainability.csv")))

4.2 Compare with previous year of data

tr_sust <- read_csv(here(paste0("globalprep/tr/v", version_year, "/output/tr_sustainability.csv")))

prev_year <- (as.numeric(version_year) - 1) %>% 
  as.character()

compare <- tr_sust %>% 
  pivot_wider(names_from = year, values_from = S_score)

# current vs previous year of data
plot(compare$"2021", compare$"2019")
abline(0, 1, col="red")
# looks good

5 Tw: Travel warnings

  • Travel warnings were deleted from the v2020 assessment.