This document describes the steps for preparing the pathogen pollution and pathogen pollution trend data layers for the 2021 global assessment.
The percentage of the population with access to improved sanitation facilities (World Health Organization and United Nations Children’s Fund, Joint Monitoring Programme, 2011) was used in combination with measurements of coastal population as a proxy for pathogens in coastal waters. Access to improved sanitation facilities is defined as the percentage of the population in a country with at least adequate access to disposal facilities that can effectively prevent human, animal, and insect contact with excreta. These data are a country-wide average (not specific to the coastal region).
Data source (WHO-UNICEF) now reports sanitation data for three different sectors: households, schools, and health care facilities. We decided to use the household data as it is most likely to include the greatest number of citizens of each region, and be most comparable with previous datasets.
WHO-UNICEF also changed how they report the percentages of the population with access to basic sanitation. In previous years, some regions were reported with 100% of the population having basic access, which are now denoted as >99 in the raw data. We converted this value to 99.5% as there were other regions with 99%. The data now cover years 2000-2020, vs. last year’s assessment which had data for 2000-2017.
Updates on our end:
Consider for future assesments: make the “Safely managed” data more complete, and reconsider whether these data would be better to use.
Definition of each variable according to the data source: “At least basic”: Use of improved facilities that are not shared with other households.
“Safely managed”: Use of improved facilities that are not shared with other households and where excreta are safely disposed of in situ or transported and treated offsite
Reference: https://washdata.org/data Updated October 2021
Downloaded: Downloaded 10/01/2021
Description: Percentage of the National population that has access to improved facilities that are not shared with other households (National, At a basic level).
Access to improved sanitation facilities is defined as the percentage of the population within a country with at least adequate access to excreta disposal facilities that can effectively prevent human, animal, and insect contact with excreta.
Native data resolution: country-wide average (not specific to the coastal region)
Time range: 2000 - 2020
Format: csv
Percentage of people without sanitation was multiplied by the coastal population (within 25 miles of the coast) to estimate the number of people located along the coast without sanitation. This value was rescaled between 0 and 1 by dividing by the 99th quantile across all regions from years 2000 to 2009.
library(ohicore) #devtools::install_github('ohi-science/ohicore@dev')
library(zoo) # for na.locf: Last Observation Carried Forward
library(tidyverse)
library(here)
library(plotly)
source('https://raw.githubusercontent.com/OHI-Science/ohiprep_v2021/gh-pages/workflow/R/common.R')
# Load data from Mazu
<- read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WHO_UNICEFF/d2021/JMP_2021_WLD.csv"), header=FALSE, sep = ",", na.strings = c(NA, ''), stringsAsFactors = FALSE, strip.white = TRUE) sani_raw
Selecting and naming the columns of interest. Scale the percentage of population with access to improved sanitation to proportion (from 0-1). Transform population and percentage into a numeric variable.
<- sani_raw %>%
sani ::slice(-1) %>% # cut first row with column names
dplyr::select(country = V1,
dplyryear = V3,
pop = V4,
basic_pct = V6) %>%
::mutate(basic_pct = ifelse(basic_pct=="-", NA, basic_pct)) %>%
dplyr::mutate(pop = stringr::str_remove_all(pop, pattern = " ")) %>%
dplyr::mutate(basic_pct = ifelse(stringr::str_detect(basic_pct,">99"), 99.5, basic_pct)) %>%
dplyr::mutate_at(.vars = c("pop", "basic_pct", "year"), .funs = as.numeric)%>%
dplyr::mutate(pop = pop * 1000,
dplyrbasic_prop = basic_pct/100) %>%
::filter(!is.na(year)) dplyr
Change names of regions to match names in ohicore and filter out regions that are not part of the OHI assesment or do not have data.
If after running ‘name_2_rgn’ (see next r chunk), there are some coastal regions that are not identified by name_2rgn
function. They must be checked to determine how to best include them (or not include them).
<- sani %>%
sani_improved ::mutate(country = ifelse(stringr::str_detect(country,"Cura\\Sao"), "Curacao", country))%>%
dplyr::mutate(country = ifelse(stringr::str_detect(country,"R\\Sunion"), "Reunion", country)) %>%
dplyr::mutate(country = ifelse(stringr::str_detect(country, "C\\Ste d\\SIvoire"), "Ivory Coast", country)) %>%
dplyr::mutate(country=ifelse(stringr::str_detect(country,"Hong Kong"), "Hong Kong", country)) %>%
dplyr::mutate(country=ifelse(stringr::str_detect(country,"Macao"), "Macao", country)) %>%
dplyr::mutate(country=ifelse(stringr::str_detect(country, "Saint Martin"), "Northern Saint-Martin", country))
dplyr
## v2021: Reporting Caribbean Netherlands regions (Bonaire, Sint Eustatius, and Saba) at a higher resolution:
<- filter(sani_improved, country=="Caribbean Netherlands") %>%
CN rename(country_old = country)
<- data.frame(country_old = "Caribbean Netherlands",
CN_subregions country = c("Bonaire", "Sint Eustatius", "Saba")) %>%
left_join(CN) %>%
select(-country_old)
<- sani_improved %>%
sani_improved filter(country != "Caribbean Netherlands") %>%
rbind(CN_subregions)
# Channel Islands correspond to OHI regions of Guernsey and Jersey. Here the data reported for Channel Islands is used for these two regions.
<- dplyr::filter(sani_improved, country=="Channel Islands") %>%
CI ::rename(country_old = country)
dplyr<- data.frame(country_old = "Channel Islands",
CI_subregions country = c("Guernsey", "Jersey")) %>%
::left_join(CI) %>%
dplyr::select(-country_old)
dplyr
<- sani_improved %>%
sani_improved ::filter(country != "Channel Islands") %>%
dplyrrbind(CI_subregions)
Add rgn_id and merge duplicate regions using a mean weighted by population.
<- name_2_rgn(sani_improved,
rgn_sani fld_name = 'country',
flds_unique = c('country','year'))
## Check list in warning to make sure all countries removed are not of interest for the OHI global assesment.
# v2019 non-matched regions: Eswatini (landlocked), Isle of Man (not reported in OHI), North Macedonia (landlocked), State of Palestine (disputed), Saint Barthelemy (not reported in OHI)
##Check for duplicate regions.
sort(table(rgn_sani$rgn_id)) #for this analysis, regions which are duplicated are: rgn_id = 209, 140, 116, 13
## Each region has 18 years of data in this case. If there are regions with more than 18 values is because in this database they are reported in a higher resolution than the OHI global assesment (eg: China, Hong Kong and Macao are just one region for the OHI global). For these regions a weighted average will be calculated as its final score.
## Calculating weighted means for the duplicated regions.
<- rgn_sani %>%
sani_final ::group_by(rgn_id, rgn_name, year) %>%
dplyr::summarise(basic_sani_prop= weighted.mean(basic_prop, pop, na.rm = TRUE))
dplyr
##Check for duplicate regions. At this point, all regions should have the same sample size.
sort(table(sani_final$rgn_id))
First step is to get an idea of what needs to be gapfilled.
<- sani_final %>%
sani_gf ::group_by(rgn_id, rgn_name) %>%
dplyr::mutate(gf_count = sum(is.na(basic_sani_prop))) %>% # create column to count # of NAs (# of data points to gapfill)
dplyrungroup()
summary(sani_gf) # 161 NAs in v2019
# 212 NAs in v2021; makes sense, there are more years of data
## list of regions that need gapfilling
::filter(sani_gf, gf_count>0) %>%
dplyr::select(rgn_id, rgn_name, gf_count) %>%
dplyrunique() %>%
data.frame()
##Some regions have no data - we will filter them out to be gapfilled later.
##Define number of years of data
<- mean(table(sani_gf$rgn_id))
years_data
<- sani_gf %>%
sani_gf ::filter(gf_count != years_data)
dplyr
# Note: there are regions in this data frame that do not need to be gapfilled.
Use a linear model within country data to estimate missing years.
<- sani_gf %>%
sani_gf_lm ::group_by(rgn_id, rgn_name) %>%
dplyr::do({
dplyr<- lm(basic_sani_prop ~ year, data = .)
mod <- predict(mod, newdata = .[c('year')])
gf_lm data.frame(., gf_lm)
%>%
}) ::ungroup()
dplyr
<- sani_gf_lm %>%
sani_gf_lm mutate(gf_lm = ifelse(gf_lm > 1, 1, gf_lm)) %>% # constrain predictions to <=1
mutate(method = ifelse(is.na(basic_sani_prop), "lm prediction based on year", NA)) %>%
mutate(basic_sani_prop = ifelse(is.na(basic_sani_prop), gf_lm, basic_sani_prop))
Georegional gapfilling for regions that do not have data.
UNgeorgn()
<- UNgeorgn %>%
UNgeorgn ::select(rgn_id, rgn_label, r1=r1_label, r2=r2_label)
dplyr
<- min(sani_gf_lm$year):max(sani_gf_lm$year) #defines the year range
year
<- UNgeorgn %>%
sani_georgn_gf expand(year, UNgeorgn) %>%
::left_join(sani_gf_lm, by = c('rgn_id', 'year'))
dplyr
##Calculate two different gapfill columns using r2 and r1 UN geopolitical classification
<- sani_georgn_gf %>%
sani_georgn_gf ::group_by(year, r2) %>%
dplyr::mutate(basic_sani_r2 = mean(basic_sani_prop, na.rm=TRUE)) %>%
dplyr::ungroup() %>%
dplyr::group_by(year, r1) %>%
dplyr::mutate(basic_sani_r1 = mean(basic_sani_prop, na.rm=TRUE)) %>%
dplyr::ungroup()%>%
dplyr::arrange(rgn_id, year)
dplyr
##First gapfill with r2, if no value available use r1; create column indicating whether value was gapfilled and if so, by what method. Give NA to inhabited regions
<- sani_georgn_gf %>%
sani_georgn_gf ::mutate(method = ifelse(is.na(basic_sani_prop) & !is.na(basic_sani_r2), "UN georegion avg. (r2)", method)) %>%
dplyr::mutate(method = ifelse(is.na(basic_sani_prop) & is.na(basic_sani_r2) & !is.na(basic_sani_r1), "UN georegion avg (r1)", method))%>%
dplyr::mutate(basic_sani_prop = ifelse(is.na(basic_sani_prop) & !is.na(basic_sani_r2), basic_sani_r2, basic_sani_prop)) %>%
dplyr::mutate(basic_sani_prop = ifelse(is.na(basic_sani_prop) & !is.na(basic_sani_r1), basic_sani_r1, basic_sani_prop)) %>%
dplyr::select(rgn_id, rgn_label, year, basic_sani_prop, method)
dplyr
##See regions that have not been gapfilled.
::filter(sani_georgn_gf, is.na(basic_sani_prop)) %>%
dplyr::select(rgn_id, basic_sani_prop) %>%
dplyrunique() %>%
data.frame() #NA values for uninhabitated regions.
Uninhabited regions get a perfect score.
## Identify uninhabited regions
low_pop() # requires common.R to be loaded
<- low_pop %>%
low_pop ::filter(est_population < 100 | is.na(est_population))
dplyr
#Fill in all inhabited in rgn_inhab with perfect sanitation prop
<- sani_georgn_gf %>%
sani_complete ::mutate(basic_sani_prop = ifelse(rgn_id %in% low_pop$rgn_id, 1, basic_sani_prop)) %>%
dplyr::mutate(gapfill = ifelse(is.na(method), 0, 1)) %>%
dplyr::mutate(gapfill = ifelse(rgn_id %in% low_pop$rgn_id, 0, gapfill)) %>%
dplyr::mutate(method = ifelse(rgn_id %in% low_pop$rgn_id, "No est. human population", method)) %>%
dplyr::select(rgn_id, rgn_name = rgn_label, year, basic_sani_prop, gapfill, method)
dplyr
summary(sani_complete) # should be no more NAs
table(sani_complete$gapfill, sani_complete$method) # no pop is not considered gapfilled, should be N years * 20 regions
write_csv(sani_complete, here("globalprep/prs_cw_pathogen/v2021/intermediate/sani_complete.csv"))
# Quick check with v2019 data
<- read_csv(here("globalprep/prs_cw_pathogen/v2019/intermediate/sani_complete.csv")) %>%
sani_complete_old rename(basic_sani_prop_2019 = basic_sani_prop) %>%
rename(gapfill_2019 = gapfill) %>%
rename(method_2019 = method) %>%
left_join(sani_complete, by=c("rgn_id", "year", "rgn_name")) %>%
filter(year == 2017)
ggplotly(ggplot(sani_complete_old, aes(y = basic_sani_prop, x = basic_sani_prop_2019, labels = rgn_id)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red"))
# save gapfilling info
<- sani_complete %>%
gf_data ::select(rgn_id, year, gapfill, method)
dplyr
write_csv(gf_data, here("globalprep/prs_cw_pathogen/v2021/output/po_pathogen_popdensity25mi_gf.csv"))
<- sani_complete %>%
gf_data_trend ::arrange(rgn_id, year) %>%
dplyr::group_by(rgn_id) %>%
dplyr::mutate(gapfill_5year = rollsum(gapfill, 5, align="right", fill=NA)) %>%
dplyr::mutate(method = paste(na.exclude(unique(method)), collapse = ", ")) %>%
dplyr::mutate(gapfill = gapfill_5year/5) %>%
dplyr::select(rgn_id, year, gapfill, method)
dplyr
write_csv(gf_data, here("globalprep/prs_cw_pathogen/v2021/output/po_pathogen_popdensity25mi_trend_gf.csv"))
First calculate coastal population density (people/km2) is calculated by dividing the population within 25 miles of the coast by km2 within the 25 mile inland coastal area (yes! This is confusing because area is in km^2, despite the boundary being 25 miles inland).
## Population within 25 miles of coastline - v2021; hasn't been udpated for 2021, so use 2020 data
<- read_csv(here("globalprep/mar_prs_population/v2020/output/mar_pop_25mi.csv")) %>%
population ::arrange(rgn_id, year)
dplyr
#Read area 25mi inland to calculate population density
# (NOTE: this is confusing because it calculates the area in km2 for the 25 mile inland area)
<- read_csv(here("globalprep/mar_prs_population/v2020/int/area_km2_25mi.csv"))
area
# People per km2 (for the 25 mile inland area)
<- population %>%
pop_density ::left_join(area, by = 'rgn_id') %>%
dplyr::mutate(pop_per_km2 = popsum/area_km2) %>%
dplyr::select(rgn_id, year, pop_per_km2)
dplyr
##Save population density data
write_csv(pop_density, here("globalprep/prs_cw_pathogen/v2021/intermediate/pathogen_pop_density_25mi.csv"))
These data are transformed to a pressure, with a zero score indicating no pressure and 1 indicating the highest possible pressure. Given this we want to determine the number of people without access.
The number of people per km^2 without access to sanitation is calculated by:
<- sani_complete %>%
unsani_pop ::select(rgn_id, rgn_name, year, basic_sani_prop) %>%
dplyr::left_join(pop_density,
dplyrby=c('rgn_id', 'year')) %>%
::mutate(propWO_x_pop = (1 - basic_sani_prop) * pop_per_km2, # this calculates the population density of people without access (WO)
dplyrpropWO_x_pop_log = log(propWO_x_pop + 1)) # log is important because the skew was high otherwise
hist(unsani_pop$propWO_x_pop, main = "people without access")
hist(unsani_pop$propWO_x_pop_log, main = "log of people without access")
The reference point is the 99th quantile across all countries and years 2000-2009 as a reference point.
##Calculate reference point
<- unsani_pop %>%
ref_calc ::filter(year %in% 2000:2009) %>% #years of reference
dplyr##summarise(ref= max(propWO_x_pop_log, na.rm = TRUE)*1.1) %>% # old method
::summarise(ref= quantile(propWO_x_pop_log, probs=c(0.99), na.rm = TRUE)) %>%
dplyr$ref
.
ref_calc## save to the master reference point list - new folder might need to be created for assessment year if this file does not already exist.
<- read.csv(here("globalprep/supplementary_information/v2021/reference_points_pressures.csv"), stringsAsFactors = FALSE)
master_refs
$ref_point[master_refs$pressure == "Sanitation"] <- ref_calc
master_refs
write.csv(master_refs, here("globalprep/supplementary_information/v2021/reference_points_pressures.csv"), row.names=FALSE)
<- read.csv(here("globalprep/supplementary_information/v2021/reference_points_pressures.csv"))
master_refs <- as.numeric(as.character(master_refs$ref_point[master_refs$pressure == "Sanitation"]))
ref_value #7.10
ref_value
<- unsani_pop %>%
unsani_prs ::mutate(pressure_score = propWO_x_pop_log / ref_value) %>%
dplyr::mutate(pressure_score = ifelse(pressure_score>1, 1, pressure_score)) %>% #limits pressure scores not to be higher than 1
dplyr::select(rgn_id, year, pressure_score)
dplyr
summary(unsani_prs)
#Save data pressure scores
write_csv(unsani_prs, here("globalprep/prs_cw_pathogen/v2021/output/po_pathogen_popdensity25mi.csv"))
# Compare to v2018 data
<- read_csv(here("globalprep/prs_cw_pathogen/v2019/output/po_pathogen_popdensity25mi.csv")) %>%
unsani_prs_old rename(pressure_score_2019 = pressure_score) %>%
left_join(unsani_prs, by=c("rgn_id", "year")) %>%
filter(year == 2015)
filter(unsani_prs_old, rgn_id %in% c(185, 208))
ggplotly(ggplot(unsani_prs_old, aes(y = pressure_score, x = pressure_score_2019, labels = rgn_id)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red"))
Using CalculateTrend
function form the ohicore
, trend is calculated by applying a linear regression model to the pressuere scores using a window of 5 years of data. The solope of the linear regression (annual change in pressure) is then divided by the earliest year to get proportional change and then multiplied by 5 to get estimate trend on pressure in the next five years.
##Calculate trend using CalculateTrend()
##Define relevant years: Min and max year of data to calculate trend
<- 2012
first_assess_year <- 2021
current_assess_year <- max(unsani_prs$year) ##max year of data
current_data_year <- first_assess_year - (current_assess_year - current_data_year)
first_data_year
<- data.frame() #create a data.frame to save trend socores
trend_data
##For loop: calculates trend for all assess years within the corresponding 5 year window.
#focal_year is the year for which the trend is being calculated.
for(focal_year in first_data_year:current_data_year){ #focal_year = 2017
<- (focal_year-4):focal_year #defines the 5 years window to calculate trend
trend_years
<- unsani_prs %>% #format data to work in CalculateTrend()
data_new select(rgn_id, year, status=pressure_score)
<- CalculateTrend(data_new, trend_years)
trend_data_focal_year
<- trend_data_focal_year %>%
trend_data_focal_year mutate(year = focal_year) %>%
select(rgn_id = region_id, year, trend=score) %>%
data.frame()
<- rbind(trend_data, trend_data_focal_year) #add trend calculation to dataframe crearted outside the loop
trend_data
}summary(trend_data)
##Save trend data
write_csv(trend_data, here("globalprep/prs_cw_pathogen/v2021/output/po_pathogen_popdensity25mi_trend.csv"))
<- read_csv(here("globalprep/prs_cw_pathogen/v2021/output/po_pathogen_popdensity25mi_trend.csv"))
new
<- read_csv(here("globalprep/prs_cw_pathogen/v2019/output/po_pathogen_popdensity25mi_trend.csv")) %>%
compare select(rgn_id, year, trend_2019 = trend) %>%
left_join(new, by=c('rgn_id', 'year')) %>%
filter(year == 2017)
ggplotly(ggplot(compare, aes(y = trend, x = trend_2019, labels = rgn_id)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red"))
# A few regions have large variation in trend between v2021 and v2019
# positive outliers: rgn_id 161, rgn_id 154
# negative outliers: 48, 8
# Outlier investigation
<- read_csv(here("globalprep/prs_cw_pathogen/v2019/output/po_pathogen_popdensity25mi.csv")) %>%
data_old rename(old_pressure_score=pressure_score)
<- read_csv(here("globalprep/prs_cw_pathogen/v2021/output/po_pathogen_popdensity25mi.csv"))
data_new
<- data_new %>%
outlier left_join(data_old, by=c("rgn_id", "year")) %>%
filter(rgn_id == 161)
plot(outlier$pressure_score, outlier$old_pressure_score)
abline(0,1, col="red")
We checked the main discrepancies and these were due to changes in source data. Fairly small changes in access can lead to fairly large changes in pressure scores, depending on the population. There are slightly higher pressure scores this year (indicated by points tending to be above the 1-1 red line) due to modifications of reference point calculations.
### Comparison of basic access to sanitation scores
<- read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WHO_UNICEFF/d2021/JMP_2021_WLD.csv"), header=FALSE, sep = ",", na.strings = c(NA, ''), stringsAsFactors = FALSE, strip.white = TRUE)
sani_raw
<- read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WHO_UNICEFF/d2019/JMP_2019_WLD.csv"), header=FALSE, sep = ",", na.strings = c(NA, ''), stringsAsFactors = FALSE, strip.white = TRUE)
sani_raw_old
<- sani_raw_old %>%
sani_old ::slice(-1) %>% # cut first row with column names
dplyr::select(country = V1,
dplyryear = V3,
pop = V4,
basic_pct = V6) %>%
::mutate(basic_pct = ifelse(basic_pct=="-", NA, basic_pct)) %>%
dplyr::mutate(pop = stringr::str_remove_all(pop, pattern = " ")) %>%
dplyr::mutate(basic_pct = ifelse(stringr::str_detect(basic_pct,">99"), 99.5, basic_pct)) %>%
dplyr::mutate_at(.vars = c("pop", "basic_pct", "year"), .funs = as.numeric)%>%
dplyr::mutate(pop = pop * 1000,
dplyrbasic_prop = basic_pct/100) %>%
::filter(!is.na(year))
dplyr
<- sani_old %>%
sani_old_outliers filter(country == "Palau" | country == "Oman" | country == "Niue" | country == "Wallis and Futuna") %>%
rename(pop_2019 = pop, basic_pct_2019 = basic_pct, basic_prop_2019 = basic_prop)
<- sani %>%
sani_compare filter(country == "Palau" | country == "Oman" | country == "Niue" | country == "Wallis and Futuna") %>%
left_join(sani_old_outliers, by=c("country", "year")) %>%
select(country, year, pop, pop_2019, basic_pct, basic_pct_2019) # yep, trends changes in raw data
### Comparison of pressure scores
<- read_csv(here("globalprep/prs_cw_pathogen/v2019/output/po_pathogen_popdensity25mi.csv")) %>%
unsani_prs_old rename(pressure_score_2019 = pressure_score) %>%
left_join(unsani_prs, by=c("rgn_id", "year")) %>%
filter(rgn_id %in% c(8, 48, 154, 161))
<- read_csv(here("globalprep/prs_cw_pathogen/v2019/output/po_pathogen_popdensity25mi.csv")) %>%
unsani_prs_old_all rename(pressure_score_2019 = pressure_score) %>%
left_join(unsani_prs, by=c("rgn_id", "year")) %>%
mutate(diff = pressure_score-pressure_score_2019)