
Summary

This document describes the steps for preparing the pathogen pollution and pathogen pollution trend data layers for the 2019 global assessment.

The percentage of the population with access to improved sanitation facilities (World Health Organization and United Nations Children’s Fund, Joint Monitoring Programme, 2011) was used in combination with measurements of coastal population as a proxy for pathogens in coastal waters. Access to improved sanitation facilities is defined as the percentage of the population in a country with at least adequate access to disposal facilities that can effectively prevent human, animal, and insect contact with excreta. These data are a country-wide average (not specific to the coastal region).

Updates from previous assessment

The data source (WHO-UNICEF) now reports sanitation data for three different sectors: households, schools, and health care facilities. We decided to use the household data because they are most likely to cover the greatest number of citizens in each region and to be most comparable with previous datasets.

WHO-UNICEF also changed how they report the percentages of the population with access to basic sanitation. In previous years, some regions were reported with 100% of the population having basic access, which are now denoted as >99 in the raw data. We converted this value to 99.5% as there were other regions with 99%. The data now cover years 2000-2017, vs. last year’s assessment which had data for 2000-2015.

Updates on our end:

  1. We changed the reporting of the Caribbean Netherlands regions to be at a higher resolution, as historical sanitation data for them have been backfilled by WHO-UNICEF as of this year.
  2. We also changed the way uninhabited regions (which we assign perfect scores in this layer) are identified. The code was changed to match the methods for other goals (AO, etc), which use the low_pop() function in common.R and filter out low population/uninhabited regions.

Consider for future assessments: make the “Safely managed” data more complete, and reconsider whether these data would be better to use.

Definition of each variable according to the data source:

“At least basic”: Use of improved facilities that are not shared with other households.

“Safely managed”: Use of improved facilities that are not shared with other households and where excreta are safely disposed of in situ or transported and treated offsite.


Data Source

Reference: https://washdata.org/data Updated July 2019

Downloaded: 8/7/2019

Description: Percentage of the National population that has access to improved facilities that are not shared with other households (National, At a basic level).

Access to improved sanitation facilities is defined as the percentage of the population within a country with at least adequate access to excreta disposal facilities that can effectively prevent human, animal, and insect contact with excreta.

Native data resolution: country-wide average (not specific to the coastal region)

Time range: 2000 - 2017

Format: csv


Methods

The proportion of people without access to basic sanitation was multiplied by the coastal population density (people per km^2 within 25 miles of the coast) and log-transformed (ln(x + 1)) to estimate the number of people located along the coast without sanitation. This value was rescaled to between 0 and 1 by dividing by the 99th quantile across all regions and years 2000 to 2009, with values above 1 capped at 1.



Data wrangling

Select and name the columns of interest. Scale the percentage of the population with access to improved sanitation to a proportion (0-1). Convert population and percentage to numeric variables.
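The wrangling steps above can be sketched on a toy table. This is only an illustration in base R: the rows are invented, and the conventions it relies on (">99" converted to 99.5, "-" treated as missing, population reported in thousands with space separators) are taken from the comparison code later in this document. The column name `basic_sani_prop` matches the gapfilling code; the data frame names `raw` and `wrangled` are hypothetical.

```r
# Hypothetical raw rows, shaped like the source csv after column selection.
raw <- data.frame(
  country   = c("A", "B", "C"),
  year      = c("2017", "2017", "2017"),
  pop       = c("1 250", "300", "45"),   # thousands, with space separators
  basic_pct = c("87", ">99", "-"),       # ">99" and "-" are special codes
  stringsAsFactors = FALSE
)

wrangled <- raw
wrangled$basic_pct[wrangled$basic_pct == "-"] <- NA             # "-" marks missing data
wrangled$basic_pct[grepl(">99", wrangled$basic_pct)] <- "99.5"  # ">99" becomes 99.5

# Convert to numeric, then rescale population and percentage.
wrangled$pop       <- as.numeric(gsub(" ", "", wrangled$pop)) * 1000
wrangled$basic_pct <- as.numeric(wrangled$basic_pct)
wrangled$year      <- as.numeric(wrangled$year)
wrangled$basic_sani_prop <- wrangled$basic_pct / 100            # proportion, 0-1

wrangled
```

Replacing the special codes before calling `as.numeric()` avoids coercion warnings and keeps the missing values explicit.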

Change names of regions to match names in ohicore and filter out regions that are not part of the OHI assessment or do not have data.

If, after running ‘name_2_rgn’ (see next r chunk), some coastal regions are not identified by the function, they must be checked to determine how best to include them (or whether to exclude them).

Add rgn_id and merge duplicate regions using a mean weighted by population.
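A minimal base R sketch of merging duplicate regions with a population-weighted mean follows. The data are invented, and the names `dat` and `merged` are hypothetical; it is meant only to show the arithmetic, not to reproduce the assessment code.

```r
# Hypothetical rows: two reporting units map to the same OHI region (rgn_id 1).
dat <- data.frame(
  rgn_id = c(1, 1, 2),
  year   = c(2017, 2017, 2017),
  pop    = c(1000, 3000, 500),
  basic_sani_prop = c(0.80, 0.40, 0.90)
)

# One group per region-year; each group collapses to a single row.
groups <- split(dat, list(dat$rgn_id, dat$year), drop = TRUE)
merged <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    rgn_id = g$rgn_id[1],
    year   = g$year[1],
    # Access proportion weighted by each unit's population.
    basic_sani_prop = weighted.mean(g$basic_sani_prop, g$pop, na.rm = TRUE),
    pop    = sum(g$pop, na.rm = TRUE)
  )
}))
rownames(merged) <- NULL

merged
```

For region 1 the result is (0.80 * 1000 + 0.40 * 3000) / 4000 = 0.5, so the larger unit dominates. Note that `na.rm = TRUE` in `weighted.mean()` drops missing values of the proportion only; a missing population weight would still produce `NA`.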

Gapfilling

First step is to get an idea of what needs to be gapfilled.
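One way to get that overview is to count missing years per region. The sketch below uses base R and invented data; `sani` is assumed to hold one row per region and year with a `basic_sani_prop` column, and `gap_summary` and the `status` labels are hypothetical.

```r
# Hypothetical data: region 1 is complete, region 2 has a gap, region 3 has no data.
sani <- data.frame(
  rgn_id = rep(c(1, 2, 3), each = 3),
  year   = rep(2015:2017, times = 3),
  basic_sani_prop = c(0.90, 0.91, 0.92,  0.50, NA, 0.60,  NA, NA, NA)
)

# Count missing values and total years for each region.
n_missing <- tapply(is.na(sani$basic_sani_prop), sani$rgn_id, sum)
n_years   <- tapply(sani$year, sani$rgn_id, length)

gap_summary <- data.frame(
  rgn_id    = as.numeric(names(n_missing)),
  n_missing = as.vector(n_missing),
  n_years   = as.vector(n_years)
)

# Label each region by how much of its time series is missing.
gap_summary$status <- ifelse(
  gap_summary$n_missing == 0, "complete",
  ifelse(gap_summary$n_missing == gap_summary$n_years, "no data", "partial")
)

gap_summary
```

Regions labelled "no data" have nothing to build on within their own series, which is where the georegional averages described next come in.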

Gapfilling 2: Georegional averages

Georegional gapfilling for regions that do not have data.

UNgeorgn()
UNgeorgn <- UNgeorgn %>%
  dplyr::select(rgn_id, rgn_label, r1=r1_label, r2=r2_label)


year <- min(sani_gf_lm$year):max(sani_gf_lm$year) #defines the year range

sani_georgn_gf <- UNgeorgn %>%
  expand(year, UNgeorgn) %>%
  dplyr::left_join(sani_gf_lm, by = c('rgn_id', 'year'))


##Calculate two different gapfill columns using r2 and r1 UN geopolitical classification
sani_georgn_gf <- sani_georgn_gf %>%
  dplyr::group_by(year, r2) %>%
  dplyr::mutate(basic_sani_r2 = mean(basic_sani_prop, na.rm=TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::group_by(year, r1) %>%
  dplyr::mutate(basic_sani_r1 = mean(basic_sani_prop, na.rm=TRUE)) %>%
  dplyr::ungroup()%>%
  dplyr::arrange(rgn_id, year)


##First gapfill with r2, if no value available use r1; create column indicating whether value was gapfilled and if so, by what method. Give NA to inhabited regions
sani_georgn_gf <- sani_georgn_gf %>%
  dplyr::mutate(method = ifelse(is.na(basic_sani_prop) & !is.na(basic_sani_r2), "UN georegion avg. (r2)", method)) %>%
  dplyr::mutate(method = ifelse(is.na(basic_sani_prop) & is.na(basic_sani_r2) & !is.na(basic_sani_r1), "UN georegion avg (r1)", method))%>%
  dplyr::mutate(basic_sani_prop = ifelse(is.na(basic_sani_prop) & !is.na(basic_sani_r2), basic_sani_r2, basic_sani_prop)) %>%
  dplyr::mutate(basic_sani_prop = ifelse(is.na(basic_sani_prop) & !is.na(basic_sani_r1), basic_sani_r1, basic_sani_prop)) %>%
  dplyr::select(rgn_id, rgn_label, year, basic_sani_prop, method)

##See regions that have not been gapfilled. 
dplyr::filter(sani_georgn_gf, is.na(basic_sani_prop)) %>% 
  dplyr::select(rgn_id, basic_sani_prop) %>% 
  unique() %>%
  data.frame() # NA values for inhabited regions

Standardizing sanitation data by population density

First, coastal population density (people/km^2) is calculated by dividing the population within 25 miles of the coast by the area (in km^2) of the 25-mile inland coastal zone. Note the mixed units: area is in km^2 even though the boundary is 25 miles inland.

These data are transformed to a pressure, with a score of 0 indicating no pressure and 1 indicating the highest possible pressure. Because higher values must indicate more pressure, we work with the number of people without access rather than with access.

The number of people per km^2 without access to sanitation is calculated by:

  1. Converting the proportion with access to sanitation to the proportion without access (i.e., 1 - proportion_with_access).
  2. Multiplying the proportion without access by the coastal population density.
  3. Log-transforming the resulting number of people without access per km^2 (ln(x + 1)).
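The three steps can be sketched as follows, using invented values. The column name `propWO_x_pop_log` matches the reference point code in the next section; `pop_per_km2`, `propWO`, and `propWO_x_pop` are assumed names used only for this illustration.

```r
# Hypothetical inputs: proportion with basic sanitation and coastal density.
dat <- data.frame(
  rgn_id          = c(1, 2, 3),
  basic_sani_prop = c(1.0, 0.5, 0.9),
  pop_per_km2     = c(200, 100, 0)    # region 3 has no coastal population
)

# Step 1: proportion without access.
dat$propWO <- 1 - dat$basic_sani_prop

# Step 2: people per km^2 without access.
dat$propWO_x_pop <- dat$propWO * dat$pop_per_km2

# Step 3: log transform; adding 1 keeps zero values at zero.
dat$propWO_x_pop_log <- log(dat$propWO_x_pop + 1)

dat
```

Regions with full access (region 1) or no coastal population (region 3) both end up at 0, so they exert no pressure regardless of the reference point chosen later.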

Pressure Score

The reference point is the 99th quantile of the log-transformed values across all countries and years 2000-2009.

##Calculate reference point
ref_calc <- unsani_pop %>% 
  dplyr::filter(year %in% 2000:2009) %>% #years of reference
  ##summarise(ref= max(propWO_x_pop_log, na.rm = TRUE)*1.1) %>%  # old method
  dplyr::summarise(ref= quantile(propWO_x_pop_log, probs=c(0.99), na.rm = TRUE)) %>% 
  .$ref

ref_calc
## save to the master reference point list - new folder might need to be created for assessment year if this file does not already exist.
master_refs <- read.csv(here("globalprep/supplementary_information/v2018/reference_points_pressures.csv"), stringsAsFactors = FALSE)

master_refs$ref_point[master_refs$pressure == "Sanitation"] <- ref_calc

write.csv(master_refs, here("globalprep/supplementary_information/v2019/reference_points_pressures.csv"), row.names=FALSE)

master_refs <- read.csv(here("globalprep/supplementary_information/v2019/reference_points_pressures.csv")) 
ref_value <- as.numeric(as.character(master_refs$ref_point[master_refs$pressure == "Sanitation"])) 
ref_value #7.10

unsani_prs <- unsani_pop %>%
  dplyr::mutate(pressure_score = propWO_x_pop_log / ref_value) %>% 
  dplyr::mutate(pressure_score = ifelse(pressure_score>1, 1, pressure_score)) %>% #limits pressure scores not to be higher than 1
  dplyr::select(rgn_id, year, pressure_score) 

summary(unsani_prs)

#Save data pressure scores 
write_csv(unsani_prs, here("globalprep/prs_cw_pathogen/v2019/output/po_pathogen_popdensity25mi.csv"))

# Compare to v2018 data

unsani_prs_old <- read_csv(here("globalprep/prs_cw_pathogen/v2018/output/po_pathogen_popdensity25mi.csv")) %>%
  rename(pressure_score_2018 = pressure_score) %>% 
  left_join(unsani_prs, by=c("rgn_id", "year")) %>%
  filter(year == 2015)

filter(unsani_prs_old, rgn_id %in% c(185, 208))

  ggplotly(ggplot(unsani_prs_old, aes(y = pressure_score, x = pressure_score_2018, labels = rgn_id)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red"))

Model Trend

Using the CalculateTrend function from ohicore, the trend is calculated by fitting a linear regression model to the pressure scores over a window of 5 years of data. The slope of the linear regression (annual change in pressure) is then divided by the pressure score in the earliest year of the window to get the proportional change, and multiplied by 5 to estimate the trend in pressure over the next five years.
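The arithmetic described above can be illustrated with a small base R function. This is a sketch of the description only, not the ohicore implementation: `trend_sketch` is a hypothetical name, the scores are invented, and any additional handling that CalculateTrend may perform is not reproduced here.

```r
# Hypothetical helper: proportional five-year trend from a window of scores.
trend_sketch <- function(years, scores) {
  fit   <- lm(scores ~ years)              # linear regression over the window
  slope <- unname(coef(fit)["years"])      # annual change in pressure
  first <- scores[which.min(years)]        # score in the earliest year
  slope / first * 5                        # proportional change, projected 5 years
}

# Invented five-year window with pressure falling by 0.02 per year.
yrs <- 2013:2017
prs <- c(0.50, 0.48, 0.46, 0.44, 0.42)

trend_sketch(yrs, prs)   # approximately -0.2
```

Here the slope is -0.02 per year and the earliest score is 0.50, giving -0.04 per year proportionally and -0.2 over five years. A region whose earliest score is 0 would need separate handling, since the division is undefined.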

Compare to previous years

Compare results

We checked the main discrepancies and these were due to changes in source data. Fairly small changes in access can lead to fairly large changes in pressure scores, depending on the population. There are slightly higher pressure scores this year (indicated by points tending to fall above the red 1:1 line) due to modifications of reference point calculations.

Outlier exploration

### Comparison of basic access to sanitation scores 
sani_raw <- read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WHO_UNICEFF/d2019/JMP_2019_WLD.csv"), header=FALSE, sep = ",", na.strings = c(NA, ''), stringsAsFactors = FALSE, strip.white = TRUE)

sani_raw_old <- read.csv(file.path(dir_M, "git-annex/globalprep/_raw_data/WHO_UNICEFF/d2018/JMP_2017_WLD.csv"), header=FALSE, sep = ",", na.strings = c(NA, ''), stringsAsFactors = FALSE, strip.white = TRUE)

sani_old <- sani_raw_old %>%
  dplyr::slice(-1) %>%                # cut first row with column names
  dplyr::select(country = V1,
         year = V3,
         pop = V4,
         basic_pct = V6) %>%
  dplyr::mutate(basic_pct = ifelse(basic_pct=="-", NA, basic_pct)) %>% 
  dplyr::mutate(pop = stringr::str_remove_all(pop, pattern = " ")) %>%
  dplyr::mutate(basic_pct = ifelse(stringr::str_detect(basic_pct,">99"), 99.5, basic_pct)) %>%
  dplyr::mutate_at(.vars = c("pop", "basic_pct", "year"), .funs = as.numeric)%>%
  dplyr::mutate(pop = pop * 1000,
          basic_prop = basic_pct/100) %>%
  dplyr::filter(!is.na(year))

# Countries flagged as outliers in the comparison with v2018
outlier_countries <- c("Maldives", "Nigeria", "Bangladesh", "Singapore", "Tuvalu", "Russian Federation", "Marshall Islands", "Nauru")

sani_old_outliers <- sani_old %>% 
  filter(country %in% outlier_countries) %>% 
  rename(pop_2018 = pop, basic_pct_2018 = basic_pct, basic_prop_2018 = basic_prop)

sani_compare <- sani %>%
  filter(country %in% outlier_countries) %>% 
  left_join(sani_old_outliers, by=c("country", "year")) %>% 
  select(country, year, pop, pop_2018, basic_pct, basic_pct_2018)

### Comparison of pressure scores

unsani_prs_old <- read_csv(here("globalprep/prs_cw_pathogen/v2018/output/po_pathogen_popdensity25mi.csv")) %>%
  rename(pressure_score_2018 = pressure_score) %>% 
  left_join(unsani_prs, by=c("rgn_id", "year")) %>% 
  filter(rgn_id %in% c(39, 196, 204, 208, 19, 73, 11, 10))


unsani_prs_old_all <- read_csv(here("globalprep/prs_cw_pathogen/v2018/output/po_pathogen_popdensity25mi.csv")) %>%
  rename(pressure_score_2018 = pressure_score) %>% 
  left_join(unsani_prs, by=c("rgn_id", "year")) %>% 
  mutate(diff = pressure_score-pressure_score_2018)