Summary
This script gapfills the Social Progress Index (SPI) data and formats it for the OHI global assessment.
Updates
Now 10 years of data included in SPI. Created updated spi_categories.csv in Mazu. See the 2020 Methodology Report cited below for a detailed description of the SPI changes from 2018 to 2019. This is saved in Mazu and can also be downloaded here.
Citation: http://www.socialprogress.org/
Stern, S., A. Wares and T. Epner. 2020. Social Progress Index: 2019 Methodology Report.
Source information: https://www.socialprogress.org/download https://www.socialprogress.org/index/global/methodology
Date Downloaded: 09/10/2020
Time range: 2011-2020
Native data resolution: country scores
Format: Excel file
Description: Social Progress Index scores and components for countries.
# load libraries, set directories
library(ohicore) #devtools::install_github('ohi-science/ohicore@dev')
library(dplyr)
library(stringr)
library(tidyr)
library(Hmisc)
library(here)
library(validate)
library(tidyverse)
library(readxl)
## comment out when knitting
setwd(here::here("globalprep","prs_res_spi","v2020"))
### Load FAO-specific user-defined functions
source('../../../workflow/R/common.R') # directory locations
## This file makes it easier to process data for the OHI global assessment
## by creating the following objects:
##
## * dir_M = identifies correct file path to Mazu (internal server) based on your operating system
## * mollCRS = the crs code for the mollweide coordinate reference system we use in the global assessment
## * regions_shape() = function to load global shapefile for land/eez/high seas/antarctica regions
## * ohi_rasters() = function to load two rasters: global eez regions and ocean region
## * region_data() = function to load 2 dataframes describing global regions
## * rgn_syns() = function to load dataframe of region synonyms (used to convert country names to OHI regions)
## * low_pop() = function to load dataframe of regions with low and no human population
## * UNgeorgn = function to load dataframe of UN geopolitical designations used to gapfill missing data
Social Progress Index data
Organize data and gapfill missing countries that have incomplete data. This index is comprised of 3 indicators, which are each comprised of 4 subindicators, which are comprised of several datasets. If one of the subindicators are missing, the SPI is not calculated. The first round of gapfilling involves using relationships between the the subindicators to gapfill missing data. If a region is missing all subindicator data, then a second round of gapfilling is performed using relationships between UN geopolitical regions and the World Governance Indicator to gapfill the SPI score.
The following gets all years of data.
cats <- read.csv(file.path(dir_M, 'git-annex/globalprep/_raw_data/SocialProgressIndex/d2020/spi_categories.csv')) %>%
mutate(subcategory = gsub(" ", "", subcategory))
spi_raw <- read_excel(file.path(dir_M, "git-annex/globalprep/_raw_data/SocialProgressIndex/d2020/2011-2020-Social-Progress-Index.xlsx"), sheet = 2)
names(spi_raw) <- gsub(" ", "", names(spi_raw))
all_spi <- spi_raw %>%
dplyr::select(-`SPIRank`, -`SPIcountrycode`, -Status) %>%
dplyr::filter(Country != "World") %>%
gather('subcategory', 'score', -Country, -`SPIyear`) %>%
dplyr::filter(subcategory %in% cats$subcategory) %>%
left_join(cats, by = "subcategory") %>%
rename("year" = "SPIyear")
Gapfilling: Step 1
In this case, we use relationships between the subindicators to estimate missing data.
Gapfill Basic Human Need (bhn) indicator
set.seed(227)
bhn_subs <- all_spi %>%
dplyr::filter(category %in% c("bhn")) %>%
spread(subcategory, score)
bhn_tmp <- all_spi %>%
dplyr::filter(category %in% c("bhn_score")) %>%
select(Country, year, bhn_score=score) %>%
left_join(bhn_subs, by=c("Country", "year"))
bhn_tmp <- bhn_tmp %>%
rowwise() %>%
mutate(NA_tot = sum(is.na(NutritionandBasicMedicalCare), is.na(PersonalSafety), is.na(Shelter),
is.na(WaterandSanitation)))
## Ideally most NA_tot values are 0
hist(bhn_tmp$NA_tot)
##
## 0 1 2 4
## 1650 127 60 90
#library(mice)
# md.pattern(select(bhn_tmp, -(1:4))) # mice package
imputes <- 50
bhn_gf <- aregImpute(~ WaterandSanitation + Shelter + NutritionandBasicMedicalCare + PersonalSafety,
data = bhn_tmp, type = "regression", n.impute = imputes)
## Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
Iteration 31
Iteration 32
Iteration 33
Iteration 34
Iteration 35
Iteration 36
Iteration 37
Iteration 38
Iteration 39
Iteration 40
Iteration 41
Iteration 42
Iteration 43
Iteration 44
Iteration 45
Iteration 46
Iteration 47
Iteration 48
Iteration 49
Iteration 50
Iteration 51
Iteration 52
Iteration 53
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~WaterandSanitation + Shelter + NutritionandBasicMedicalCare +
## PersonalSafety, data = bhn_tmp, n.impute = imputes, type = "regression")
##
## n: 1927 p: 4 Imputations: 50 nk: 3
##
## Number of NAs:
## WaterandSanitation Shelter
## 90 100
## NutritionandBasicMedicalCare PersonalSafety
## 150 267
##
## type d.f.
## WaterandSanitation s 2
## Shelter s 2
## NutritionandBasicMedicalCare s 2
## PersonalSafety s 1
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## WaterandSanitation Shelter
## 0.922 0.888
## NutritionandBasicMedicalCare PersonalSafety
## 0.930 0.531
# to get mean and sd of all imputations
impute_scores_all <- data.frame()
for (imp in 1:imputes){
#imp = 1
imputed <- impute.transcan(bhn_gf, imputation=imp, data=bhn_tmp, list.out=TRUE,
pr=FALSE, check=FALSE)
subcat_data <- data.frame(imputed)
impute_scores <- data.frame(Country = bhn_tmp$Country,
year = bhn_tmp$year,
imputation = imp)
impute_scores <- cbind(impute_scores, subcat_data)
impute_scores_all <- rbind(impute_scores_all, impute_scores)
}
## Convert class of value columns from `impute` to `numeric` to avoid "Warning message: attributes are not identical across measure variables; they will be dropped" when gathering columns
impute_scores_all$WaterandSanitation <- as.numeric(impute_scores_all$WaterandSanitation)
impute_scores_all$Shelter <- as.numeric(impute_scores_all$Shelter)
impute_scores_all$NutritionandBasicMedicalCare <- as.numeric(impute_scores_all$NutritionandBasicMedicalCare)
impute_scores_all$PersonalSafety <- as.numeric(impute_scores_all$PersonalSafety)
impute_scores_summary <- impute_scores_all %>%
gather("subcategory", "score", -(1:3)) %>%
group_by(Country, year, subcategory) %>%
dplyr::summarize(score_predict = mean(score),
sd_score_predict = sd(score)) %>%
ungroup()
## `summarise()` regrouping output by 'Country', 'year' (override with `.groups` argument)
bhn_tmp_long <- bhn_tmp %>%
select(Country, year, bhn_score, NA_tot, NutritionandBasicMedicalCare,
PersonalSafety, Shelter, WaterandSanitation) %>%
gather("subcategory", "score", -(1:4)) %>%
left_join(impute_scores_summary, by=c("Country", "year", "subcategory"))
## Should be no NAs in column sd_score_predict or score_predict
summary(bhn_tmp_long)
## Country year bhn_score NA_tot
## Length:7708 Min. :2011 Min. :15.39 Min. :0.000
## Class :character 1st Qu.:2013 1st Qu.:57.77 1st Qu.:0.000
## Mode :character Median :2016 Median :81.86 Median :0.000
## Mean :2016 Mean :74.44 Mean :0.315
## 3rd Qu.:2018 3rd Qu.:89.80 3rd Qu.:0.000
## Max. :2020 Max. :98.50 Max. :4.000
## NA's :1108
## subcategory score score_predict sd_score_predict
## Length:7708 Min. : 5.54 Min. : 5.54 Min. : 0.000
## Class :character 1st Qu.:59.50 1st Qu.:60.37 1st Qu.: 0.000
## Mode :character Median :80.05 Median :78.88 Median : 0.000
## Mean :74.72 Mean :74.48 Mean : 0.892
## 3rd Qu.:93.24 3rd Qu.:92.67 3rd Qu.: 0.000
## Max. :99.80 Max. :99.80 Max. :29.140
## NA's :607
bhn_data <- bhn_tmp_long %>%
mutate(score = ifelse(is.na(score) & NA_tot < 4, score_predict, score)) %>%
group_by(Country, year) %>%
dplyr::summarize(
score = mean(score),
NA_tot = mean(NA_tot)) %>%
ungroup() %>%
mutate(category = "bhn")
## `summarise()` regrouping output by 'Country' (override with `.groups` argument)
## Country year score NA_tot
## Length:1927 Min. :2011 Min. :15.39 Min. :0.000
## Class :character 1st Qu.:2013 1st Qu.:59.80 1st Qu.:0.000
## Mode :character Median :2016 Median :81.51 Median :0.000
## Mean :2016 Mean :74.71 Mean :0.315
## 3rd Qu.:2018 3rd Qu.:89.43 3rd Qu.:0.000
## Max. :2020 Max. :98.50 Max. :4.000
## NA's :90
## category
## Length:1927
## Class :character
## Mode :character
##
##
##
##
Gapfill Opportunity (op) indicator
set.seed(227)
op_subs <- all_spi %>%
dplyr::filter(category %in% c("op")) %>%
spread(subcategory, score)
op_tmp <- all_spi %>%
dplyr::filter(category %in% c("op_score")) %>%
select(Country, year, op_score=score) %>%
left_join(op_subs, by=c("Country", "year"))
op_tmp <- op_tmp %>%
rowwise() %>%
mutate(NA_tot = sum(is.na(AccesstoAdvancedEducation), is.na(PersonalFreedomandChoice), is.na(PersonalRights),
is.na(Inclusiveness)))
## Ideally most NA_tot values are 0
hist(op_tmp$NA_tot)
##
## 0 1 2 3 4
## 1667 30 50 20 160
#md.pattern(select(op_tmp, -(1:4))) # mice package
imputes <- 50
op_gf <- aregImpute(~ AccesstoAdvancedEducation + PersonalFreedomandChoice + PersonalRights + Inclusiveness,
data = op_tmp, type = "regression", n.impute = imputes)
## Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
Iteration 31
Iteration 32
Iteration 33
Iteration 34
Iteration 35
Iteration 36
Iteration 37
Iteration 38
Iteration 39
Iteration 40
Iteration 41
Iteration 42
Iteration 43
Iteration 44
Iteration 45
Iteration 46
Iteration 47
Iteration 48
Iteration 49
Iteration 50
Iteration 51
Iteration 52
Iteration 53
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~AccesstoAdvancedEducation + PersonalFreedomandChoice +
## PersonalRights + Inclusiveness, data = op_tmp, n.impute = imputes,
## type = "regression")
##
## n: 1927 p: 4 Imputations: 50 nk: 3
##
## Number of NAs:
## AccesstoAdvancedEducation PersonalFreedomandChoice PersonalRights
## 190 180 220
## Inclusiveness
## 240
##
## type d.f.
## AccesstoAdvancedEducation s 2
## PersonalFreedomandChoice s 2
## PersonalRights s 2
## Inclusiveness s 1
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## AccesstoAdvancedEducation PersonalFreedomandChoice PersonalRights
## 0.731 0.745 0.714
## Inclusiveness
## 0.788
# to get mean and sd of all imputations
impute_scores_all <- data.frame()
for (imp in 1:imputes){
#imp = 1
## The imputed score values from 0 to 100
imputed <- impute.transcan(op_gf, imputation=imp, data=op_tmp, list.out=TRUE,
pr=FALSE, check=FALSE)
subcat_data <- data.frame(imputed)
impute_scores <- data.frame(Country = op_tmp$Country,
year = op_tmp$year,
imputation = imp)
impute_scores <- cbind(impute_scores, subcat_data)
impute_scores_all <- rbind(impute_scores_all, impute_scores)
}
## Convert columns from impute to numeric just to be safe and to prevent warning message from gathering
impute_scores_all$AccesstoAdvancedEducation <- as.numeric(impute_scores_all$AccesstoAdvancedEducation)
impute_scores_all$PersonalFreedomandChoice <- as.numeric(impute_scores_all$PersonalFreedomandChoice)
impute_scores_all$PersonalRights <- as.numeric(impute_scores_all$PersonalRights)
impute_scores_all$Inclusiveness <- as.numeric(impute_scores_all$Inclusiveness)
impute_scores_summary <- impute_scores_all %>%
gather("subcategory", "score", -(1:3)) %>%
group_by(Country, year, subcategory) %>%
dplyr::summarize(score_predict = mean(score),
sd_score_predict = sd(score)) %>%
ungroup()
## `summarise()` regrouping output by 'Country', 'year' (override with `.groups` argument)
op_tmp_long <- op_tmp %>%
select(Country, year, op_score, NA_tot, AccesstoAdvancedEducation,
PersonalFreedomandChoice, PersonalRights, Inclusiveness) %>%
gather("subcategory", "score", -(1:4)) %>%
left_join(impute_scores_summary, by=c("Country", "year", "subcategory"))
## should have no NAs in score_predict and sd_score_predict
summary(op_tmp_long)
## Country year op_score NA_tot
## Length:7708 Min. :2011 Min. :18.40 Min. :0.0000
## Class :character 1st Qu.:2013 1st Qu.:45.78 1st Qu.:0.0000
## Mode :character Median :2016 Median :55.86 Median :0.0000
## Mean :2016 Mean :57.24 Mean :0.4307
## 3rd Qu.:2018 3rd Qu.:70.14 3rd Qu.:0.0000
## Max. :2020 Max. :89.08 Max. :4.0000
## NA's :1040
## subcategory score score_predict sd_score_predict
## Length:7708 Min. : 4.00 Min. : 4.00 Min. : 0.000
## Class :character 1st Qu.:40.62 1st Qu.:41.67 1st Qu.: 0.000
## Mode :character Median :57.49 Median :57.35 Median : 0.000
## Mean :57.46 Mean :57.38 Mean : 1.612
## 3rd Qu.:74.43 3rd Qu.:73.15 3rd Qu.: 0.000
## Max. :98.57 Max. :98.57 Max. :27.581
## NA's :830
op_data <- op_tmp_long %>%
mutate(score = ifelse(is.na(score) & NA_tot < 4, score_predict, score)) %>%
group_by(Country, year) %>%
dplyr::summarize( #op_score_old = mean(op_score), # used this to test to make sure all is well, try `check_validate( )` below
score = mean(score),
NA_tot = mean(NA_tot)) %>%
ungroup() %>%
mutate(category = "op")
## `summarise()` regrouping output by 'Country' (override with `.groups` argument)
## Country year score NA_tot
## Length:1927 Min. :2011 Min. :18.39 Min. :0.0000
## Class :character 1st Qu.:2013 1st Qu.:46.34 1st Qu.:0.0000
## Mode :character Median :2016 Median :56.24 Median :0.0000
## Mean :2016 Mean :57.58 Mean :0.4307
## 3rd Qu.:2018 3rd Qu.:70.19 3rd Qu.:0.0000
## Max. :2020 Max. :89.08 Max. :4.0000
## NA's :160
## category
## Length:1927
## Class :character
## Mode :character
##
##
##
##
Gapfill Foundations of Wellbeing (fw) indicator
set.seed(227)
fw_subs <- all_spi %>%
dplyr::filter(category %in% c("fw")) %>%
spread(subcategory, score)
fw_tmp <- all_spi %>%
dplyr::filter(category %in% c("fw_score")) %>%
select(Country, year, fw_score=score) %>%
left_join(fw_subs, by=c("Country", "year"))
fw_tmp <- fw_tmp %>%
rowwise() %>%
mutate(NA_tot = sum(is.na(AccesstoBasicKnowledge), is.na(AccesstoInformationandCommunications), is.na(EnvironmentalQuality),is.na(HealthandWellness)))
## most values should be 0
hist(fw_tmp$NA_tot)
##
## 0 3 4
## 1747 90 90
#md.pattern(select(fw_tmp, -(1:4))) # mice package
imputes <- 50
fw_gf <- aregImpute(~ AccesstoBasicKnowledge + AccesstoInformationandCommunications + EnvironmentalQuality + HealthandWellness,
data = fw_tmp, type = "regression", n.impute = imputes)
## Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
Iteration 31
Iteration 32
Iteration 33
Iteration 34
Iteration 35
Iteration 36
Iteration 37
Iteration 38
Iteration 39
Iteration 40
Iteration 41
Iteration 42
Iteration 43
Iteration 44
Iteration 45
Iteration 46
Iteration 47
Iteration 48
Iteration 49
Iteration 50
Iteration 51
Iteration 52
Iteration 53
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~AccesstoBasicKnowledge + AccesstoInformationandCommunications +
## EnvironmentalQuality + HealthandWellness, data = fw_tmp,
## n.impute = imputes, type = "regression")
##
## n: 1927 p: 4 Imputations: 50 nk: 3
##
## Number of NAs:
## AccesstoBasicKnowledge AccesstoInformationandCommunications
## 180 180
## EnvironmentalQuality HealthandWellness
## 90 180
##
## type d.f.
## AccesstoBasicKnowledge s 2
## AccesstoInformationandCommunications s 2
## EnvironmentalQuality s 2
## HealthandWellness s 1
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## AccesstoBasicKnowledge AccesstoInformationandCommunications
## 0.585 0.698
## EnvironmentalQuality HealthandWellness
## 0.188 0.714
# to get mean and sd of all imputations
impute_scores_all <- data.frame()
for (imp in 1:imputes){
#imp = 1
imputed <- impute.transcan(fw_gf, imputation=imp, data=fw_tmp, list.out=TRUE,
pr=FALSE, check=FALSE)
subcat_data <- data.frame(imputed)
impute_scores <- data.frame(Country = fw_tmp$Country,
year = fw_tmp$year,
imputation = imp)
impute_scores <- cbind(impute_scores, subcat_data)
impute_scores_all <- rbind(impute_scores_all, impute_scores)
}
## Convert columns from impute to numeric just to be safe and to prevent warning message from gathering. Due to some impute values having *
impute_scores_all$AccesstoBasicKnowledge <- as.numeric(impute_scores_all$AccesstoBasicKnowledge)
impute_scores_all$AccesstoInformationandCommunications <- as.numeric(impute_scores_all$AccesstoInformationandCommunications)
impute_scores_all$EnvironmentalQuality <- as.numeric(impute_scores_all$EnvironmentalQuality)
impute_scores_all$HealthandWellness <- as.numeric(impute_scores_all$HealthandWellness)
impute_scores_summary <- impute_scores_all %>%
gather("subcategory", "score", -(1:3)) %>%
group_by(Country, year, subcategory) %>%
dplyr::summarize(score_predict = mean(score),
sd_score_predict = sd(score)) %>%
ungroup()
## `summarise()` regrouping output by 'Country', 'year' (override with `.groups` argument)
fw_tmp_long <- fw_tmp %>%
select(Country, year, fw_score, NA_tot, AccesstoBasicKnowledge,
AccesstoInformationandCommunications, EnvironmentalQuality, HealthandWellness) %>%
gather("subcategory", "score", -(1:4)) %>%
left_join(impute_scores_summary, by=c("Country", "year", "subcategory"))
summary(fw_tmp_long)
## Country year fw_score NA_tot
## Length:7708 Min. :2011 Min. :30.82 Min. :0.0000
## Class :character 1st Qu.:2013 1st Qu.:56.33 1st Qu.:0.0000
## Mode :character Median :2016 Median :67.91 Median :0.0000
## Mean :2016 Mean :67.35 Mean :0.3269
## 3rd Qu.:2018 3rd Qu.:78.46 3rd Qu.:0.0000
## Max. :2020 Max. :93.89 Max. :4.0000
## NA's :720
## subcategory score score_predict sd_score_predict
## Length:7708 Min. : 0.34 Min. : 0.34 Min. : 0.000
## Class :character 1st Qu.:53.86 1st Qu.:54.55 1st Qu.: 0.000
## Mode :character Median :70.39 Median :70.09 Median : 0.000
## Mean :67.57 Mean :67.47 Mean : 1.264
## 3rd Qu.:83.32 3rd Qu.:82.47 3rd Qu.: 0.000
## Max. :99.25 Max. :99.25 Max. :26.699
## NA's :630
fw_data <- fw_tmp_long %>%
mutate(score = ifelse(is.na(score) & NA_tot < 4, score_predict, score)) %>%
group_by(Country, year) %>%
dplyr::summarize( #fw_score_old = mean(fw_score), # used this to test to make sure all is well, try `check_that( )` below
score = mean(score),
NA_tot = mean(NA_tot)) %>%
ungroup() %>%
mutate(category = "fw")
## `summarise()` regrouping output by 'Country' (override with `.groups` argument)
Combine data to get Social Progress Index
The next step averages the 3 indicators and identifies gapfilling. All the rows where all four indicators are missing NA_tot = 4
will have an NA in the score
column.
spi_calc <- rbind(bhn_data, op_data, fw_data) %>%
mutate(gapfill = paste(category, NA_tot, sep = "_")) %>%
mutate(score = ifelse(score > 100, 100, score),
score = ifelse(score < 0, 0, score)) %>%
group_by(Country, year) %>%
dplyr::summarize(
score = mean(score),
method = paste(unique(gapfill), collapse=", "),
gapfill = sum(NA_tot)) %>%
ungroup() %>%
mutate(method = ifelse(gapfill == 0, NA, method),
gapfill = ifelse(gapfill >= 1, 1, 0)) %>% # change values to just yes or no
data.frame()
## `summarise()` regrouping output by 'Country' (override with `.groups` argument)
## Country year score method
## Length:1927 Min. :2011 Min. :27.60 Length:1927
## Class :character 1st Qu.:2013 1st Qu.:53.56 Class :character
## Mode :character Median :2016 Median :68.30 Mode :character
## Mean :2016 Mean :66.52
## 3rd Qu.:2018 3rd Qu.:78.71
## Max. :2020 Max. :93.08
## NA's :170
## gapfill
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1489
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Some scores might still be NA because no values exist for that country
# check:
check <- all_spi %>%
filter(subcategory=="SocialProgressIndex") %>%
dplyr::select(Country, year, true_score=score) %>%
left_join(spi_calc, by = c("Country", "year"))
## Perfect
plot(check$true_score, check$score)
abline(0,1, col="red")
write.csv(spi_calc, "int/Country_spi.csv", row.names=FALSE)
Assign countries to OHI regions
spi_calc <- read.csv("int/Country_spi.csv", stringsAsFactors = FALSE)
spi_calc <- spi_calc %>%
mutate(Country = case_when(
Country=="C<f4>te d'Ivoire" ~ "Cote d'Ivoire",
Country=="R<e9>union" ~ "Reunion",
Country=="Cura<e7>ao" ~ "Curacao",
Country=="Korea, Democratic Republic of" ~ "North Korea",
Country=="St. Helena" ~ "Saint Helena",
Country=="St. Martin" ~ "Northern Saint-Martin",
TRUE ~ Country # Everything else, leave it be
))
# Channel Islands are Jersey and Guernsey, but these are already in the data
spi_rgn <- name_2_rgn(df_in = spi_calc,
fld_name='Country',
flds_unique = c("year"))
##
## These data were removed for not having any match in the lookup tables:
##
## Côte d'Ivoire Eswatini
## 1 1
## Republic of North Macedonia Vatican City
## 1 1
##
## These data were removed for not being of the proper rgn_type (eez,ohi_region) or mismatching region names in the lookup tables:
## tmp_type
## tmp_name disputed landlocked
## Afghanistan 0 10
## Andorra 0 10
## Armenia 0 10
## Austria 0 10
## Belarus 0 10
## Bhutan 0 10
## Bolivia 0 10
## Botswana 0 10
## Burkina Faso 0 10
## Burundi 0 10
## Central African Republic 0 10
## Chad 0 10
## Czechia 0 10
## Ethiopia 0 10
## Hungary 0 10
## Kazakhstan 0 10
## Kosovo 0 10
## Kyrgyzstan 0 10
## Laos 0 10
## Lesotho 0 10
## Liechtenstein 0 10
## Luxembourg 0 10
## Malawi 0 10
## Mali 0 10
## Moldova 0 10
## Mongolia 0 10
## Nepal 0 10
## Niger 0 10
## Paraguay 0 10
## Rwanda 0 10
## San Marino 0 10
## Serbia 0 10
## Slovakia 0 10
## South Sudan 0 7
## Switzerland 0 10
## Tajikistan 0 10
## Turkmenistan 0 10
## Uganda 0 10
## Uzbekistan 0 10
## West Bank and Gaza 10 0
## Zambia 0 10
## Zimbabwe 0 10
# Weight the following duplicates by population
# China, Guadeloupe, Guam, Hong Kong, Macao, Martinique, Northern Mariana Islands, Puerto Rico, Virgin Islands (U.S.)... Populations taken from /globalprep/supplementary_information/v2020/pop_weights.csv for the most part.. and google
pop_weights <- data.frame(Country = c("China", "Hong Kong", "Macao",
"Guadeloupe", "Martinique",
"Guam", "Northern Mariana Islands",
"Puerto Rico", "Virgin Islands (U.S.)"),
pop = c(1439323774, 7496988, 649342,
395700, 376480,
165768, 56882,
2860840, 104423))
spi_rgn <- spi_rgn %>%
left_join(pop_weights, by = "Country") %>%
mutate(pop = ifelse(is.na(pop), 1, pop)) %>%
group_by(rgn_id, rgn_name, year) %>%
dplyr::summarize(score = weighted.mean(score, pop, na.rm=TRUE),
method = paste(unique(method), collapse=" "),
gapfill = weighted.mean(gapfill, pop, na.rm=TRUE))%>%
ungroup() %>%
dplyr::mutate(year = as.numeric(year))
## `summarise()` regrouping output by 'rgn_id', 'rgn_name' (override with `.groups` argument)
Compare to WGI
WGI is a couple years behind the SPI, so we will use 2018 WGI for the 2019 and 2020 SPI. There is a strong correlation betwen the WGI and SPI indicators.
wgi <- read.csv('../../prs_res_wgi/v2020/output/wgi_res.csv') %>%
select(rgn_id, year, wgi_score = resilience_score)
length(unique(wgi$rgn_id))
## [1] 220
wgi_2019 <- wgi %>%
filter(year == 2018) %>%
mutate(year = 2019)
wgi_2020 <- wgi %>%
filter(year == 2018) %>%
mutate(year = 2020)
wgi <- rbind(wgi, wgi_2019, wgi_2020)
wgi_spi <- wgi %>%
left_join(spi_rgn, by=c("rgn_id", "year"))
plot(wgi_spi$wgi_score*100, wgi_spi$score, ylab="Social Progress Index, score", xlab="Worldwide Governance Score")
abline(0,1, col="red")
mod <- lm(score ~ wgi_score, data=wgi_spi)
summary(mod)
##
## Call:
## lm(formula = score ~ wgi_score, data = wgi_spi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2443 -4.3401 -0.0147 4.4524 18.5688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.4479 0.5559 54.77 <2e-16 ***
## wgi_score 75.5363 1.0410 72.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.731 on 1338 degrees of freedom
## (3500 observations deleted due to missingness)
## Multiple R-squared: 0.7974, Adjusted R-squared: 0.7972
## F-statistic: 5265 on 1 and 1338 DF, p-value: < 2.2e-16
Second round of gapfilling
In this case, UN geopolitical regions and WGI scores are used to estimate regions with no data. Based on this analysis, a model that includes WGI data and r2 UN geopolitical regions is the best model to predict missing SPI values.
years <- data.frame(year = min(spi_rgn$year):max(spi_rgn$year))
rgns_gf_un <- georegions %>%
merge(years) %>%
left_join(spi_rgn, by = c("rgn_id","year")) %>%
left_join(wgi, by = c("rgn_id", "year")) %>%
mutate(r2 = as.factor(r2)) %>%
mutate(r1 = as.factor(r1))
## Compare models to select a gapfilling method
mod1 <- lm(score ~ r2, data = rgns_gf_un, na.action="na.exclude")
mod2 <- lm(score ~ r2 + wgi_score, data = rgns_gf_un, na.action="na.exclude")
mod3 <- lm(score ~ r1, data = rgns_gf_un, na.action="na.exclude")
mod4 <- lm(score ~ r1 + wgi_score, data = rgns_gf_un, na.action="na.exclude")
mod5 <- lm(score ~ wgi_score, data = rgns_gf_un, na.action="na.exclude")
mod6 <- lm(score ~ r1 + poly(wgi_score, 2), data = rgns_gf_un, na.action = "na.exclude")
mod7 <- lm(score ~ r1 + poly(wgi_score, 3), data = rgns_gf_un, na.action = "na.exclude")
## the lowest AIC score is likely the best model
AIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7)
## df AIC
## mod1 20 9546.803
## mod2 21 7806.484
## mod3 7 10000.908
## mod4 8 8294.064
## mod5 3 8916.915
## mod6 9 8257.783
## mod7 10 8256.915
## plot the models with the two lowest AIC scores
plot(predict(mod2), rgns_gf_un$score)
abline(0,1)
plot(predict(mod7), rgns_gf_un$score)
abline(0,1)
## Estimate missing data and gapfill
## need to add this because some R2 regions have no data
r2_regions <- unique(rgns_gf_un$r2[!is.na(rgns_gf_un$score)])
rgns_gf_un$r2 <- ifelse(rgns_gf_un$r2 %in% r2_regions, rgns_gf_un$r2, NA)
r1_regions <- unique(rgns_gf_un$r1[!is.na(rgns_gf_un$score)])
rgns_gf_un$r1 <- ifelse(rgns_gf_un$r1 %in% r1_regions, rgns_gf_un$r1, NA)
## Predict scores using r2 and wgi scores
mod_gf_r2 <- lm(score ~ r2 + wgi_score, data = rgns_gf_un, na.action = "na.exclude")
rgns_gf_un$score_pred_r2 <- predict(mod_gf_r2, newdata = rgns_gf_un[, c("r2", "wgi_score")])
## Predict scores using r1 and wgi scores
mod_gf_r1 <- lm(score ~ r1 + wgi_score, data = rgns_gf_un, na.action = na.exclude)
rgns_gf_un$score_pred_r1 <- predict(mod_gf_r1, newdata = rgns_gf_un[, c("r1", "wgi_score")])
## Predict scores just using wgi scores
mod_gf_wgi <- lm(score ~ wgi_score, data = rgns_gf_un, na.action = na.exclude)
rgns_gf_un$score_pred_wgi <- predict(mod_gf_wgi, newdata = data.frame(wgi_score = rgns_gf_un$wgi_score))
## Record gapfill methods
## Combine scores with predicted model scores
rgns_gf <- rgns_gf_un %>%
mutate(method = ifelse(is.na(score), "UN georgn & WGI", method)) %>%
mutate(gapfill = ifelse(is.na(score), "1", gapfill)) %>%
mutate(score = ifelse(is.na(score), score_pred_r2, score)) %>%
mutate(score = ifelse(is.na(score), score_pred_r1, score)) %>%
mutate(score = ifelse(is.na(score), score_pred_wgi, score)) %>%
select(rgn_id, year, score, method, gapfill) %>%
mutate(score = ifelse(score > 100, 100, score),
score = ifelse(score < 0, 0, score))
summary(rgns_gf) # should be no NA values
## rgn_id year score method
## Min. : 1.00 Min. :2011 Min. :29.68 Length:2200
## 1st Qu.: 58.75 1st Qu.:2013 1st Qu.:63.85 Class :character
## Median :116.50 Median :2016 Median :74.43 Mode :character
## Mean :117.64 Mean :2016 Mean :72.01
## 3rd Qu.:176.25 3rd Qu.:2018 3rd Qu.:83.41
## Max. :250.00 Max. :2020 Max. :96.61
## gapfill
## Length:2200
## Class :character
## Mode :character
##
##
##
length(unique(rgns_gf$rgn_id)) # should be 220 regions
## [1] 220
Uninhabited regions
These regions will receive an NA for their score (when established population is < 3000 people).
## uninhabited and low population regions
low_pop <- low_pop %>%
filter(est_population < 3000 | is.na(est_population))
rgns_gf_uninhab <- rgns_gf %>%
mutate(score = ifelse(rgn_id %in% low_pop$rgn_id, NA, score)) %>%
mutate(gapfill = ifelse(rgn_id %in% low_pop$rgn_id, NA, gapfill))
Save final data
#gf_2019 <- read_csv(file.path("../v2019/output/spi_gf.csv"))
gf_data <- rgns_gf_uninhab %>%
select(rgn_id, year, gapfill, method) %>%
mutate(gapfill = ifelse(is.na(method), 0, 1))
write.csv(gf_data, "output/spi_gf.csv", row.names=FALSE)
res_data <- rgns_gf_uninhab %>%
select(rgn_id, year, resilience_score=score) %>%
mutate(resilience_score = resilience_score/100)
write.csv(res_data, "output/spi_res.csv", row.names=FALSE)
prs_data <- rgns_gf_uninhab %>%
select(rgn_id, year, pressure_score=score) %>%
mutate(pressure_score = 1 - (pressure_score/100))
write.csv(prs_data, "output/spi_prs.csv", row.names=FALSE)
Data check
## Resilience comparison, filtered for 2019
new_spi_res <- read_csv(file.path("output/spi_res.csv")) %>%
filter(year == 2019) %>%
select(rgn_id, new_resilience_score = resilience_score)
## Parsed with column specification:
## cols(
## rgn_id = col_double(),
## year = col_double(),
## resilience_score = col_double()
## )
old_new <- read_csv("../v2019/output/spi_res.csv") %>%
filter(year == 2019) %>%
left_join(new_spi_res, by = 'rgn_id')
## Parsed with column specification:
## cols(
## rgn_id = col_double(),
## year = col_double(),
## resilience_score = col_double()
## )
plot(old_new$new_resilience_score, old_new$resilience_score)
abline(0, 1, col="red")
## Pressure score comparison, filtered for 2018
new_spi_prs <- read_csv(file.path("output/spi_prs.csv")) %>%
filter(year == 2018) %>%
select(rgn_id, new_pressure_score = pressure_score)
## Parsed with column specification:
## cols(
## rgn_id = col_double(),
## year = col_double(),
## pressure_score = col_double()
## )
old_new <- read_csv("../v2019/output/spi_prs.csv") %>%
filter(year == 2018) %>%
left_join(new_spi_prs, by = 'rgn_id')
## Parsed with column specification:
## cols(
## rgn_id = col_double(),
## year = col_double(),
## pressure_score = col_double()
## )
plot(old_new$new_pressure_score, old_new$pressure_score)
abline(0, 1, col="red")
## Pressure score comparison, All years
new_spi_prs <- read_csv(file.path("output/spi_prs.csv")) %>%
#filter(year == 2018) %>%
select(rgn_id, new_pressure_score = pressure_score)
## Parsed with column specification:
## cols(
## rgn_id = col_double(),
## year = col_double(),
## pressure_score = col_double()
## )
old_new <- read_csv("../v2019/output/spi_prs.csv") %>%
#filter(year == 2018) %>%
left_join(new_spi_prs, by = 'rgn_id')
## Parsed with column specification:
## cols(
## rgn_id = col_double(),
## year = col_double(),
## pressure_score = col_double()
## )
plot(old_new$new_pressure_score, old_new$pressure_score)
abline(0, 1, col="red")
Social Progress Index data
Organize data and gapfill missing countries that have incomplete data. This index is comprised of 3 indicators, which are each comprised of 4 subindicators, which are comprised of several datasets. If one of the subindicators are missing, the SPI is not calculated. The first round of gapfilling involves using relationships between the the subindicators to gapfill missing data. If a region is missing all subindicator data, then a second round of gapfilling is performed using relationships between UN geopolitical regions and the World Governance Indicator to gapfill the SPI score.
The following gets all years of data.
Gapfilling: Step 1
In this case, we use relationships between the subindicators to estimate missing data.
Gapfill Basic Human Need (bhn) indicator
Gapfill Opportunity (op) indicator
Gapfill Foundations of Wellbeing (fw) indicator
Combine data to get Social Progress Index
The next step averages the 3 indicators and identifies gapfilling. All the rows where all four indicators are missing
NA_tot = 4
will have an NA in thescore
column.Assign countries to OHI regions