Chapter 5 Visualizing: ggplot2

5.1 Why ggplot2?

This tutorial focuses exclusively on data visualization using ggplot2 because this package:

  • provides a coherent language for visualizing data (vs. the original ‘plot’ function, which developed in an ad-hoc way)
  • makes many tasks easier, such as: visualizing a third (z) variable; saving figures; automating plotting and formatting tasks
  • creates beautiful figures and is flexible

You should really use ggplot2 for nearly all data visualization!

5.2 Additional resources for data visualization in R

The goal of this tutorial is to introduce you to the basics of ggplot2. I focus only on the features I use most often. We will barely scratch the surface of ggplot2’s functionality. But there are many resources that will help you learn more. Here are some of the ones I use:

  • The official ggplot2 cheatsheet is amazing!
  • Winston Chang’s book converted me from someone who was slightly confused by ggplot2 to a superuser….so I can’t recommend it enough! And, the associated website is great too.



5.3 Objectives

  • install the ggplot2 package by installing tidyverse
  • learn basics of ggplot2 with a dataset describing global ocean health
  • practice writing a script in RMarkdown
  • practice the rstudio-github workflow

5.4 Install tidyverse: tidyverse

We will use the ggplot2 package, which is bundled with a composite-package called tidyverse. Check out tidyverse.org/ for more information.

We will download and install tidyverse:

## from CRAN:
install.packages("tidyverse") ## do this once only to install the package on your computer.

You will see some messages describing which packages were installed with tidyverse. Messages about name conflicts are also returned. This is not a problem its just an alert that two functions from dplyr will replace functions in the built-in stats package with the same name.

5.5 Create an RMarkdown file to work from


To do: Create a new R Markdown file

We will create the RMarkdown file you will be working from. You can write notes to yourself in Markdown and have a record of all your R code.

Go to File > New File > R Markdown … (or click the green plus in the top left corner).

Next, replace all the automatic text with the following:

---
title: "Graphics with ggplot2"
author: "Julie"
date: "11/21/2017"
output: html_document
---

# Learning ggplot2

We're learning ggplot2 It's going to be amazing. 

Now, let’s save the file. I’m going to call my file ggplot2.Rmd. Do a pull…then stage/commit (to save to git history)…then push to sync with GitHub.com.


5.6 Load the tidverse

We have downloaded the tidyverse package but it is not yet loaded in the R workspace. We will use the library() function to load tidyverse. This will be the first code chunk in your RMarkdown:

library(tidyverse) ## do this every time you restart R and need it 

5.7 Load and explore data

We will load the dataset we will be using for this tutorial into our workspace directly from GitHub.com (I always find this GitHub functionality so cool!).

ohi_data <- read_csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/OHI_global_data.csv")

These data are from the global Ocean Health Index which assesses the condition of marine resources for 220 countries or territories.

The dataset includes:

variable description data type examples
country OHI region name for country or territorial regions with marine coastline character Jarvis Island, Italy
OHI_score Ocean Health Index scores describing condition of marine resources based on the 2017 global assessment numeric values can range from 0-100, with 100 being the best possible score
OHI_trend average annual change in OHI scores from 2012 to 2017 numeric values range from -3.9 to 1.86, with positive values indicating improving scores
coastal_pop human population within 25 miles of coast numeric values range from 0 to >317 million
log_coastal_pop log(coastal_pop + 1) numeric values range from 0 to ~20
cumulative_human_impact average cumulative impact of human stressors (e.g., SST, shipping, pollution) on marine ecosystems within the country’s marine boundaries numeric values range from 1.419 to 6.762, with larger values indicating higher impact
HDI Human Development Index scores, which measure average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and have a decent standard of living. numeric values are rescaled to be between 0 to 1, with higher scores indicating better human development
georegion_one lower resolution UN georegion designations based on georegional and social similarities character Africa, Americas, Asia
georegion_two higher resolution UN georegion designations based on georegional and social similarities character Australia and New Zealand, Eastern Asia, Caribbean


Let’s explore these data, shall we:

head(ohi_data)
## # A tibble: 6 x 9
##   country OHI_score OHI_trend coastal_pop log_coastal_pop cumulative_huma…
##   <chr>       <dbl>     <dbl>       <dbl>           <dbl>            <dbl>
## 1 South …      92.0      1.67           0               0             2.82
## 2 Crozet…      88.1      0.81           0               0             1.92
## 3 Howlan…      87.7      0.08           0               0             2.97
## 4 Heard …      87.4      0.46           0               0             2.78
## 5 Kergue…      87.3      0.72           0               0             2.02
## 6 Jarvis…      83.9     -0.25           0               0             1.85
## # … with 3 more variables: HDI <dbl>, georegion_one <chr>,
## #   georegion_two <chr>
summary(ohi_data)
##    country            OHI_score       OHI_trend        coastal_pop       
##  Length:221         Min.   :41.83   Min.   :-3.9900   Min.   :        0  
##  Class :character   1st Qu.:61.62   1st Qu.:-0.5600   1st Qu.:   106224  
##  Mode  :character   Median :67.82   Median :-0.1800   Median :  1066507  
##                     Mean   :66.87   Mean   :-0.2261   Mean   : 11471246  
##                     3rd Qu.:72.21   3rd Qu.: 0.1900   3rd Qu.:  7081969  
##                     Max.   :91.98   Max.   : 1.8600   Max.   :317149893  
##                                                                          
##  log_coastal_pop cumulative_human_impact      HDI        
##  Min.   : 0.00   Min.   :1.419           Min.   :0.4140  
##  1st Qu.:11.57   1st Qu.:3.553           1st Qu.:0.5970  
##  Median :13.88   Median :3.896           Median :0.7400  
##  Mean   :12.64   Mean   :3.861           Mean   :0.7159  
##  3rd Qu.:15.77   3rd Qu.:4.351           3rd Qu.:0.8270  
##  Max.   :19.57   Max.   :6.762           Max.   :0.9490  
##                                          NA's   :76      
##  georegion_one      georegion_two     
##  Length:221         Length:221        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
table(ohi_data$georegion_one)
## 
##                          Africa                        Americas 
##                              46                               3 
##                            Asia                          Europe 
##                              40                              43 
## Latin America and the Caribbean                         Oceania 
##                              48                              34 
##                Southern Islands 
##                               7
table(ohi_data$georegion_two)
## 
## Australia and New Zealand                 Caribbean 
##                         6                        28 
##           Central America            Eastern Africa 
##                         9                        17 
##              Eastern Asia            Eastern Europe 
##                         6                         5 
##                 Melanesia                Micronesia 
##                         5                        13 
##             Middle Africa           Northern Africa 
##                         7                         7 
##          Northern America           Northern Europe 
##                         3                        19 
##                 Polynesia             South America 
##                        10                        11 
##        South-Eastern Asia           Southern Africa 
##                        12                         2 
##             Southern Asia           Southern Europe 
##                         7                        14 
##          Southern Islands            Western Africa 
##                         7                        13 
##              Western Asia            Western Europe 
##                        15                         5

5.8 Plotting with ggplot2

Plots are created using a step-wise approach that is flexible and easily customized. We will walk through the steps needed to build a plot in ggplot2.

HINT: Start with the simplest version of your plot and build onto it.

5.8.1 Step 1: Identify the dataframe

ggplot2 requires the data to be in a dataframe format. The first step is to use the ggplot() function to identify the dataframe with the data you want to plot.

At this point, we will also assign the x and y axis variables within the aes function (this stands for aesthetics, and we will discuss this concept after we have a plot to work with).

[NOTE: This step was confusing for me for a long time because it doesn’t actually make the plot!]

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI))

5.8.2 Step 2: Identify the style of plot you want to create (i.e., the geom)

Next, we will actually create the plot by adding a geom function using the + operator. A geom is the geometrical object that a plot uses to represent data. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on.

You can use different geoms to plot the same data. To change the type of plot, change the geom function that you add to ggplot().

ggplot2 provides over 30 geoms, which are described in the official ggplot2 cheatsheet. To learn more about any single geom, use help: ?geom_smooth. Extension packages provide even more geoms (see https://www.ggplot2-exts.org for a sampling).

We will start by creating a scatterplot of ohi scores within UN georegions by adding the geom_point function.

ggplot(data = ohi_data, aes(x = georegion_one, y =OHI_score)) + 
  geom_point()

Yay! A plot!

Next we will replace geom_point() with geom_jitter() to create a style of plot with the same data. geom_jitter is similar to geom_point, but it adds random variation to values along the x-axis to separate points:

ggplot(data = ohi_data, aes(x = georegion_one, y =OHI_score)) + 
  geom_jitter(width=0.2) # the width argument describes how much scatter to add


Now we will explore other geoms.

The following bar plot uses geom_bar to describe the number of countries in each UN category:

ggplot(data = ohi_data, aes(x = georegion_one)) + 
  geom_bar() 


A histogram using geom_histogram:

ggplot(data = ohi_data, aes(x = HDI)) + 
  geom_histogram() 


Multiple geoms can be layered on the same plot. Geom layers can be added using either the same or different dataframes. To demonstrate this we will use a secondary dataset with the mean OHI score for each UN georegion. We will create a bar plot of the means for each region and then overlay the point data.

ohi_summary <- read_csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/OHI_scores_georegion_summary.csv")

ohi_summary
## # A tibble: 7 x 2
##   georegions                      OHI_score_average
##   <chr>                                       <dbl>
## 1 Africa                                       59.7
## 2 Americas                                     68.3
## 3 Asia                                         63.9
## 4 Europe                                       68.5
## 5 Latin America and the Caribbean              67.0
## 6 Oceania                                      75.0
## 7 Southern Islands                             80.1
ggplot(data = ohi_summary, aes(x = georegions, y = OHI_score_average)) + 
  geom_bar(stat="identity") +
  geom_jitter(data=ohi_data, aes(x=georegion_one, y=OHI_score))


In the above example, the global mappings are designated by: ggplot(data = ohi_summary, aes(x = georegions, y = OHI_score_average)). These are then overwritten by the local mappings designated in the geom_jitter layer. The global designations will apply to subsequent layers, unless otherwise changed. This makes it possible to display different aesthetics and data in different layers.




STOP: let’s Commit, Pull and Push to GitHub And discuss the following

  1. In the code below, why isn’t the data showing up?
ggplot(data = ohi_data, aes(y = OHI_score, x = HDI, color=georegion_one))


  1. Are these two approaches the same?
ggplot(data = ohi_data, aes(y=OHI_score, x = HDI, color=georegion_one)) +
  geom_point()
  
ggplot(data = ohi_data) +
  geom_point(aes(y = OHI_score, x = HDI, color=georegion_one))  


Answers

  1. The code is missing a geom to describe how the data should be plotted.
  2. These two approaches result in the same plot here, but there could be downstream effects as more layers are added.



More about the aes function

The arguments within aes() link variables in the dataframe to some aspect of plot appearance. As we have discussed, x and y describe the axes, but other arguments can be added to describe a z variable (e.g. size or color or shape of points). Here are some examples of aes arguments:

  • color color of lines/points
  • fill color within polygons
  • label name
  • linetype type of line
  • shape style of point
  • alpha transparency (0-1)
  • size size of shape

This is very powerful! Let’s explore.

Changing the point size:

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, size = coastal_pop)) + 
  geom_point()


Changing the color: continuous variable

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, color = coastal_pop)) + 
  geom_point()


Changing the color: discrete variable

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, color = georegion_one)) + 
  geom_point()


Changing shape of points

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, shape = georegion_one)) + 
  geom_point()


Adding labels

#This doesn't add the labels like it seems like it should:
ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, label=country)) + 
  geom_point(aes(x = OHI_score, y = HDI)) 

# To do this we have to add a geom_text function
ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, label=country)) + 
  geom_point(aes(x = OHI_score, y = HDI)) +
  geom_text()

5.8.3 Step 3: Customize your plot

So far, the plots we have created are fairly ugly and hard to read. We will now improve these basic plots using a stepwise approach.

5.8.3.1 Themes

I do not like the default ggplot2 figures because I find them too busy. A quick way to improve plot appearance is to use themes. Many themes are built into the ggplot2 package. For example, theme_bw() removes the gray background. Once you start typing theme_ a list of options will pop up.

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI)) + 
  geom_point() + 
  theme_bw()


The ggthemes package provides many additional themes (including a Tufte theme..which is very clean and data oriented). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

I often create my own themes to make figures that work well in publications or presentations. This also gives my figures a consistent look (without having to remember from figure to figure the size of labels, etc.). I have found that storing my themes on Github works well. Here is an example of a theme I created named “scatterTheme”:

source('https://raw.githubusercontent.com/OHI-Science/ohiprep/master/src/R/scatterTheme.txt')       

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI)) + 
  geom_point() + 
  scatterTheme

5.8.3.2 Labels: axis, plot, legend

One of the first things you will often want to do is alter labels for: titles, axes, figure legends, etc. These modifications involve manipulating the theme function and can quickly get complicated, and I typically have to Google the specifics or refer to a cheatsheet, despite using ggplot2 for years and years and years.

But, some key changes are easy to make using the labs function:

ggplot(data = ohi_data, aes(y = OHI_score, x = HDI, color=georegion_one)) + 
  geom_point() + 
    labs(y = "OHI score, 2017",
       x = "Human Development Index",
       title = "Countries with high human development have more sustainable oceans",
      color = "Georegion") +  # if color doesn't work, use "fill"
     theme_bw()




We interrupt this section on customizing your plot for an exercise (5 min)

  1. Make a histogram of the OHI_score variable and color by the georegion_one variable.
  2. How would you make all the bars on your histogram light gray? Hint: use argument fill = "lightgray". Where is the best place to add this in your code to get this to work?
  3. Play with some themes and customizing title and axes labels. Try changing the text sizes and angles (refer to the cheatsheet).


Answers (no peeking)

# 1. This one is tricky because the color aesthetic only does the outline of the bars.
# If you got that far, you were halfway there.  
# You will need to use the fill argument to color shapes 

my_plot <- ggplot(data = ohi_data, aes(x=OHI_score, fill=georegion_one)) +
  geom_histogram()  
my_plot


# 2. This was something that confused me for a while.  My instinct was to add it to the aes function....BUT NO!!!
# The arguments in aes should correspond to a column in the dataframe.  The best place for this argument is in 
# the geom function (and not in an aes function)
ggplot(data = ohi_data, aes(x=OHI_score)) +
  geom_histogram(fill="lightgray")  


# 3.
my_plot +    # the plot created in question 1 continued...
  labs(x = "OHI score",
       y = "Number of countries",
       title = "Distribution of OHI scores") +
  theme_light() +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16)
        )

STOP: commit, pull and push to github

We will now continue with customizing your plot…..



5.8.3.3 Global changes to plot attributes

It is easy to make global changes to your plot’s appearance (these go outside the aes function). Here are some options:

  • color color of lines/points
  • fill color within polygons
  • label if points are a character
  • linetype type of line
  • shape style of point
  • alpha transparency (0-1)
  • size size of shape

Here are the R shapes with their reference numbers:


Here are the R line types with their reference numbers:


With this information, we can improve the appearance of one of our previous figures:

ggplot(data = ohi_summary, aes(x = georegions, y = OHI_score_average)) + 
  geom_bar(stat="identity", fill = "lightgray") +
  geom_jitter(data=ohi_data, aes(x=georegion_one, y=OHI_score), color="red", size=3, alpha=0.3) +
  theme_bw()

5.8.3.4 Color

One of the more challenging aspects of creating a good plot is selecting colors. We are here to help!

Hint 1 Unless you are doing something very simple (e.g. 1-3 colors), I recommend using established color palettes. There are many color palette packages (here is a good resource: https://github.com/EmilHvitfeldt/r-color-palettes), but one of the best known is RColorBrewer.

Here we will explore using the RColorBrewer palettes.

Install the RColorBrewer and colorspace packages (only needs to be done once). Both RColorBrewer and colorspace have nice palettes and colorspace has additional functions for dealing with color.

install.packages("RColorBrewer")
install.packages("colorspace")

Load packages into working space:

library("RColorBrewer")
library("colorspace")

To see the available palettes in RColorBrewer:

display.brewer.all()

To select a palette:

my_palette <- brewer.pal(n=9, "YlOrRd")

Hint 2 R uses hexidecimal to represent colors. Hexadecimal is a base-16 number system used to describe color. Red, green, and blue are each represented by two characters (#rrggbb). Each character has 16 possible symbols: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F

“00” can be interpreted as 0.0 and “FF” as 1.0, i.e., red= #FF0000 , black=#000000, white = #FFFFFF

Two additional characters (with the same scale) can be added to the end to describe transparency (#rrggbbaa)

A color palette is actually just a simple vector of hexidecimal values:

my_palette
## [1] "#FFFFCC" "#FFEDA0" "#FED976" "#FEB24C" "#FD8D3C" "#FC4E2A" "#E31A1C"
## [8] "#BD0026" "#800026"

I always use hexidecimal format for colors in R because it is the most direct approach.

Hint 3 The function you will want to use to specify color in ggplot2 depends on whether the color is mapped to a discrete/categorical data variable, or a continuous variable. There are a LOT of options in ggplot2, but these are the approaches I have settled on because they are the most flexible:


In the following case we map color to a continuous variable, so we will use the scale_color_gradientn function with the palette we selected above:

ggplot(data = ohi_data, aes(x = OHI_score, y = OHI_trend, color = HDI)) + 
  geom_point(size =3) +
  scale_colour_gradientn(colors = my_palette)


This function takes a vector of colors and interpolates between the colors to get a gradient.

If we are mapping color to a discrete variable, we will use the scale_color_manual function:

# lets use a discrete color scale
my_palette <- brewer.pal(n=12, "Set3")

ggplot(data = ohi_data, aes(x = OHI_score, y = HDI, color = georegion_one)) + 
  geom_point(size = 3) +
  scale_color_manual(values = my_palette)

# Note the first 7 of the 12 colors are used in the plot

The scale_color_manual function also has a lot of great arguments that allow you to control which colors are associated with each factor level, the names used in the legend, and other controls.

Hint 4 If the “color” functions aren’t working, try the “fill” version of the function.



Challenge

With all of this information in hand, please take a few minutes to create a plot using the fake dataset we will create below (or one of the other datasets we have worked with).

# making a fake dataframe
fake_data <- data.frame(animal = rep(c("cat", "dog", "hamster"), each=10),
                        year = 2011:2020,
                        values = c(rnorm(n=10, 5, 1) * seq(0.1, 0.5, length.out=10),
                                   rnorm(n=10, 8, 1) * seq(0.1, 0.5, length.out=10),
                                   rnorm(n=10, 10, 1) * seq(0.1, 0.5, length.out=10)))

## Add your code to create a plot:

Answers (no peeking!)

Here is one approach, but there is a lot more to do to make this really nice!

library(ggthemes)

ggplot(data = fake_data, aes(x = as.factor(year), y = values, group=animal, color=animal)) + 
      geom_point(size = 3) +
      geom_line(size=2, alpha = 0.5) + 
      labs(x = "year", color = "") +
      theme_tufte()


5.9 Saving plots

After creating your plot, use the ggsave() function to save your plot. This function allows you easily change the dimension, resolution, and format of your plot:

my_plot <- ggplot(data = fake_data, aes(x = as.factor(year), y = values, group=animal, color=animal)) + 
      geom_point(size = 3) +
      geom_line(size=2, alpha = 0.5) + 
      labs(x = "year", color = "") +
      theme_tufte()

ggsave("name_of_file.png", my_plot, width = 15, height = 10, dpi=300)


Note: The parameters width and height also determine the font size in the saved plot.

5.10 Arranging plots

I am a huge fan of cowplots for making multiplot figures. Here is a nice introduction.

install.packages("cowplot")
library(cowplot)
## Error in library(cowplot): there is no package called 'cowplot'
score_vs_trend <- ggplot(data=ohi_data, aes(x=OHI_score, y=OHI_trend)) +
  geom_point(size=3, alpha=0.4)

score_vs_trend  # notice that the default theme has been changed....I really like this theme!

score_vs_HDI <- ggplot(data=ohi_data, aes(x=OHI_score, y=HDI)) +
  geom_point(size=3, alpha=0.4) + 
  geom_smooth()

plot_grid(score_vs_trend, score_vs_HDI, labels = c('A', 'B'))
## Error in plot_grid(score_vs_trend, score_vs_HDI, labels = c("A", "B")): could not find function "plot_grid"

5.11 Interactive plots

So as you can see, ggplot2 is a fantastic package for visualizing data. But there are some additional packages that let you make plots interactive. plotly, gganimate. I use plotly all the time to check my data!

install.packages("plotly")
library(plotly)
## Error in library(plotly): there is no package called 'plotly'
score_vs_HDI <- ggplot(data=ohi_data, aes(x=OHI_score, y=HDI, text=paste0("Country: ", country))) +
  geom_point(size=3, alpha=0.4)

ggplotly(score_vs_HDI)
## Error in ggplotly(score_vs_HDI): could not find function "ggplotly"

5.12 Save and push to GitHub