Corinne Riddell
June 18, 2019
dplyr commands from the first session
ggplot2 package
These data are partial results from a study that I did on the difference in life expectancy between non-Hispanic Black and White men and women in the United States over time.
A subset of the results have been stored in the Data/ folder as a CSV file.
Do you remember which function to use to import CSV data into R?
Use readr’s read_csv() to import these data
library(readr) # readr is part of the tidyverse
le_data <- read_csv("../Data/Life-expectancy-by-state-long.csv")
## Parsed with column specification:
## cols(
## state = col_character(),
## stabbrs = col_character(),
## year = col_double(),
## sex = col_character(),
## Census_Region = col_character(),
## Census_Division = col_character(),
## LE = col_double(),
## race = col_character()
## )
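The message above is read_csv() reporting the column types it guessed. As a minimal sketch, the types can also be declared up front with the col_types argument. The file below is a small made-up CSV written to a temporary location (only three of the eight columns, purely for illustration), not the study data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (illustrative only)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("state,year,LE",
             "Alabama,1969,75.8",
             "Alabama,1970,75.9"),
           toy_path)

# Declaring the column types means read_csv() does not have to guess them
toy_data <- read_csv(toy_path,
                     col_types = cols(state = col_character(),
                                      year  = col_double(),
                                      LE    = col_double()))
toy_data
```

The same cols() specification printed in the message can be pasted into col_types for the real file.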
Function 1
head(le_data)
## # A tibble: 6 x 8
## state stabbrs year sex Census_Region Census_Division LE race
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Alabama AL 1969 Female South East South Central 75.8 white
## 2 Alabama AL 1969 Male South East South Central 66.6 white
## 3 Alabama AL 1970 Female South East South Central 75.9 white
## 4 Alabama AL 1970 Male South East South Central 66.7 white
## 5 Alabama AL 1971 Female South East South Central 76.2 white
## 6 Alabama AL 1971 Male South East South Central 66.9 white
Function 2
dim(le_data)
## [1] 7200 8
Function 3
names(le_data)
## [1] "state" "stabbrs" "year" "sex"
## [5] "Census_Region" "Census_Division" "LE" "race"
Function 4
str(le_data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 7200 obs. of 8 variables:
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ stabbrs : chr "AL" "AL" "AL" "AL" ...
## $ year : num 1969 1969 1970 1970 1971 ...
## $ sex : chr "Female" "Male" "Female" "Male" ...
## $ Census_Region : chr "South" "South" "South" "South" ...
## $ Census_Division: chr "East South Central" "East South Central" "East South Central" "East South Central" ...
## $ LE : num 75.8 66.6 75.9 66.7 76.2 ...
## $ race : chr "white" "white" "white" "white" ...
## - attr(*, "spec")=
## .. cols(
## .. state = col_character(),
## .. stabbrs = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. Census_Region = col_character(),
## .. Census_Division = col_character(),
## .. LE = col_double(),
## .. race = col_character()
## .. )
Function 5
View(le_data)
To RStudio!
head(): prints the first 6 rows of a data frame
dim(): prints the number of rows and columns
names(): prints the variable names
str(): shows the type of each variable and some values
View(): opens the viewer pane in RStudio
Make a scatter plot of the life expectancy for White men in California over time.
Since the dataset contains 39 states across two genders and two races, first use a function to subset the data to contain only White men in California.
Which function from Malcolm’s lesson do we need?
Use dplyr’s filter() to select a subset of rows
library(dplyr)
wm_cali <- le_data %>% filter(state == "California",
                              sex == "Male",
                              race == "white")
# this is equivalent:
wm_cali <- le_data %>% filter(state == "California" & sex == "Male" & race == "white")
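To check the equivalence for yourself, here is a small sketch on a made-up data frame (toy is hypothetical and only mimics the structure of le_data). Conditions separated by commas inside filter() are combined with AND, which is what & does explicitly.

```r
library(dplyr)

# Made-up rows that mimic the structure of le_data (values are illustrative)
toy <- tibble(state = c("California", "California", "California", "Alabama"),
              sex   = c("Male", "Female", "Male", "Male"),
              race  = c("white", "white", "black", "white"),
              LE    = c(70, 77, 65, 67))

# Conditions separated by commas are combined with AND ...
with_commas <- toy %>% filter(state == "California",
                              sex == "Male",
                              race == "white")

# ... which is the same as joining them with &
with_ampersand <- toy %>% filter(state == "California" & sex == "Male" & race == "white")

# Compare the two results
identical(with_commas, with_ampersand)
```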
ggplot(): set up a canvas
data set and what goes on the x and y axes
library(ggplot2)
ggplot(data = wm_cali, aes(x = year, y = LE))
ggplot(): tell ggplot how to plot the data
ggplot(data = wm_cali, aes(x = year, y = LE)) + geom_point()
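One way to see the difference between the canvas and the finished plot is to count the layers stored in each plot object. This is a sketch on made-up data (toy stands in for wm_cali), and it assumes the plot object exposes its layers as a list via $layers, which may vary across ggplot2 versions.

```r
library(ggplot2)

# Made-up data standing in for wm_cali (values are illustrative)
toy <- data.frame(year = c(1969, 1970, 1971),
                  LE   = c(68.0, 68.2, 68.5))

# The canvas has axes but nothing drawn on it yet
canvas <- ggplot(data = toy, aes(x = year, y = LE))

# Adding a geom adds one layer that draws the data as points
scatter <- canvas + geom_point()

length(canvas$layers)   # number of layers on the empty canvas
length(scatter$layers)  # number of layers after adding geom_point()
```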
geom_point() tells ggplot to use points to plot these data
col controls the color of geom_point()
ggplot(data = wm_cali, aes(x = year, y = LE)) + geom_point(col = "blue") +
  labs(title = "Life expectancy in White men in California, 1969-2013",
       y = "Life expectancy",
       x = "Year",
       caption = "Data from Riddell et al. (2018)")
size controls the size of geom_point()
ggplot(data = wm_cali, aes(x = year, y = LE)) + geom_point(col = "blue", size = 4) +
  labs(title = "Life expectancy in White men in California, 1969-2013",
       y = "Life expectancy",
       x = "Year",
       caption = "Data from Riddell et al. (2018)")
What if we wanted to make these data into a line plot instead? What part of the code should change?
ggplot(data = wm_cali, aes(x = year, y = LE)) +
geom_point(col = "blue", size = 4) +
labs(title = "Life expectancy in White men in California, 1969-2013",
y = "Life expectancy",
x = "Year",
caption = "Data from Riddell et al. (2018)")
Use geom_line() to make a line plot
ggplot(data = wm_cali, aes(x = year, y = LE)) + geom_line(col = "blue") +
  labs(title = "Life expectancy in White men in California, 1969-2013",
       y = "Life expectancy",
       x = "Year",
       caption = "Data from Riddell et al. (2018)")
What do we need to change to draw separate lines for Black and White men?
filter()
wbm_cali <- le_data %>% filter(state == "California",
sex == "Male")
ggplot(data = wm_cali, aes(x = year, y = LE)) + geom_line(col = "blue") +
labs(title = "Life expectancy in White men in California, 1969-2013",
y = "Life expectancy",
x = "Year",
caption = "Data from Riddell et al. (2018)")
ggplot(data = wbm_cali, aes(x = year, y = LE)) + geom_line(aes(col = race)) +
labs(title = "Life expectancy in Black and White men in California, 1969-2013",
y = "Life expectancy",
x = "Year",
caption = "Data from Riddell et al. (2018)")
The operative word is link. Whenever you want to link something about how the plot looks to a variable in the data frame, you need to link these items inside the aes() function:
ggplot(data = wbm_cali, aes(x = year, y = LE)) + geom_line(aes(col = race)) +
labs(title = "Life expectancy in Black and White men in California, 1969-2013",
y = "Life expectancy",
x = "Year",
caption = "Data from Riddell et al. (2018)")
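The contrast between setting a color and linking a color to a variable can be checked on a small example. This sketch uses made-up data (toy is hypothetical) and geom_point() to keep it short. It inspects the colors ggplot2 computes for each point with layer_data().

```r
library(ggplot2)

# Made-up data: two years for each of two race groups (values are illustrative)
toy <- data.frame(year = rep(c(1969, 1970), times = 2),
                  LE   = c(68.0, 68.2, 61.0, 61.3),
                  race = rep(c("white", "black"), each = 2))

base <- ggplot(data = toy, aes(x = year, y = LE))

# Outside aes(): one fixed color for every point
fixed <- base + geom_point(col = "blue")

# Inside aes(): color is linked to the race variable, one color per level
linked <- base + geom_point(aes(col = race))

unique(layer_data(fixed)$colour)   # colors used when col is set outside aes()
unique(layer_data(linked)$colour)  # colors used when col is linked to race
```

The linked version also gets a legend, because ggplot2 needs to tell the reader which color goes with which level of race.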
aes() function
aes() function?
aes() function
aes() function?
cali_data <- le_data %>% filter(state == "California")
ggplot(data = cali_data, aes(x = year, y = LE)) + geom_line(aes(col = race)) +
labs(title = "Life expectancy in California, 1969-2013",
y = "Life expectancy",
x = "Year",
caption = "Data from Riddell et al. (2018)")
Use lty inside aes() to link line type to sex
ggplot(data = cali_data, aes(x = year, y = LE)) + geom_line(aes(col = race, lty = sex)) +
  labs(title = "Life expectancy in California, 1969-2013",
       y = "Life expectancy",
       x = "Year",
       caption = "Data from Riddell et al. (2018)")
Use facet_wrap() to make separate plots for a specified variable
ggplot(data = cali_data, aes(x = year, y = LE)) +
  geom_line(aes(col = race, lty = sex)) +
  labs(title = "Life expectancy in California, 1969-2013",
       y = "Life expectancy",
       x = "Year",
       caption = "Data from Riddell et al. (2018)") +
  facet_wrap(~ sex)
How do we update the filter to include data from California and New York?
updated_data <- le_data %>% filter(state %in% c("California", "New York"))
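A common slip here is to write == instead of %in%. The sketch below uses a made-up data frame (toy is hypothetical) to show why that matters: == compares element by element and recycles the shorter vector, so it silently drops rows, while %in% checks each state against the whole set.

```r
library(dplyr)

# Made-up data: two rows per state (values are illustrative)
toy <- tibble(state = c("California", "California", "New York", "New York", "Texas", "Texas"),
              LE    = c(70, 77, 69, 76, 68, 75))

# %in% keeps every row whose state is one of the listed values
keep_in <- toy %>% filter(state %in% c("California", "New York"))

# == recycles c("California", "New York") down the column and compares row by row,
# so only rows that happen to line up with the recycled pattern are kept
keep_eq <- toy %>% filter(state == c("California", "New York"))

nrow(keep_in)  # all California and New York rows
nrow(keep_eq)  # fewer rows than expected
```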
#to fill in during class
ggplot(data = updated_data, aes(x = year, y = LE)) +
geom_line(aes(col = race, lty = sex)) +
labs(title = "Life expectancy in California and New York, 1969-2013",
y = "Life expectancy",
x = "Year",
caption = "Data from Riddell et al. (2018)") +
facet_grid(state ~ sex)
What is the difference between facet_wrap() and facet_grid()?
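One way to explore the question is to count the panels each function creates. This is a sketch on made-up data (toy is hypothetical), and it assumes the object returned by ggplot_build() stores one row per panel in $layout$layout, which may differ across ggplot2 versions.

```r
library(ggplot2)

# Made-up data: every combination of two years, two states and two sexes
toy <- expand.grid(year  = c(1969, 1970),
                   state = c("California", "New York"),
                   sex   = c("Female", "Male"),
                   stringsAsFactors = FALSE)
toy$LE <- seq(70, 77, length.out = nrow(toy))  # illustrative values only

base <- ggplot(data = toy, aes(x = year, y = LE)) + geom_point()

# facet_wrap(): one panel per level of a single variable
wrapped <- base + facet_wrap(~ sex)

# facet_grid(): rows for one variable, columns for the other
gridded <- base + facet_grid(state ~ sex)

nrow(ggplot_build(wrapped)$layout$layout)  # panels made by facet_wrap(~ sex)
nrow(ggplot_build(gridded)$layout$layout)  # panels made by facet_grid(state ~ sex)
```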
geom_point() to make scatter plots
geom_line() to make line plots
col = "blue", size = 2, lty = 2 to change color, size and line type of the geom
aes(col = race) to link color to race
aes(lty = sex) to link line type to sex
facet_wrap(~ var1) to make separate plots for different levels of one variable
facet_grid(var1 ~ var2) to make separate plots for combinations of levels of two variables
…of life expectancy of White men in 2013?
Before you code, try to visualize what the histogram will show.
filter
wm_data <- le_data %>% filter(year == 2013, sex == "Male", race == "white")
Use geom_histogram() to make histograms
ggplot(data = wm_data, aes(x = LE)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use fill to change the fill of the histogram and binwidth to specify the bin width
wm_data <- le_data %>% filter(year == 2013, sex == "Male", race == "white")
ggplot(data = wm_data, aes(x = LE)) +
  geom_histogram(binwidth = 1, col = "white", fill = "forest green")
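To see what binwidth changes, the sketch below compares the default of 30 bins with one-year-wide bins on a handful of made-up life expectancy values (toy is hypothetical, not the 2013 data). layer_data() returns one row per bar, so the row count is the number of bars drawn.

```r
library(ggplot2)

# Made-up life expectancy values (illustrative only)
toy <- data.frame(LE = c(73.1, 74.6, 75.2, 75.9, 76.3, 76.8, 77.4, 78.0))

# bins = 30 is the default that the stat_bin() message refers to
default_bins <- ggplot(data = toy, aes(x = LE)) + geom_histogram(bins = 30)

# binwidth = 1 makes each bar cover one year of life expectancy
one_year_bins <- ggplot(data = toy, aes(x = LE)) + geom_histogram(binwidth = 1)

nrow(layer_data(default_bins))        # number of bars with the default
nrow(layer_data(one_year_bins))       # number of bars with one-year bins
sum(layer_data(one_year_bins)$count)  # every observation lands in exactly one bar
```

Whichever binning is chosen, the bar heights always add up to the number of observations. Only how they are grouped changes.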
data_2013 <- le_data %>% filter(year == 2013)
ggplot(data = data_2013, aes(x = LE)) +
geom_histogram(binwidth = 1, col = "white", aes(fill = sex)) +
facet_grid(race ~ sex)
ggplot()
geom_point()
geom_line()
geom_histogram()
aes() to link aesthetics to variables in our data frame
facet_wrap(~ var1), facet_grid(var1 ~ var2)
labs(title = "Main", y = "y axis", x = "x axis", caption = "below plot")
ggplot()
col
size
lty
ggplot works, but you might be itching to learn more.