R Workshop: Geographic and Demographic Data Analysis and Visualization

Author

Michelle Bueno Vásquez

Published

April 28, 2025

Welcome! 🗺️

Setting up a Census API Key

Before we get started, I will have you all go ahead and request a Census API key which will allow us to access Census data for our analyses. It takes about 2-5 minutes to send, so let’s get started with this while we chat a little about spatial data 😊

Step-by-Step Instructions for Getting a Census API Key for `tidycensus`

The U.S. Census Bureau provides an API that allows users to access various datasets, including the American Community Survey (ACS) and the Decennial Census. To use tidycensus in R, you’ll need to obtain and register an API key.

Request a Census API Key

Go to the U.S. Census API Key Request Page:
Fill out the form:
- Enter your name
- Enter your email address
- Agree to the terms of service
Submit the request.
You will receive an API key via email (a long alphanumeric string).

Introduction to Geospatial Data in R

What is Geospatial Data?

Geospatial data refers to any data that has a geographic component, meaning it is tied to specific locations on the Earth’s surface. This data is used in mapping, spatial analysis, and geographic visualization.

There are two primary types of geospatial data:
1. Vector Data: Represents geographic features as points, lines, or polygons.
Examples:
- Points: Locations of schools, stores, or crime incidents
- Lines: Roads, rivers, or flight paths
- Polygons: State boundaries, land parcels, Census tracts

Raster Data: Represents spatial data as a grid of pixels, commonly used for continuous data like elevation, temperature, and satellite imagery.
- Example: A satellite image of land cover, where each pixel represents vegetation, water, or urban areas.

Common Geospatial Data Formats

Different formats are used to store and exchange geospatial data. Some of the most common include:

Shapefiles (.shp) – A widely used vector format consisting of multiple files that store geometry and attribute data.
GeoJSON (.geojson) – A JSON-based format for encoding spatial data, commonly used for web mapping.
KML (.kml) – A format developed by Google for geographic visualization in Google Earth and Maps.
Census Data (via API or TIGER/Line Shapefiles) – Demographic and boundary data provided by the U.S. Census Bureau.

Overview of Key R Packages for Geospatial Data

R provides powerful packages for working with geospatial data. Here are the key ones we’ll use in this workshop:

`sf` (Simple Features)

Provides a modern approach to handling vector geospatial data in R.
Replaces older packages like sp and rgdal.
Supports reading, writing, and manipulating spatial data.

# Install the package `sf`
# install.packages("sf")

# Load the package
library(sf)

`tigris`

Retrieves geographic boundary data from the U.S. Census Bureau (e.g., states, counties, tracts).
Works with sf for mapping and spatial analysis.

Example: Download state boundaries

# install.packages('tigris')
library(tigris)

# We can download boundary data for the US states
states_sf <- states(cb = TRUE,
                    resolution = "20m",
                    progress_bar = FALSE)

`maps`

Provides simple built-in maps for U.S. states, counties, and world boundaries.
Useful for quick visualizations.

Example: Plot a basic map of the U.S.

# Install 'maps'
# install.packages('maps')

# Load maps alone with tidyverse to access ggplot
library(tidyverse)
library(maps)

# Download U.S. map data
us_states <- map_data("state")

# Plot the U.S. map
ggplot(us_states, 
       aes(long, lat, 
           group = group)) + 
  geom_polygon(fill = "lavender", 
               color = "black") +
  theme_void()

This is great, but we’re missing Alaska, Hawaii, Puerto Rico, etc. We can use tigris instead for the same purposes.

`ggplot2`’s `geom_sf` for Geospatial Visualization

geom_sf() allows for powerful mapping of spatial data.
Enables custom styling and integration with non-spatial data.

# Let's plot this using the `sf` object from tigris
ggplot(states_sf) +
    geom_sf(fill = "lavender", color = "black")

But notice how our map looks very tiny and spread out. 😱: That is because the shape file is using raw longitude and latitude. Since this plot is very hard to read, we can use a transformation to shift the non-continental U.S. territories near the rest of the U.S.:

# We'll use the `tigris` function `shift_geometry` when saving the states file
states_sf <- states(cb = TRUE,
                    resolution = "20m") %>% 
  shift_geometry()

# Now we can try plotting again:
ggplot(states_sf) +
    geom_sf(fill = "lavender", 
            color = "black")

I like to add theme_void to get rid of those coordinate lines:

# We'll use the `tigris` function `shift_geometry` when saving the states file
states_sf <- states(cb = TRUE,
                    resolution = "20m") %>% 
  shift_geometry()

# Now we can try plotting again:
ggplot(states_sf) +
    geom_sf(fill = "lavender", 
            color = "black") +
  theme_void()

Summary

Geospatial data is any data with a geographic component, stored in vector (points, lines, polygons) or raster (grids) formats.
Common formats include Shapefiles, GeoJSON, KML, and Census data.
Key R packages:
- sf for spatial data manipulation.
- tidycensus for retrieving Census data.
- tigris for downloading geographic boundaries.
- maps for built-in map data.
- ggplot2 for visualization.

Exercise: plot your home state and city

I’ll do the same for Illinois and Chicago

# get the data for IL counties
# get data
il_counties <- map_data("county", "illinois") %>% 
  select(lon = long, lat, group, id = subregion)


# Chicago and Evanston lon and lat (looked it up on Google)
# Chicago: 41.8781° N, -87.6298° W
# Evanston: 42.0521° N, -87.6848° W
il_city_data <- tibble(
  city_name = c("Chicago", "Evanston"),
  lon = c(-87.6298, -87.6848),
  lat = c(41.8781, 42.0521)
)


# Make map with cities
ggplot(il_counties, aes(lon, lat)) +
  geom_polygon(aes(group = group), 
               fill = "slateblue", 
               col = 'grey50') +
  ggstar::geom_star(data = il_city_data, 
                    size = 4,
                    fill = "gold",
                    mapping = aes(fill = city_name)) +
 geom_label(data = il_city_data, 
            aes(label = city_name), 
            nudge_x = c(-.5, -.8),
            nudge_y = c(-.3, 0),
            size = 3.2) +
  coord_quickmap() +
  theme_void() +
  ggtitle("Illinois: the Lincoln State")

YOUR TURN 🙌 :

Using my code above as a guide, plot your home state/any state of your choosing and its capital and/or your home town.

Using tidycensus for Census Data Retrieval and Visualization

What is tidycensus?

The tidycensus package is an R interface to the U.S. Census Bureau API. It is useful for mapping, analyzing, and visualizing demographic data in R.

It allows users to retrieve demographic and socioeconomic data from:

The American Community Survey (ACS) (yearly estimates)
The Decennial Census (10-year population counts)
Other Census datasets

Here are a few of the commands you can use to access Census Bureau products:

get_acs(), which requests data from the 1-year and 5-year American Community Survey samples. Data are available from the 1-year ACS back to 2005 and the 5-year ACS back to 2005-2009.
get_estimates(), an interface to the Population Estimates APIs. These datasets include yearly estimates of population characteristics by state, county, and metropolitan area, along with components of change demographic estimates like births, deaths, and migration rates.
get_flows(), an interface to the ACS Migration Flows APIs. Includes information on in- and out-flows from various geographies for the 5-year ACS samples, enabling origin-destination analyses.
get_pums(), which accesses data from the ACS Public Use Microdata Sample APIs. These samples include anonymized individual-level records from the ACS organized by household and are highly useful for many different social science analyses.¹

Install and Load `tidycensus` in R

You should have received a Census API key to the email you entered earlier.

# Install the `tidycensus`
# install.packages("tidycensus")

# Load the package
library(tidycensus)

### Register API key 
## Option 1: temporary, save it only for this session 
## (delete # in front of the following line when ready to run code)

# census_api_key("YOUR_API_KEY_HERE")

# Option 2 (my recommendation): Save the API Key for Future Use 
## to avoid entering the key every time, store it permanently: 
## (delete # in front of the following line when ready to run code)

# census_api_key("YOUR_API_KEY_HERE", install = TRUE)

Replace "YOUR_API_KEY_HERE" with the key you received.

Option 2 will save the key to your .Renviron file, so it loads automatically in future R sessions. Restart R for the change to take effect.

Getting ACS Data

The ACS provides yearly estimates on income, education, race, and other demographics. Use get_acs() to retrieve ACS data.

Example: Get median household income by state for 2024

Let’s get some ACS data and test the API Key

To verify that the key is working, try fetching some ACS (American Community Survey) data.

First, let’s load the variables in the ACS for the year we’d like to gather data from. We use the call load_variables

We can also find them listed on the Census website

# Load variable list
vars23 <- load_variables(2023, "acs5", cache = TRUE)

# We can take a look and search for variables of interest
view(vars23)

But as you can see, it can be overwhelming to search. Here is a trick to quicken our search:

# Let's filter all variables with 'median household income' in the label column
income_vars <- vars23 %>% 
    filter(grepl('median household income', label, ignore.case = TRUE))

view(income_vars)

The variable B19013_001 with label Estimate!!Median household income in the past 12 months (in 2023 inflation-adjusted dollars) and concept Median Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars) looks perfect for our needs.

Now that we know our variable of interest, let’s define our ACS call by state for 2023:

# Using the get_acs function, we can get Census data for specific geogprahies and specific variables
df <- get_acs(geography = "state", 
              variables = "B19013_001",  # Median Household Income
              year = 2023,
              progress_bar = FALSE)

# Let's take a sneak peak of our data frame
head(df)

# A tibble: 6 × 5
  GEOID NAME       variable   estimate   moe
  <chr> <chr>      <chr>         <dbl> <dbl>
1 01    Alabama    B19013_001    62027   400
2 02    Alaska     B19013_001    89336  1374
3 04    Arizona    B19013_001    76872   414
4 05    Arkansas   B19013_001    58773   503
5 06    California B19013_001    96334   298
6 08    Colorado   B19013_001    92470   483

If the data loads without errors, your API key is successfully set up!

The dataset includes:

GEOID (geographic ID)
NAME (state name)
estimate (median income value)
moe (margin of error)

Combining Census Data with Geospatial Data

We can skip a step by requesting geographic data simultaneously with Census data, using the argument geometry = TRUE

income_sf <- get_acs(geography = "state", 
                     variables = c(median_income = "B19013_001"), 
                     year = 2023, 
                     geometry = TRUE,
                     progress_bar = FALSE) %>% # Get spatial data
              shift_geometry()            # Don't forget or else we get a weird map!

Now we can create different types of visualizations

# Install viridis color package
# install.packages('viridis')
library(viridis)

# Plot
ggplot(income_sf) +
  geom_sf(aes(fill = estimate), 
          color = "white") +
  scale_fill_viridis_c(option = "magma", 
                       name = "Median Income") +
  theme_void() +
  labs(title = "Median Household Income by State (2023)",
       caption = "Source: U.S. Census Bureau ACS 2023\n*note: Puerto Rico not included")

Key Features of this Plot:

aes(fill = estimate) colors states by income
scale_fill_viridis_c() improves color contrast
theme_minimal() gives a clean look

Okay, the map looks good but that scaling on the legend is kind of ugly.

Let’s fix that:

# Install and load the 'scales' library
# install.packages('scales')
library(scales)

# Plot with pretty numbers
options(scipen = 999)        # this will prevent the plots from producing a default scientific notation 

ggplot(income_sf) +
  geom_sf(aes(fill = estimate), 
          color = "white") +
  scale_fill_viridis_c(option = "magma", 
                       name = "Median Income",
                       labels = scales::comma) +    # this will add  a comma for thousands
  theme_void() +
  labs(title = "Median Household Income by State (2023)",
       caption = "Source: U.S. Census Bureau ACS 2023\n*note: Puerto Rico not included")

Top states by income

top_income <- income_sf %>%
  arrange(desc(estimate)) %>%
  slice(1:10)

ggplot(top_income, aes(x = reorder(NAME, estimate), y = estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +  # Flip for readability
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +       # same scale trick, but note we use `scale_y_continuous` since this time we are targeting the x-axis
  labs(title = "Top 10 States by Median Income (2023)",
       x = "State", y = "Median Income",
       caption = "Source: U.S. Census Bureau ACS 2023")

We can also use scatter plots along with maps to display population

For this, gathering population per county and mapping it onto states may be best

# Get county-level population data
county_pop <- get_acs(
  geography = "county",
  variables = "B01003_001",  # Total population variable
  year = 2023,
  geometry = TRUE,
  progress_bar = FALSE
) %>% 
  shift_geometry()

# Extract centroid coordinates for plotting as points
county_pop <- county_pop %>%
  mutate(
    lon = st_coordinates(st_centroid(geometry))[,1],  # Longitude
    lat = st_coordinates(st_centroid(geometry))[,2]   # Latitude
  )  
#  filter(estimate >= 50000) # only keep counties with more than 50k people

# Check data structure
head(county_pop)

Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 139720.3 ymin: -780463.1 xmax: 1044933 ymax: -368685.2
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
  GEOID                    NAME   variable estimate moe
1 01003 Baldwin County, Alabama B01003_001   239945  NA
2 01069 Houston County, Alabama B01003_001   107628  NA
3 01005 Barbour County, Alabama B01003_001    24757  NA
4 01119  Sumter County, Alabama B01003_001    12020  NA
5 05091 Miller County, Arkansas B01003_001    42588  NA
6 05133 Sevier County, Arkansas B01003_001    15797  NA
                        geometry       lon       lat
1 MULTIPOLYGON (((765298 -780...  789795.1 -722243.2
2 MULTIPOLYGON (((975268 -651... 1014495.8 -652117.0
3 MULTIPOLYGON (((966818.1 -6...  997016.3 -573843.3
4 MULTIPOLYGON (((709323.3 -5...  727423.5 -519270.6
5 MULTIPOLYGON (((180424 -440...  194967.8 -466481.6
6 MULTIPOLYGON (((139720.4 -3...  161278.0 -390636.1

Aggregate Population Data to State Level

state_pop <- county_pop %>%
  group_by(GEOID = substr(GEOID, 1, 2)) %>%  # Use first 2 digits of GEOID (State FIPS)
  summarise(total_population = sum(estimate, na.rm = TRUE))

# Load state geometries
states_sf <- get_acs(
  geography = "state",
  variables = "B01003_001",  # Total population variable
  year = 2023,
  geometry = TRUE,
  progress_bar = FALSE
) %>% 
  shift_geometry()

Now we plot:

ggplot() +
  # Draw state boundaries
  geom_sf(data = states_sf, fill = "gray90", color = "white", size = 0.3) +

  # Overlay county population as proportional dots using geom_point()
  geom_point(data = county_pop, aes(x = lon, y = lat, size = estimate), 
             color = "coral", alpha = 0.2) +

  # Scale dot sizes to be visually proportional
  scale_size_continuous(
    range = c(1, 10),  # Adjust min/max dot sizes
    labels = comma,  # Format population labels with commas
    name = "County Population"
  ) +

  # Titles and theme
  theme_void() +
  labs(
    title = "County Population Distribution in the U.S.",
    subtitle = "Larger circles represent higher county populations",
    caption = "Source: U.S. Census Bureau ACS 2022"
  )

We can also adjust color to the population size:

# Plot
ggplot() +
  # Draw state boundaries
  geom_sf(data = states_sf, 
          fill = "gray70", 
          color = "white", 
          size = 0.3) +

  # Overlay county population as proportional dots, with adjusted transparency
  geom_point(data = county_pop, 
             aes(x = lon, 
                 y = lat, 
                 size = estimate, 
                 color = estimate, 
                 alpha = estimate)) +

  # Adjust alpha (transparency): Larger values = more opaque
  scale_alpha_continuous(range = c(0.4, 1), 
                         guide = "none") +  # Avoid cluttering the legend

  # Unified legend for size and color (ordered ascending)
  scale_size_area(
    max_size = 10,  # Adjust max dot size
    labels = comma,  # Format population labels with commas
    name = "County Population",
    guide = guide_legend(order = 1)  # Ensure size legend comes first
  ) +
  scale_color_viridis_c(
    #option = "mako",  # Color scheme (magma = best for ordered data)
    labels = comma,  # Format labels
    direction = -1,   # reverse so darker = larger population
    name = "County Population",
    guide = guide_colorbar(order = 2)  # Ensure color legend comes second
  ) +

  # Titles and theme
  theme_void() +
  labs(
    title = "County Population Distribution (50K+ Population)",
    subtitle = "Larger and darker circles represent more populous counties (more opaque)",
    caption = "Source: U.S. Census Bureau ACS 2022"
  )

YOUR TURN 🙌 :

Choose another variable of interest and gather Census ACS data & plot it using the above code as a guide

Common ACS Variables

Here are a few frequently used variables:

Variable Name	Description
B19013_001	Median Household Income
B17001_002	Population Below Poverty Level
B25077_001	Median Home Value
B01003_001	Total Population
B03002_003	Non-Hispanic White Population
B03002_004	Black Population
B03002_012	Hispanic or Latino Population

Here are other examples of useful variables:

Demographic Analysis (Population by Race, Ethnicity, Age, Gender)

🔸 Use Case: Population distributions, racial/ethnic composition, gender breakdowns.
Variable Name	Description
B01003_001	Total Population
B01002_001	Median Age
B01001_002	Male Population
B01001_026	Female Population
B03002_003	Non-Hispanic White Population
B03002_004	Black or African American Population
B03002_012	Hispanic or Latino Population
B03002_006	Asian Population

Socioeconomic Status (SES) & Income Inequality

🔸 Use Case: Economic disparities, income inequality, housing affordability.
Variable Name	Description
B19013_001	Median Household Income
B19001_002 – B19001_017	Household Income Brackets
B17001_002	Population Below Poverty Level
B19301_001	Per Capita Income
B25077_001	Median Home Value
B25064_001	Median Gross Rent

Education & Employment

🔸 Use Case: Workforce analysis, educational attainment, economic mobility.
Variable Name	Description
B15003_022	Bachelor’s Degree or Higher
B15003_017	High School Graduates
B23025_002	Civilian Labor Force
B23025_005	Unemployment Count
B23025_007	Unemployment Rate

Housing & Urban Development

🔸 Use Case: Gentrification studies, housing affordability, displacement trends.
Variable Name	Description
B25077_001	Median Home Value
B25064_001	Median Gross Rent
B25034_001	Housing Units by Year Built
B25002_003	Vacant Housing Units
B25024_002	Single-Family Homes
B25024_010	Mobile Homes

Health & Disability

🔸 Use Case: Healthcare access, disability prevalence, environmental health.
Variable Name	Description
B18101_001	Population with a Disability
B27010_001	Population with Health Insurance Coverage
B27010_017	Population Uninsured
B25099_001	Households with No Plumbing Facilities

Using state bins

This is another approach for data visualization of state data where we ditch the state’s natural shape to emphasize comparisons among them.

# Install and load 'statebins'
# install.packages('statebins')
library(statebins)


# Using the income dataframe from earlier, let's add income levels
US_income <- mutate(
  income_sf,
  income_bins = cut(
    ifelse(is.na(estimate), 25000, estimate),
    breaks = c(0, 40000, 50000, 60000, 70000, 80000, 150000),
    labels = c("< $40k", "$40k to $50k", 
               "$50k to $60k", "$60k to $70k", "$70k - $80k", " > $80k"),
    right = FALSE
  )
)

# Plotting
US_income %>% 
  ggplot() +
  statebins::geom_statebins(aes(geometry = geometry,
                                state = NAME, 
                                fill = income_bins)) +
  scale_fill_viridis_d() +
  theme_statebins() +
  labs(title = "Median Household Income by State (2023)",
       caption = "Source: U.S. Census Bureau ACS 2023. Note: Puerto Rico not included")

Advanced example: using multiple variables at once

# Let's gather data for income and poverty for states in 2023
data <- get_acs(geography = "state", 
                variables = c(income = "B19013_001", 
                              poverty = "B17001_002",
                              population = "B01003_001"),
                year = 2023,
                geometry = TRUE,
                progress_bar = FALSE) %>% 
  shift_geometry()

head(data)

Simple feature collection with 6 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -1233579 ymin: -614973.5 xmax: -35340.83 ymax: 970427.9
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
  GEOID         NAME   variable estimate  moe                       geometry
1    35   New Mexico population  2114768   NA MULTIPOLYGON (((-1231344 -5...
2    35   New Mexico    poverty   375381 7817 MULTIPOLYGON (((-1231344 -5...
3    35   New Mexico     income    62125  608 MULTIPOLYGON (((-1231344 -5...
4    46 South Dakota population   899194   NA MULTIPOLYGON (((-633765.6 8...
5    46 South Dakota    poverty   104511 3107 MULTIPOLYGON (((-633765.6 8...
6    46 South Dakota     income    72421  805 MULTIPOLYGON (((-633765.6 8...

Notice how our variables are all listed under the variable column. For our analyses, it would be fruitful to have the variables as their own columns.

To do this, we will pivot the data long-ways so that each variable is its own column.

# Transpose the data so each variable has its own column}
data_wide <- data %>%
  select(GEOID,
         state = NAME, 
         variable, 
         estimate, 
         geometry) %>%  # Keep relevant columns
  distinct(GEOID, 
           state, 
           variable, 
           estimate, 
           .keep_all = TRUE) %>%  # Remove duplicate rows
  pivot_wider(
    names_from = variable,
    values_from = estimate
  ) %>% 
  mutate(poverty_rate = (poverty / population) * 100)  # Compute Poverty Rate

# Let's look at our new data set:
head(data_wide)

Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2356114 ymin: -782290.7 xmax: 1419556 ymax: 970427.9
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
# A tibble: 6 × 7
  GEOID state                    geometry population poverty income poverty_rate
  <chr> <chr>          <MULTIPOLYGON [m]>      <dbl>   <dbl>  <dbl>        <dbl>
1 35    New Me… (((-1231344 -588235.8, -…    2114768  375381  62125         17.8
2 46    South … (((-633765.6 865552.6, -…     899194  104511  72421         11.6
3 06    Califo… (((-2066923 -203083.1, -…   39242785 4610600  96334         11.7
4 21    Kentuc… (((584560 -88603.17, 585…    4510725  707480  62417         15.7
5 01    Alabama (((760323.7 -749311.5, 7…    5054253  768185  62027         15.2
6 13    Georgia (((1390722 -584139.2, 13…   10822590 1423159  74664         13.1

# Install package`RColorBrewer`
# install.packages('RColorBrewer')

library(RColorBrewer)  # For color scales

# Define bivariate color scale using ColorBrewer
bivariate_palette <- c(
  "Low-Low" = brewer.pal(9, "PuBuGn")[1],  # Light green
  "Low-Medium" = brewer.pal(9, "PuBuGn")[3],
  "Low-High" = brewer.pal(9, "PuBuGn")[5],
  "Medium-Low" = brewer.pal(9, "Purples")[3],
  "Medium-Medium" = brewer.pal(9, "Purples")[5],
  "Medium-High" = brewer.pal(9, "Purples")[7],
  "High-Low" = brewer.pal(9, "Oranges")[3],
  "High-Medium" = brewer.pal(9, "Oranges")[5],
  "High-High" = brewer.pal(9, "Oranges")[7]   # Dark orange
)


# Assign colors to the dataset
data_wide <- data_wide %>%
  mutate(
    income_cat = cut(income, 
                     breaks = quantile(income, 
                                       probs = seq(0, 1, length.out = 4), 
                                       na.rm = TRUE), 
                     labels = c("Low", "Medium", "High"), 
                     include.lowest = TRUE),
    poverty_cat = cut(poverty_rate, 
                      breaks = quantile(poverty_rate, 
                                        probs = seq(0, 1, length.out = 4), 
                                        na.rm = TRUE), 
                      labels = c("Low", "Medium", "High"), 
                      include.lowest = TRUE),
    category = paste(income_cat, 
                     poverty_cat, 
                     sep = "-")  # Create bivariate categories
  )

How This Works

Each state (or region) falls into one of nine possible categories based on:

Income level: Low, Medium, or High
Poverty rate: Low, Medium, or High

Each category is assigned a unique color using three different ColorBrewer palettes:

PuBuGn (Blue-Green shades) → Used for Low Income
Purples (Purple shades) → Used for Medium Income
Oranges (Orange shades) → Used for High Income

Darker colors indicate higher poverty rates, making the map intuitive:

“Low-Low” (Low income, Low poverty) → Lightest Green
“Low-High” (Low income, High poverty) → Darker Green
“High-Low” (High income, Low poverty) → Light Orange
“High-High” (High income, High poverty) → Darkest Orange

Breakdown of Color Assignments

I chose the colors using the link for the Brewer Color Palettes.

Category (Income-Poverty)	Color Palette Used	Shade
Low-Low (Low income, Low poverty)	PuBuGn (Blue-Green)	Lightest Green ([1])
Low-Medium (Low income, Medium poverty)	PuBuGn	Medium Green ([3])
Low-High (Low income, High poverty)	PuBuGn	Darker Green ([5])
Medium-Low (Medium income, Low poverty)	Purples	Light Purple ([3])
Medium-Medium (Medium income, Medium poverty)	Purples	Medium Purple ([5])
Medium-High (Medium income, High poverty)	Purples Darker Purple ([7])
High-Low (High income, Low poverty)	Oranges	Light Orange ([3])
High-Medium (High income, Medium poverty)	Oranges	Medium Orange ([5])
High-High (High income, High poverty)	Oranges	Darkest Orange ([7])

# Create plot

ggplot(data_wide) +
  geom_sf(aes(fill = category), 
          color = "white", 
          size = 0.2) +
  scale_fill_manual(values = bivariate_palette, 
                    name = "Income-Poverty Scale") +  # Add legend
  theme_void() +
  labs(
    title = "Income & Poverty Rate by U.S. State (2023)",
    subtitle = "Bivariate Choropleth Map",
    caption = "Source: U.S. Census Bureau ACS 2023"
  )

Using Census data with other data sources

Census data can be used alongside other political data sources (such as ANES, election data, or even your own survey or qualitative data)!

Let’s use the income-poverty data and compare it to red vs blue states.

First let’s make a data frame with the states and their respective 2024 vote

# Using data from Politico on 2024 Presidential election

#red_states ~ "#fb5c44"
#blue_states ~ "#336fba"

state_colors <- c("Idaho" = "#fb5c44",
                "Montana" = "#fb5c44",
                "North Dakota" = "#fb5c44",
                "South Dakota" = "#fb5c44",
                "Wyoming" = "#fb5c44",
                "Utah" = "#fb5c44",
                "Nebraska" = "#fb5c44",
                "Kansas" = "#fb5c44",
                "Oklahoma" = "#fb5c44",
                "Texas" = "#fb5c44",
                "Louisiana" = "#fb5c44",
                "Mississippi" = "#fb5c44",
                "Alabama" = "#fb5c44",
                "Iowa" = "#fb5c44",
                "Missouri" = "#fb5c44",
                "Arkansas" = "#fb5c44",
                "Tennessee" = "#fb5c44",
                "Kentucky" = "#fb5c44",
                "Indiana" = "#fb5c44",
                "Ohio" = "#fb5c44",
                "Nevada" = "#fb5c44",
                "Arizona" = "#fb5c44",
                "Pennsylvania" = "#fb5c44",
                "Georgia" = "#fb5c44",
                "Michigan" = "#fb5c44",
                "Wisconsin" = "#fb5c44",
                "West Virginia" = "#fb5c44",
                "North Carolina" = "#fb5c44",
                "South Carolina" = "#fb5c44",
                "Florida" = "#fb5c44",
                "Alaska" = "#fb5c44",
                "Washington" = "#336fba",
                 "Oregon" = "#336fba",
                 "California" = "#336fba",
                 "Colorado" = "#336fba",
                 "New Mexico" = "#336fba",
                 "Minnesota" = "#336fba",
                 "Massachusetts" = "#336fba",
                 "Illinois" = "#336fba",
                 "Virginia" = "#336fba",
                 "Maryland" = "#336fba",
                 "District Of Columbia" = "#336fba",
                 "Connecticut" = "#336fba",
                 "Rhode Island" = "#336fba",
                 "Delaware" = "#336fba",
                 "Maine" = "#336fba",
                 "New York" = "#336fba",
                 "New Jersey" = "#336fba",
                 "Hawaii" = "#336fba",
                 "New Hampshire" = "#336fba",
                 "Vermont" = "#336fba"
                 )

color_vote <- as.data.frame(state_colors)

color_vote <- tibble::rownames_to_column(color_vote, "state")

Next, let’s modify the data set to sort by top poverty rate and income, respectively, and make two different plots

# Let's plot poverty rate and income
poverty_colors <- 
  # let's join the data sets while also sorting by top poverty rate
  data_wide %>% 
  inner_join(color_vote, by = "state") %>% 
  filter(state != "Puerto Rico") %>% 
  mutate_if(is.numeric, round, 2) %>% 
  top_n(10, poverty_rate) %>% 
  # Now we can plot
  ggplot(aes(x = reorder(state, poverty_rate), 
               y = poverty_rate,
             fill = state_colors)) +
  geom_col() +
  scale_fill_manual(
    name = "2024 Presidential Election Result",
    labels = c("Blue", "Red"),
    values = c("#336fba","#fb5c44"),
    guide = "none"
  ) +
  coord_flip() +
  geom_text(aes(label = poverty_rate), 
            color = "white",
            hjust = 1.1,
            size = 3) +
  theme_bw()  +
  labs(
    y = "Poverty Rate",
    title = 'Top 10 states by poverty rate\n& 2024 Presidential vote'
  ) +
  theme(
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank())

income_colors <-  
  # let's join the data sets while also sorting by top income
  data_wide %>% 
  inner_join(color_vote, by = "state") %>% 
  filter(state != "Puerto Rico") %>% 
  mutate_if(is.numeric, round, 2) %>% 
  top_n(10, income) %>% 
  # Now we can plot
  ggplot(aes(x = reorder(state, income), 
               y = income,
             fill = state_colors)) +
  geom_col() +
  scale_fill_manual(
    name = "2024 Presidential Election Result",
    labels = c("Blue", "Red"),
    values = c("#336fba","#fb5c44"),
    guide = "none"
  ) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip() +
  geom_text(aes(label = scales::comma(income)), 
            color = "white",
            hjust = 1.1,
            size = 3) +
  theme_bw()  +
  labs(
    y = "Median Household Income",
    title = 'Top 10 states by income\n& 2024 Presidential vote',
    caption = "\nU.S. Census Bureau 2023 ACS 5-Year Population Estimates \n Politico 2024 Election Mapping."
  ) +
  theme(
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank())

Finally, let’s view these plots side by side and see if there are interesting conclusions we can make about the interactions between poverty/income and 2024 vote choice.

# Install patchwork to view these side by side
# install.packages('patchwork')
library(patchwork)

poverty_colors + income_colors

YOUR TURN 🙌 : Using the color_vote data from the 2024 election, find another variable you are interested in to visualize along with vote choice.

Footnotes

Source: Analyzing US Census Data: Methods, Maps, and Models in R by Kyle Walker↩︎

--- title: "R Workshop: Geographic and Demographic Data Analysis and Visualization" author: "Michelle Bueno Vásquez" date: today format: html: toc: true embed-resources: true code-tools: true link-external-newwindow: true fig-dpi: 300 execute: echo: true warning: false message: false cache: TRUE from: markdown+emoji --- # Welcome! :world_map: ::: callout-tip ## Setting up a Census API Key Before we get started, I will have you all go ahead and request a Census API key which will allow us to access Census data for our analyses. It takes about 2-5 minutes to send, so let's get started with this while we chat a little about spatial data :blush: ::: # Step-by-Step Instructions for Getting a Census API Key for `tidycensus` The U.S. Census Bureau provides an API that allows users to access various datasets, including the American Community Survey (ACS) and the Decennial Census. To use `tidycensus` in R, you'll need to obtain and register an API key. ### Request a Census API Key 1. [**Go to the U.S. Census API Key Request Page:**](https://api.census.gov/data/key_signup.html) 2. **Fill out the form:** - Enter your name - Enter your email address - Agree to the terms of service 3. **Submit the request.**\ You will receive an API key via email (a long alphanumeric string). ------------------------------------------------------------------------ # Introduction to Geospatial Data in R ## **What is Geospatial Data?** *Geospatial data* refers to any data that has a geographic component, meaning it is tied to specific locations on the Earth's surface. This data is used in mapping, spatial analysis, and geographic visualization. There are two primary types of geospatial data:\ 1. **Vector Data**: Represents geographic features as points, lines, or polygons.\ Examples:\ - **Points**: Locations of schools, stores, or crime incidents\ - **Lines**: Roads, rivers, or flight paths\ - **Polygons**: State boundaries, land parcels, Census tracts 2. **Raster Data**: Represents spatial data as a grid of pixels, commonly used for continuous data like elevation, temperature, and satellite imagery. - Example: A satellite image of land cover, where each pixel represents vegetation, water, or urban areas. ## **Common Geospatial Data Formats** Different formats are used to store and exchange geospatial data. Some of the most common include: - **Shapefiles (.shp)** -- A widely used vector format consisting of multiple files that store geometry and attribute data.\ - **GeoJSON (.geojson)** -- A JSON-based format for encoding spatial data, commonly used for web mapping.\ - **KML (.kml)** -- A format developed by Google for geographic visualization in Google Earth and Maps.\ - **Census Data (via API or TIGER/Line Shapefiles)** -- Demographic and boundary data provided by the U.S. Census Bureau. # Overview of Key R Packages for Geospatial Data R provides powerful packages for working with geospatial data. Here are the key ones we'll use in this workshop: ### `sf` (Simple Features) - Provides a modern approach to handling vector geospatial data in R. - Replaces older packages like `sp` and `rgdal`. - Supports reading, writing, and manipulating spatial data. ```{r} # Install the package `sf` # install.packages("sf") # Load the package library(sf) ``` ### `tigris` - Retrieves geographic boundary data from the U.S. Census Bureau (e.g., states, counties, tracts). - Works with `sf` for mapping and spatial analysis. **Example:** Download state boundaries ```{r} # install.packages('tigris') library(tigris) # We can download boundary data for the US states states_sf <- states(cb = TRUE, resolution = "20m", progress_bar = FALSE) ``` ### `maps` - Provides simple built-in maps for U.S. states, counties, and world boundaries. - Useful for quick visualizations. **Example:** Plot a basic map of the U.S. ```{r} # Install 'maps' # install.packages('maps') # Load maps alone with tidyverse to access ggplot library(tidyverse) library(maps) # Download U.S. map data us_states <- map_data("state") # Plot the U.S. map ggplot(us_states, aes(long, lat, group = group)) + geom_polygon(fill = "lavender", color = "black") + theme_void() ``` This is great, but we're missing Alaska, Hawaii, Puerto Rico, etc. We can use `tigris` instead for the same purposes. ### `ggplot2`'s `geom_sf` for Geospatial Visualization - `geom_sf()` allows for powerful mapping of spatial data. - Enables custom styling and integration with non-spatial data. ```{r} # Let's plot this using the `sf` object from tigris ggplot(states_sf) + geom_sf(fill = "lavender", color = "black") ``` But notice how our map looks very tiny and spread out. :scream:: That is because the shape file is using raw longitude and latitude. Since this plot is very hard to read, we can use a transformation to shift the non-continental U.S. territories near the rest of the U.S.: ```{r} # We'll use the `tigris` function `shift_geometry` when saving the states file states_sf <- states(cb = TRUE, resolution = "20m") %>% shift_geometry() # Now we can try plotting again: ggplot(states_sf) + geom_sf(fill = "lavender", color = "black") ``` I like to add `theme_void` to get rid of those coordinate lines: ```{r} # We'll use the `tigris` function `shift_geometry` when saving the states file states_sf <- states(cb = TRUE, resolution = "20m") %>% shift_geometry() # Now we can try plotting again: ggplot(states_sf) + geom_sf(fill = "lavender", color = "black") + theme_void() ``` ## **Summary** - **Geospatial data** is any data with a geographic component, stored in vector (points, lines, polygons) or raster (grids) formats. - **Common formats** include Shapefiles, GeoJSON, KML, and Census data. - **Key R packages**: - `sf` for spatial data manipulation. - `tidycensus` for retrieving Census data. - `tigris` for downloading geographic boundaries. - `maps` for built-in map data. - `ggplot2` for visualization. ## Exercise: plot your home state and city I'll do the same for Illinois and Chicago ```{r} # get the data for IL counties # get data il_counties <- map_data("county", "illinois") %>% select(lon = long, lat, group, id = subregion) # Chicago and Evanston lon and lat (looked it up on Google) # Chicago: 41.8781° N, -87.6298° W # Evanston: 42.0521° N, -87.6848° W il_city_data <- tibble( city_name = c("Chicago", "Evanston"), lon = c(-87.6298, -87.6848), lat = c(41.8781, 42.0521) ) # Make map with cities ggplot(il_counties, aes(lon, lat)) + geom_polygon(aes(group = group), fill = "slateblue", col = 'grey50') + ggstar::geom_star(data = il_city_data, size = 4, fill = "gold", mapping = aes(fill = city_name)) + geom_label(data = il_city_data, aes(label = city_name), nudge_x = c(-.5, -.8), nudge_y = c(-.3, 0), size = 3.2) + coord_quickmap() + theme_void() + ggtitle("Illinois: the Lincoln State") ``` ## YOUR TURN :raised_hands: : ## Using my code above as a guide, plot your home state/any state of your choosing and its capital and/or your home town. ------------------------------------------------------------------------ # Using tidycensus for Census Data Retrieval and Visualization ## What is tidycensus? The tidycensus package is an R interface to the U.S. Census Bureau API. It is useful for mapping, analyzing, and visualizing demographic data in R. It allows users to retrieve demographic and socioeconomic data from: - The American Community Survey (ACS) (yearly estimates) - The Decennial Census (10-year population counts) - Other Census datasets Here are a few of the commands you can use to access Census Bureau products: - `get_acs()`, which requests data from the 1-year and 5-year **American Community Survey** samples. Data are available from the 1-year ACS back to 2005 and the 5-year ACS back to 2005-2009. - `get_estimates()`, an interface to the **Population Estimates** APIs. These datasets include yearly estimates of population characteristics by state, county, and metropolitan area, along with components of change demographic estimates like births, deaths, and migration rates. - `get_flows()`, an interface to the ACS **Migration Flows** APIs. Includes information on in- and out-flows from various geographies for the 5-year ACS samples, enabling origin-destination analyses. - `get_pums()`, which accesses data from the ACS **Public Use Microdata** Sample APIs. These samples include anonymized individual-level records from the ACS organized by household and are highly useful for many different social science analyses.[^1] [^1]: Source: *Analyzing US Census Data: Methods, Maps, and Models in R* by Kyle Walker ## Install and Load `tidycensus` in R You should have received a Census API key to the email you entered earlier. ```{r} # Install the `tidycensus` # install.packages("tidycensus") # Load the package library(tidycensus) ### Register API key ## Option 1: temporary, save it only for this session ## (delete # in front of the following line when ready to run code) # census_api_key("YOUR_API_KEY_HERE") # Option 2 (my recommendation): Save the API Key for Future Use ## to avoid entering the key every time, store it permanently: ## (delete # in front of the following line when ready to run code) # census_api_key("YOUR_API_KEY_HERE", install = TRUE) ``` Replace `"YOUR_API_KEY_HERE"` with the key you received. Option 2 will save the key to your `.Renviron` file, so it loads automatically in future R sessions. Restart R for the change to take effect. ## Getting ACS Data **The ACS** provides yearly estimates on income, education, race, and other demographics. Use `get_acs()` to retrieve ACS data. - Example: Get median household income by state for 2024 ### Let's get some ACS data and test the API Key To verify that the key is working, try fetching some ACS (American Community Survey) data. First, let's load the variables in the ACS for the year we'd like to gather data from. We use the call `load_variables` We can also find them listed on the [Census website](https://api.census.gov/data/2023/acs/acs5/profile/variables.html) ```{r} # Load variable list vars23 <- load_variables(2023, "acs5", cache = TRUE) # We can take a look and search for variables of interest view(vars23) ``` But as you can see, it can be overwhelming to search. Here is a trick to quicken our search: ```{r} # Let's filter all variables with 'median household income' in the label column income_vars <- vars23 %>% filter(grepl('median household income', label, ignore.case = TRUE)) view(income_vars) ``` The variable `B19013_001` with label `Estimate!!Median household income in the past 12 months (in 2023 inflation-adjusted dollars)` and concept `Median Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars)` looks perfect for our needs. Now that we know our variable of interest, let's define our ACS call by state for 2023: ```{r} # Using the get_acs function, we can get Census data for specific geogprahies and specific variables df <- get_acs(geography = "state", variables = "B19013_001", # Median Household Income year = 2023, progress_bar = FALSE) # Let's take a sneak peak of our data frame head(df) ``` ::: call-out.tip If the data loads without errors, your API key is successfully set up! ::: The dataset includes: - GEOID (geographic ID) - NAME (state name) - estimate (median income value) - moe (margin of error) ## Combining Census Data with Geospatial Data We can skip a step by requesting geographic data simultaneously with Census data, using the argument `geometry = TRUE` ```{r} income_sf <- get_acs(geography = "state", variables = c(median_income = "B19013_001"), year = 2023, geometry = TRUE, progress_bar = FALSE) %>% # Get spatial data shift_geometry() # Don't forget or else we get a weird map! ``` Now we can create different types of visualizations ```{r} # Install viridis color package # install.packages('viridis') library(viridis) # Plot ggplot(income_sf) + geom_sf(aes(fill = estimate), color = "white") + scale_fill_viridis_c(option = "magma", name = "Median Income") + theme_void() + labs(title = "Median Household Income by State (2023)", caption = "Source: U.S. Census Bureau ACS 2023\n*note: Puerto Rico not included") ``` Key Features of this Plot: - `aes(fill = estimate)` colors states by income - `scale_fill_viridis_c()` improves color contrast - `theme_minimal()` gives a clean look Okay, the map looks good but that scaling on the legend is kind of ugly. Let's fix that: ```{r} # Install and load the 'scales' library # install.packages('scales') library(scales) # Plot with pretty numbers options(scipen = 999) # this will prevent the plots from producing a default scientific notation ggplot(income_sf) + geom_sf(aes(fill = estimate), color = "white") + scale_fill_viridis_c(option = "magma", name = "Median Income", labels = scales::comma) + # this will add a comma for thousands theme_void() + labs(title = "Median Household Income by State (2023)", caption = "Source: U.S. Census Bureau ACS 2023\n*note: Puerto Rico not included") ``` ## Top states by income ```{r} top_income <- income_sf %>% arrange(desc(estimate)) %>% slice(1:10) ggplot(top_income, aes(x = reorder(NAME, estimate), y = estimate)) + geom_col(fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + scale_y_continuous(labels = scales::comma) + # same scale trick, but note we use `scale_y_continuous` since this time we are targeting the x-axis labs(title = "Top 10 States by Median Income (2023)", x = "State", y = "Median Income", caption = "Source: U.S. Census Bureau ACS 2023") ``` ### We can also use scatter plots along with maps to display population For this, gathering population per county and mapping it onto states may be best ```{r} # Get county-level population data county_pop <- get_acs( geography = "county", variables = "B01003_001", # Total population variable year = 2023, geometry = TRUE, progress_bar = FALSE ) %>% shift_geometry() # Extract centroid coordinates for plotting as points county_pop <- county_pop %>% mutate( lon = st_coordinates(st_centroid(geometry))[,1], # Longitude lat = st_coordinates(st_centroid(geometry))[,2] # Latitude ) # filter(estimate >= 50000) # only keep counties with more than 50k people # Check data structure head(county_pop) ``` Aggregate Population Data to State Level ```{r} state_pop <- county_pop %>% group_by(GEOID = substr(GEOID, 1, 2)) %>% # Use first 2 digits of GEOID (State FIPS) summarise(total_population = sum(estimate, na.rm = TRUE)) # Load state geometries states_sf <- get_acs( geography = "state", variables = "B01003_001", # Total population variable year = 2023, geometry = TRUE, progress_bar = FALSE ) %>% shift_geometry() ``` Now we plot: ```{r} ggplot() + # Draw state boundaries geom_sf(data = states_sf, fill = "gray90", color = "white", size = 0.3) + # Overlay county population as proportional dots using geom_point() geom_point(data = county_pop, aes(x = lon, y = lat, size = estimate), color = "coral", alpha = 0.2) + # Scale dot sizes to be visually proportional scale_size_continuous( range = c(1, 10), # Adjust min/max dot sizes labels = comma, # Format population labels with commas name = "County Population" ) + # Titles and theme theme_void() + labs( title = "County Population Distribution in the U.S.", subtitle = "Larger circles represent higher county populations", caption = "Source: U.S. Census Bureau ACS 2022" ) ``` We can also adjust color to the population size: ```{r} # Plot ggplot() + # Draw state boundaries geom_sf(data = states_sf, fill = "gray70", color = "white", size = 0.3) + # Overlay county population as proportional dots, with adjusted transparency geom_point(data = county_pop, aes(x = lon, y = lat, size = estimate, color = estimate, alpha = estimate)) + # Adjust alpha (transparency): Larger values = more opaque scale_alpha_continuous(range = c(0.4, 1), guide = "none") + # Avoid cluttering the legend # Unified legend for size and color (ordered ascending) scale_size_area( max_size = 10, # Adjust max dot size labels = comma, # Format population labels with commas name = "County Population", guide = guide_legend(order = 1) # Ensure size legend comes first ) + scale_color_viridis_c( #option = "mako", # Color scheme (magma = best for ordered data) labels = comma, # Format labels direction = -1, # reverse so darker = larger population name = "County Population", guide = guide_colorbar(order = 2) # Ensure color legend comes second ) + # Titles and theme theme_void() + labs( title = "County Population Distribution (50K+ Population)", subtitle = "Larger and darker circles represent more populous counties (more opaque)", caption = "Source: U.S. Census Bureau ACS 2022" ) ``` ## YOUR TURN :raised_hands: : ## Choose another variable of interest and gather Census ACS data & plot it using the above code as a guide ### Common ACS Variables Here are a few frequently used variables: | Variable Name | Description | |:--------------|:-------------------------------| | B19013_001 | Median Household Income | | B17001_002 | Population Below Poverty Level | | B25077_001 | Median Home Value | | B01003_001 | Total Population | | B03002_003 | Non-Hispanic White Population | | B03002_004 | Black Population | | B03002_012 | Hispanic or Latino Population | Here are other examples of useful variables: 1. Demographic Analysis (Population by Race, Ethnicity, Age, Gender) | Variable Name | Description | |:--------------|:-------------------------------------| | B01003_001 | Total Population | | B01002_001 | Median Age | | B01001_002 | Male Population | | B01001_026 | Female Population | | B03002_003 | Non-Hispanic White Population | | B03002_004 | Black or African American Population | | B03002_012 | Hispanic or Latino Population | | B03002_006 | Asian Population | : :small_orange_diamond: Use Case: Population distributions, racial/ethnic composition, gender breakdowns. 2. Socioeconomic Status (SES) & Income Inequality | Variable Name | Description | |:-------------------------|:-------------------------------| | B19013_001 | Median Household Income | | B19001_002 -- B19001_017 | Household Income Brackets | | B17001_002 | Population Below Poverty Level | | B19301_001 | Per Capita Income | | B25077_001 | Median Home Value | | B25064_001 | Median Gross Rent | : :small_orange_diamond: Use Case: Economic disparities, income inequality, housing affordability. 3. Education & Employment | Variable Name | Description | |:--------------|:----------------------------| | B15003_022 | Bachelor's Degree or Higher | | B15003_017 | High School Graduates | | B23025_002 | Civilian Labor Force | | B23025_005 | Unemployment Count | | B23025_007 | Unemployment Rate | : :small_orange_diamond: Use Case: Workforce analysis, educational attainment, economic mobility. 4. Housing & Urban Development | Variable Name | Description | |:--------------|:----------------------------| | B25077_001 | Median Home Value | | B25064_001 | Median Gross Rent | | B25034_001 | Housing Units by Year Built | | B25002_003 | Vacant Housing Units | | B25024_002 | Single-Family Homes | | B25024_010 | Mobile Homes | : :small_orange_diamond: Use Case: Gentrification studies, housing affordability, displacement trends. 5. Health & Disability | Variable Name | Description | |:--------------|:------------------------------------------| | B18101_001 | Population with a Disability | | B27010_001 | Population with Health Insurance Coverage | | B27010_017 | Population Uninsured | | B25099_001 | Households with No Plumbing Facilities | : :small_orange_diamond: Use Case: Healthcare access, disability prevalence, environmental health. ## Using state bins This is another approach for data visualization of state data where we ditch the state's natural shape to emphasize comparisons among them. ```{r} # Install and load 'statebins' # install.packages('statebins') library(statebins) # Using the income dataframe from earlier, let's add income levels US_income <- mutate( income_sf, income_bins = cut( ifelse(is.na(estimate), 25000, estimate), breaks = c(0, 40000, 50000, 60000, 70000, 80000, 150000), labels = c("< $40k", "$40k to $50k", "$50k to $60k", "$60k to $70k", "$70k - $80k", " > $80k"), right = FALSE ) ) # Plotting US_income %>% ggplot() + statebins::geom_statebins(aes(geometry = geometry, state = NAME, fill = income_bins)) + scale_fill_viridis_d() + theme_statebins() + labs(title = "Median Household Income by State (2023)", caption = "Source: U.S. Census Bureau ACS 2023. Note: Puerto Rico not included") ``` # Advanced example: using multiple variables at once ```{r} # Let's gather data for income and poverty for states in 2023 data <- get_acs(geography = "state", variables = c(income = "B19013_001", poverty = "B17001_002", population = "B01003_001"), year = 2023, geometry = TRUE, progress_bar = FALSE) %>% shift_geometry() head(data) ``` Notice how our variables are all listed under the variable column. For our analyses, it would be fruitful to have the variables as their own columns. To do this, we will pivot the data long-ways so that each variable is its own column. ```{r} # Transpose the data so each variable has its own column} data_wide <- data %>% select(GEOID, state = NAME, variable, estimate, geometry) %>% # Keep relevant columns distinct(GEOID, state, variable, estimate, .keep_all = TRUE) %>% # Remove duplicate rows pivot_wider( names_from = variable, values_from = estimate ) %>% mutate(poverty_rate = (poverty / population) * 100) # Compute Poverty Rate # Let's look at our new data set: head(data_wide) ``` ```{r} # Install package`RColorBrewer` # install.packages('RColorBrewer') library(RColorBrewer) # For color scales # Define bivariate color scale using ColorBrewer bivariate_palette <- c( "Low-Low" = brewer.pal(9, "PuBuGn")[1], # Light green "Low-Medium" = brewer.pal(9, "PuBuGn")[3], "Low-High" = brewer.pal(9, "PuBuGn")[5], "Medium-Low" = brewer.pal(9, "Purples")[3], "Medium-Medium" = brewer.pal(9, "Purples")[5], "Medium-High" = brewer.pal(9, "Purples")[7], "High-Low" = brewer.pal(9, "Oranges")[3], "High-Medium" = brewer.pal(9, "Oranges")[5], "High-High" = brewer.pal(9, "Oranges")[7] # Dark orange ) # Assign colors to the dataset data_wide <- data_wide %>% mutate( income_cat = cut(income, breaks = quantile(income, probs = seq(0, 1, length.out = 4), na.rm = TRUE), labels = c("Low", "Medium", "High"), include.lowest = TRUE), poverty_cat = cut(poverty_rate, breaks = quantile(poverty_rate, probs = seq(0, 1, length.out = 4), na.rm = TRUE), labels = c("Low", "Medium", "High"), include.lowest = TRUE), category = paste(income_cat, poverty_cat, sep = "-") # Create bivariate categories ) ``` ## How This Works Each state (or region) falls into one of nine possible categories based on: - Income level: Low, Medium, or High - Poverty rate: Low, Medium, or High Each category is assigned a unique color using three different ColorBrewer palettes: - PuBuGn (Blue-Green shades) → Used for Low Income - Purples (Purple shades) → Used for Medium Income - Oranges (Orange shades) → Used for High Income Darker colors indicate higher poverty rates, making the map intuitive: - "Low-Low" (Low income, Low poverty) → Lightest Green - "Low-High" (Low income, High poverty) → Darker Green - "High-Low" (High income, Low poverty) → Light Orange - "High-High" (High income, High poverty) → Darkest Orange ### Breakdown of Color Assignments I chose the colors using the link for the [Brewer Color Palettes](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). | Category (Income-Poverty) | Color Palette Used | Shade | |:----------------------------------------------|:------------------------------|:-----------------------| | Low-Low (Low income, Low poverty) | PuBuGn (Blue-Green) | Lightest Green (\[1\]) | | Low-Medium (Low income, Medium poverty) | PuBuGn | Medium Green (\[3\]) | | Low-High (Low income, High poverty) | PuBuGn | Darker Green (\[5\]) | | Medium-Low (Medium income, Low poverty) | Purples | Light Purple (\[3\]) | | Medium-Medium (Medium income, Medium poverty) | Purples | Medium Purple (\[5\]) | | Medium-High (Medium income, High poverty) | Purples Darker Purple (\[7\]) | | | High-Low (High income, Low poverty) | Oranges | Light Orange (\[3\]) | | High-Medium (High income, Medium poverty) | Oranges | Medium Orange (\[5\]) | | High-High (High income, High poverty) | Oranges | Darkest Orange (\[7\]) | ```{r} # Create plot ggplot(data_wide) + geom_sf(aes(fill = category), color = "white", size = 0.2) + scale_fill_manual(values = bivariate_palette, name = "Income-Poverty Scale") + # Add legend theme_void() + labs( title = "Income & Poverty Rate by U.S. State (2023)", subtitle = "Bivariate Choropleth Map", caption = "Source: U.S. Census Bureau ACS 2023" ) ``` # Using Census data with other data sources Census data can be used alongside other political data sources (such as ANES, election data, or even your own survey or qualitative data)! ## Let's use the income-poverty data and compare it to red vs blue states. First let's make a data frame with the states and their respective 2024 vote ```{r} # Using data from Politico on 2024 Presidential election #red_states ~ "#fb5c44" #blue_states ~ "#336fba" state_colors <- c("Idaho" = "#fb5c44", "Montana" = "#fb5c44", "North Dakota" = "#fb5c44", "South Dakota" = "#fb5c44", "Wyoming" = "#fb5c44", "Utah" = "#fb5c44", "Nebraska" = "#fb5c44", "Kansas" = "#fb5c44", "Oklahoma" = "#fb5c44", "Texas" = "#fb5c44", "Louisiana" = "#fb5c44", "Mississippi" = "#fb5c44", "Alabama" = "#fb5c44", "Iowa" = "#fb5c44", "Missouri" = "#fb5c44", "Arkansas" = "#fb5c44", "Tennessee" = "#fb5c44", "Kentucky" = "#fb5c44", "Indiana" = "#fb5c44", "Ohio" = "#fb5c44", "Nevada" = "#fb5c44", "Arizona" = "#fb5c44", "Pennsylvania" = "#fb5c44", "Georgia" = "#fb5c44", "Michigan" = "#fb5c44", "Wisconsin" = "#fb5c44", "West Virginia" = "#fb5c44", "North Carolina" = "#fb5c44", "South Carolina" = "#fb5c44", "Florida" = "#fb5c44", "Alaska" = "#fb5c44", "Washington" = "#336fba", "Oregon" = "#336fba", "California" = "#336fba", "Colorado" = "#336fba", "New Mexico" = "#336fba", "Minnesota" = "#336fba", "Massachusetts" = "#336fba", "Illinois" = "#336fba", "Virginia" = "#336fba", "Maryland" = "#336fba", "District Of Columbia" = "#336fba", "Connecticut" = "#336fba", "Rhode Island" = "#336fba", "Delaware" = "#336fba", "Maine" = "#336fba", "New York" = "#336fba", "New Jersey" = "#336fba", "Hawaii" = "#336fba", "New Hampshire" = "#336fba", "Vermont" = "#336fba" ) color_vote <- as.data.frame(state_colors) color_vote <- tibble::rownames_to_column(color_vote, "state") ``` Next, let's modify the data set to sort by top poverty rate and income, respectively, and make two different plots ```{r} # Let's plot poverty rate and income poverty_colors <- # let's join the data sets while also sorting by top poverty rate data_wide %>% inner_join(color_vote, by = "state") %>% filter(state != "Puerto Rico") %>% mutate_if(is.numeric, round, 2) %>% top_n(10, poverty_rate) %>% # Now we can plot ggplot(aes(x = reorder(state, poverty_rate), y = poverty_rate, fill = state_colors)) + geom_col() + scale_fill_manual( name = "2024 Presidential Election Result", labels = c("Blue", "Red"), values = c("#336fba","#fb5c44"), guide = "none" ) + coord_flip() + geom_text(aes(label = poverty_rate), color = "white", hjust = 1.1, size = 3) + theme_bw() + labs( y = "Poverty Rate", title = 'Top 10 states by poverty rate\n& 2024 Presidential vote' ) + theme( axis.title.y = element_blank(), axis.ticks.x = element_blank()) income_colors <- # let's join the data sets while also sorting by top income data_wide %>% inner_join(color_vote, by = "state") %>% filter(state != "Puerto Rico") %>% mutate_if(is.numeric, round, 2) %>% top_n(10, income) %>% # Now we can plot ggplot(aes(x = reorder(state, income), y = income, fill = state_colors)) + geom_col() + scale_fill_manual( name = "2024 Presidential Election Result", labels = c("Blue", "Red"), values = c("#336fba","#fb5c44"), guide = "none" ) + scale_y_continuous(labels = scales::comma) + coord_flip() + geom_text(aes(label = scales::comma(income)), color = "white", hjust = 1.1, size = 3) + theme_bw() + labs( y = "Median Household Income", title = 'Top 10 states by income\n& 2024 Presidential vote', caption = "\nU.S. Census Bureau 2023 ACS 5-Year Population Estimates \n Politico 2024 Election Mapping." ) + theme( axis.title.y = element_blank(), axis.ticks.x = element_blank()) ``` Finally, let's view these plots side by side and see if there are interesting conclusions we can make about the interactions between poverty/income and 2024 vote choice. ```{r} # Install patchwork to view these side by side # install.packages('patchwork') library(patchwork) poverty_colors + income_colors ``` ## YOUR TURN :raised_hands: : Using the color_vote data from the 2024 election, find another variable you are interested in to visualize along with vote choice.

Welcome! 🗺️

Step-by-Step Instructions for Getting a Census API Key for tidycensus

Request a Census API Key

Introduction to Geospatial Data in R

What is Geospatial Data?

Common Geospatial Data Formats

Overview of Key R Packages for Geospatial Data

sf (Simple Features)

tigris

maps

ggplot2’s geom_sf for Geospatial Visualization

Summary

Exercise: plot your home state and city

YOUR TURN 🙌 :

Using my code above as a guide, plot your home state/any state of your choosing and its capital and/or your home town.

Using tidycensus for Census Data Retrieval and Visualization

What is tidycensus?

Install and Load tidycensus in R

Getting ACS Data

Let’s get some ACS data and test the API Key

Combining Census Data with Geospatial Data

Top states by income

We can also use scatter plots along with maps to display population

YOUR TURN 🙌 :

Choose another variable of interest and gather Census ACS data & plot it using the above code as a guide

Common ACS Variables

Using state bins

Advanced example: using multiple variables at once

How This Works

Breakdown of Color Assignments

Using Census data with other data sources

Let’s use the income-poverty data and compare it to red vs blue states.

YOUR TURN 🙌 : Using the color_vote data from the 2024 election, find another variable you are interested in to visualize along with vote choice.

Footnotes

Step-by-Step Instructions for Getting a Census API Key for `tidycensus`

`sf` (Simple Features)

`tigris`

`maps`

`ggplot2`’s `geom_sf` for Geospatial Visualization

Install and Load `tidycensus` in R