R Workshop: Geographic and Demographic Data Analysis and Visualization

Author

Michelle Bueno Vásquez

Published

February 24, 2025

Welcome! 🗺️

Setting up a Census API Key

Before we get started, I will have you all go ahead and request a Census API key which will allow us to access Census data for our analyses. It takes about 2-5 minutes to send, so let’s get started with this while we chat a little about spatial data 😊

Step-by-Step Instructions for Getting a Census API Key for tidycensus

The U.S. Census Bureau provides an API that allows users to access various datasets, including the American Community Survey (ACS) and the Decennial Census. To use tidycensus in R, you’ll need to obtain and register an API key.

Request a Census API Key

  1. Go to the U.S. Census API Key Request Page:

  2. Fill out the form:

    • Enter your name
    • Enter your email address
    • Agree to the terms of service
  3. Submit the request.
    You will receive an API key via email (a long alphanumeric string).


Introduction to Geospatial Data in R

What is Geospatial Data?

Geospatial data refers to any data that has a geographic component, meaning it is tied to specific locations on the Earth’s surface. This data is used in mapping, spatial analysis, and geographic visualization.

There are two primary types of geospatial data:
1. Vector Data: Represents geographic features as points, lines, or polygons.
Examples:
- Points: Locations of schools, stores, or crime incidents
- Lines: Roads, rivers, or flight paths
- Polygons: State boundaries, land parcels, Census tracts

  1. Raster Data: Represents spatial data as a grid of pixels, commonly used for continuous data like elevation, temperature, and satellite imagery.
    • Example: A satellite image of land cover, where each pixel represents vegetation, water, or urban areas.

Common Geospatial Data Formats

Different formats are used to store and exchange geospatial data. Some of the most common include:

  • Shapefiles (.shp) – A widely used vector format consisting of multiple files that store geometry and attribute data.
  • GeoJSON (.geojson) – A JSON-based format for encoding spatial data, commonly used for web mapping.
  • KML (.kml) – A format developed by Google for geographic visualization in Google Earth and Maps.
  • Census Data (via API or TIGER/Line Shapefiles) – Demographic and boundary data provided by the U.S. Census Bureau.

Overview of Key R Packages for Geospatial Data

R provides powerful packages for working with geospatial data. Here are the key ones we’ll use in this workshop:

sf (Simple Features)

  • Provides a modern approach to handling vector geospatial data in R.

  • Replaces older packages like sp and rgdal.

  • Supports reading, writing, and manipulating spatial data.

# Install the package `sf`
# install.packages("sf")

# Load the package
library(sf)

tigris

  • Retrieves geographic boundary data from the U.S. Census Bureau (e.g., states, counties, tracts).

  • Works with sf for mapping and spatial analysis.

Example: Download state boundaries

# install.packages('tigris')
library(tigris)

# We can download boundary data for the US states
states_sf <- states(cb = TRUE,
                    resolution = "20m",
                    progress_bar = FALSE)

maps

  • Provides simple built-in maps for U.S. states, counties, and world boundaries.

  • Useful for quick visualizations.

Example: Plot a basic map of the U.S.

# Install 'maps'
# install.packages('maps')

# Load maps alone with tidyverse to access ggplot
library(tidyverse)
library(maps)

# Download U.S. map data
us_states <- map_data("state")

# Plot the U.S. map
ggplot(us_states, 
       aes(long, lat, 
           group = group)) + 
  geom_polygon(fill = "lavender", 
               color = "black") +
  theme_void()

This is great, but we’re missing Alaska, Hawaii, Puerto Rico, etc. We can use tigris instead for the same purposes.

ggplot2’s geom_sf for Geospatial Visualization

  • geom_sf() allows for powerful mapping of spatial data.
  • Enables custom styling and integration with non-spatial data.
# Let's plot this using the `sf` object from tigris
ggplot(states_sf) +
    geom_sf(fill = "lavender", color = "black")

But notice how our map looks very tiny and spread out. 😱: That is because the shape file is using raw longitude and latitude. Since this plot is very hard to read, we can use a transformation to shift the non-continental U.S. territories near the rest of the U.S.:

# We'll use the `tigris` function `shift_geometry` when saving the states file
states_sf <- states(cb = TRUE,
                    resolution = "20m") %>% 
  shift_geometry()

# Now we can try plotting again:
ggplot(states_sf) +
    geom_sf(fill = "lavender", 
            color = "black") 

I like to add theme_void to get rid of those coordinate lines:

# We'll use the `tigris` function `shift_geometry` when saving the states file
states_sf <- states(cb = TRUE,
                    resolution = "20m") %>% 
  shift_geometry()

# Now we can try plotting again:
ggplot(states_sf) +
    geom_sf(fill = "lavender", 
            color = "black") +
  theme_void()

Summary

  • Geospatial data is any data with a geographic component, stored in vector (points, lines, polygons) or raster (grids) formats.
  • Common formats include Shapefiles, GeoJSON, KML, and Census data.
  • Key R packages:
    • sf for spatial data manipulation.
    • tidycensus for retrieving Census data.
    • tigris for downloading geographic boundaries.
    • maps for built-in map data.
    • ggplot2 for visualization.

Exercise: plot your home state and city

I’ll do the same for Illinois and Chicago

# get the data for IL counties
# get data
il_counties <- map_data("county", "illinois") %>% 
  select(lon = long, lat, group, id = subregion)


# Chicago and Evanston lon and lat (looked it up on Google)
# Chicago: 41.8781° N, -87.6298° W
# Evanston: 42.0521° N, -87.6848° W
il_city_data <- tibble(
  city_name = c("Chicago", "Evanston"),
  lon = c(-87.6298, -87.6848),
  lat = c(41.8781, 42.0521)
)


# Make map with cities
ggplot(il_counties, aes(lon, lat)) +
  geom_polygon(aes(group = group), 
               fill = "slateblue", 
               col = 'grey50') +
  ggstar::geom_star(data = il_city_data, 
                    size = 4,
                    fill = "gold",
                    mapping = aes(fill = city_name)) +
 geom_label(data = il_city_data, 
            aes(label = city_name), 
            nudge_x = c(-.5, -.8),
            nudge_y = c(-.3, 0),
            size = 3.2) +
  coord_quickmap() +
  theme_void() +
  ggtitle("Illinois: the Lincoln State") 

YOUR TURN 🙌 :

Using my code above as a guide, plot your home state/any state of your choosing and its capital and/or your home town.


Using tidycensus for Census Data Retrieval and Visualization

What is tidycensus?

The tidycensus package is an R interface to the U.S. Census Bureau API. It is useful for mapping, analyzing, and visualizing demographic data in R.

It allows users to retrieve demographic and socioeconomic data from:

  • The American Community Survey (ACS) (yearly estimates)
  • The Decennial Census (10-year population counts)
  • Other Census datasets

Here are a few of the commands you can use to access Census Bureau products:

  • get_acs(), which requests data from the 1-year and 5-year American Community Survey samples. Data are available from the 1-year ACS back to 2005 and the 5-year ACS back to 2005-2009.

  • get_estimates(), an interface to the Population Estimates APIs. These datasets include yearly estimates of population characteristics by state, county, and metropolitan area, along with components of change demographic estimates like births, deaths, and migration rates.

  • get_flows(), an interface to the ACS Migration Flows APIs. Includes information on in- and out-flows from various geographies for the 5-year ACS samples, enabling origin-destination analyses.

  • get_pums(), which accesses data from the ACS Public Use Microdata Sample APIs. These samples include anonymized individual-level records from the ACS organized by household and are highly useful for many different social science analyses.1

Install and Load tidycensus in R

You should have received a Census API key to the email you entered earlier.

# Install the `tidycensus`
# install.packages("tidycensus")

# Load the package
library(tidycensus)

### Register API key 
## Option 1: temporary, save it only for this session 
## (delete # in front of the following line when ready to run code)

# census_api_key("YOUR_API_KEY_HERE")

# Option 2 (my recommendation): Save the API Key for Future Use 
## to avoid entering the key every time, store it permanently: 
## (delete # in front of the following line when ready to run code)

# census_api_key("YOUR_API_KEY_HERE", install = TRUE)

Replace "YOUR_API_KEY_HERE" with the key you received.

Option 2 will save the key to your .Renviron file, so it loads automatically in future R sessions. Restart R for the change to take effect.

Getting ACS Data

The ACS provides yearly estimates on income, education, race, and other demographics. Use get_acs() to retrieve ACS data.

  • Example: Get median household income by state for 2024

Let’s get some ACS data and test the API Key

To verify that the key is working, try fetching some ACS (American Community Survey) data.

First, let’s load the variables in the ACS for the year we’d like to gather data from. We use the call load_variables

We can also find them listed on the Census website

# Load variable list
vars23 <- load_variables(2023, "acs5", cache = TRUE)

# We can take a look and search for variables of interest
view(vars23)

But as you can see, it can be overwhelming to search. Here is a trick to quicken our search:

# Let's filter all variables with 'median household income' in the label column
income_vars <- vars23 %>% 
    filter(grepl('median household income', label, ignore.case = TRUE))

view(income_vars)

The variable B19013_001 with label Estimate!!Median household income in the past 12 months (in 2023 inflation-adjusted dollars) and concept Median Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars) looks perfect for our needs.

Now that we know our variable of interest, let’s define our ACS call by state for 2023:

# Using the get_acs function, we can get Census data for specific geogprahies and specific variables
df <- get_acs(geography = "state", 
              variables = "B19013_001",  # Median Household Income
              year = 2023,
              progress_bar = FALSE)

# Let's take a sneak peak of our data frame
head(df)
# A tibble: 6 × 5
  GEOID NAME       variable   estimate   moe
  <chr> <chr>      <chr>         <dbl> <dbl>
1 01    Alabama    B19013_001    62027   400
2 02    Alaska     B19013_001    89336  1374
3 04    Arizona    B19013_001    76872   414
4 05    Arkansas   B19013_001    58773   503
5 06    California B19013_001    96334   298
6 08    Colorado   B19013_001    92470   483

If the data loads without errors, your API key is successfully set up!

The dataset includes:

  • GEOID (geographic ID)
  • NAME (state name)
  • estimate (median income value)
  • moe (margin of error)

Combining Census Data with Geospatial Data

We can skip a step by requesting geographic data simultaneously with Census data, using the argument geometry = TRUE

income_sf <- get_acs(geography = "state", 
                     variables = c(median_income = "B19013_001"), 
                     year = 2023, 
                     geometry = TRUE,
                     progress_bar = FALSE) %>% # Get spatial data
              shift_geometry()            # Don't forget or else we get a weird map!

Now we can create different types of visualizations

# Install viridis color package
# install.packages('viridis')
library(viridis)

# Plot
ggplot(income_sf) +
  geom_sf(aes(fill = estimate), 
          color = "white") +
  scale_fill_viridis_c(option = "magma", 
                       name = "Median Income") +
  theme_void() +
  labs(title = "Median Household Income by State (2023)",
       caption = "Source: U.S. Census Bureau ACS 2023\n*note: Puerto Rico not included")

Key Features of this Plot:

  • aes(fill = estimate) colors states by income
  • scale_fill_viridis_c() improves color contrast
  • theme_minimal() gives a clean look

Okay, the map looks good but that scaling on the legend is kind of ugly.

Let’s fix that:

# Install and load the 'scales' library
# install.packages('scales')
library(scales)

# Plot with pretty numbers
options(scipen = 999)        # this will prevent the plots from producing a default scientific notation 

ggplot(income_sf) +
  geom_sf(aes(fill = estimate), 
          color = "white") +
  scale_fill_viridis_c(option = "magma", 
                       name = "Median Income",
                       labels = scales::comma) +    # this will add  a comma for thousands
  theme_void() +
  labs(title = "Median Household Income by State (2023)",
       caption = "Source: U.S. Census Bureau ACS 2023\n*note: Puerto Rico not included")

Top states by income

top_income <- income_sf %>%
  arrange(desc(estimate)) %>%
  slice(1:10)

ggplot(top_income, aes(x = reorder(NAME, estimate), y = estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +  # Flip for readability
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +       # same scale trick, but note we use `scale_y_continuous` since this time we are targeting the x-axis
  labs(title = "Top 10 States by Median Income (2023)",
       x = "State", y = "Median Income",
       caption = "Source: U.S. Census Bureau ACS 2023")

We can also use scatter plots along with maps to display population

For this, gathering population per county and mapping it onto states may be best

# Get county-level population data
county_pop <- get_acs(
  geography = "county",
  variables = "B01003_001",  # Total population variable
  year = 2023,
  geometry = TRUE,
  progress_bar = FALSE
) %>% 
  shift_geometry()

# Extract centroid coordinates for plotting as points
county_pop <- county_pop %>%
  mutate(
    lon = st_coordinates(st_centroid(geometry))[,1],  # Longitude
    lat = st_coordinates(st_centroid(geometry))[,2]   # Latitude
  )  
#  filter(estimate >= 50000) # only keep counties with more than 50k people

# Check data structure
head(county_pop)
Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 139720.3 ymin: -780463.1 xmax: 1044933 ymax: -368685.2
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
  GEOID                    NAME   variable estimate moe
1 01003 Baldwin County, Alabama B01003_001   239945  NA
2 01069 Houston County, Alabama B01003_001   107628  NA
3 01005 Barbour County, Alabama B01003_001    24757  NA
4 01119  Sumter County, Alabama B01003_001    12020  NA
5 05091 Miller County, Arkansas B01003_001    42588  NA
6 05133 Sevier County, Arkansas B01003_001    15797  NA
                        geometry       lon       lat
1 MULTIPOLYGON (((765298 -780...  789795.1 -722243.2
2 MULTIPOLYGON (((975268 -651... 1014495.8 -652117.0
3 MULTIPOLYGON (((966818.1 -6...  997016.3 -573843.3
4 MULTIPOLYGON (((709323.3 -5...  727423.5 -519270.6
5 MULTIPOLYGON (((180424 -440...  194967.8 -466481.6
6 MULTIPOLYGON (((139720.4 -3...  161278.0 -390636.1

Aggregate Population Data to State Level

state_pop <- county_pop %>%
  group_by(GEOID = substr(GEOID, 1, 2)) %>%  # Use first 2 digits of GEOID (State FIPS)
  summarise(total_population = sum(estimate, na.rm = TRUE))

# Load state geometries
states_sf <- get_acs(
  geography = "state",
  variables = "B01003_001",  # Total population variable
  year = 2023,
  geometry = TRUE,
  progress_bar = FALSE
) %>% 
  shift_geometry()

Now we plot:

ggplot() +
  # Draw state boundaries
  geom_sf(data = states_sf, fill = "gray90", color = "white", size = 0.3) +

  # Overlay county population as proportional dots using geom_point()
  geom_point(data = county_pop, aes(x = lon, y = lat, size = estimate), 
             color = "coral", alpha = 0.2) +

  # Scale dot sizes to be visually proportional
  scale_size_continuous(
    range = c(1, 10),  # Adjust min/max dot sizes
    labels = comma,  # Format population labels with commas
    name = "County Population"
  ) +

  # Titles and theme
  theme_void() +
  labs(
    title = "County Population Distribution in the U.S.",
    subtitle = "Larger circles represent higher county populations",
    caption = "Source: U.S. Census Bureau ACS 2022"
  ) 

We can also adjust color to the population size:

# Plot
ggplot() +
  # Draw state boundaries
  geom_sf(data = states_sf, 
          fill = "gray70", 
          color = "white", 
          size = 0.3) +

  # Overlay county population as proportional dots, with adjusted transparency
  geom_point(data = county_pop, 
             aes(x = lon, 
                 y = lat, 
                 size = estimate, 
                 color = estimate, 
                 alpha = estimate)) +

  # Adjust alpha (transparency): Larger values = more opaque
  scale_alpha_continuous(range = c(0.4, 1), 
                         guide = "none") +  # Avoid cluttering the legend

  # Unified legend for size and color (ordered ascending)
  scale_size_area(
    max_size = 10,  # Adjust max dot size
    labels = comma,  # Format population labels with commas
    name = "County Population",
    guide = guide_legend(order = 1)  # Ensure size legend comes first
  ) +
  scale_color_viridis_c(
    #option = "mako",  # Color scheme (magma = best for ordered data)
    labels = comma,  # Format labels
    direction = -1,   # reverse so darker = larger population
    name = "County Population",
    guide = guide_colorbar(order = 2)  # Ensure color legend comes second
  ) +

  # Titles and theme
  theme_void() +
  labs(
    title = "County Population Distribution (50K+ Population)",
    subtitle = "Larger and darker circles represent more populous counties (more opaque)",
    caption = "Source: U.S. Census Bureau ACS 2022"
  ) 

YOUR TURN 🙌 :

Choose another variable of interest and gather Census ACS data & plot it using the above code as a guide

Common ACS Variables

Here are a few frequently used variables:

Variable Name Description
B19013_001 Median Household Income
B17001_002 Population Below Poverty Level
B25077_001 Median Home Value
B01003_001 Total Population
B03002_003 Non-Hispanic White Population
B03002_004 Black Population
B03002_012 Hispanic or Latino Population

Here are other examples of useful variables:

  1. Demographic Analysis (Population by Race, Ethnicity, Age, Gender)
🔸 Use Case: Population distributions, racial/ethnic composition, gender breakdowns.
Variable Name Description
B01003_001 Total Population
B01002_001 Median Age
B01001_002 Male Population
B01001_026 Female Population
B03002_003 Non-Hispanic White Population
B03002_004 Black or African American Population
B03002_012 Hispanic or Latino Population
B03002_006 Asian Population
  1. Socioeconomic Status (SES) & Income Inequality
🔸 Use Case: Economic disparities, income inequality, housing affordability.
Variable Name Description
B19013_001 Median Household Income
B19001_002 – B19001_017 Household Income Brackets
B17001_002 Population Below Poverty Level
B19301_001 Per Capita Income
B25077_001 Median Home Value
B25064_001 Median Gross Rent
  1. Education & Employment
🔸 Use Case: Workforce analysis, educational attainment, economic mobility.
Variable Name Description
B15003_022 Bachelor’s Degree or Higher
B15003_017 High School Graduates
B23025_002 Civilian Labor Force
B23025_005 Unemployment Count
B23025_007 Unemployment Rate
  1. Housing & Urban Development
🔸 Use Case: Gentrification studies, housing affordability, displacement trends.
Variable Name Description
B25077_001 Median Home Value
B25064_001 Median Gross Rent
B25034_001 Housing Units by Year Built
B25002_003 Vacant Housing Units
B25024_002 Single-Family Homes
B25024_010 Mobile Homes
  1. Health & Disability
🔸 Use Case: Healthcare access, disability prevalence, environmental health.
Variable Name Description
B18101_001 Population with a Disability
B27010_001 Population with Health Insurance Coverage
B27010_017 Population Uninsured
B25099_001 Households with No Plumbing Facilities

Using state bins

This is another approach for data visualization of state data where we ditch the state’s natural shape to emphasize comparisons among them.

# Install and load 'statebins'
# install.packages('statebins')
library(statebins)


# Using the income dataframe from earlier, let's add income levels
US_income <- mutate(
  income_sf,
  income_bins = cut(
    ifelse(is.na(estimate), 25000, estimate),
    breaks = c(0, 40000, 50000, 60000, 70000, 80000, 150000),
    labels = c("< $40k", "$40k to $50k", 
               "$50k to $60k", "$60k to $70k", "$70k - $80k", " > $80k"),
    right = FALSE
  )
)

# Plotting
US_income %>% 
  ggplot() +
  statebins::geom_statebins(aes(geometry = geometry,
                                state = NAME, 
                                fill = income_bins)) +
  scale_fill_viridis_d() +
  theme_statebins() +
  labs(title = "Median Household Income by State (2023)",
       caption = "Source: U.S. Census Bureau ACS 2023. Note: Puerto Rico not included")

Advanced example: using multiple variables at once

# Let's gather data for income and poverty for states in 2023
data <- get_acs(geography = "state", 
                variables = c(income = "B19013_001", 
                              poverty = "B17001_002",
                              population = "B01003_001"),
                year = 2023,
                geometry = TRUE,
                progress_bar = FALSE) %>% 
  shift_geometry()

head(data)
Simple feature collection with 6 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -1233579 ymin: -614973.5 xmax: -35340.83 ymax: 970427.9
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
  GEOID         NAME   variable estimate  moe                       geometry
1    35   New Mexico population  2114768   NA MULTIPOLYGON (((-1231344 -5...
2    35   New Mexico    poverty   375381 7817 MULTIPOLYGON (((-1231344 -5...
3    35   New Mexico     income    62125  608 MULTIPOLYGON (((-1231344 -5...
4    46 South Dakota population   899194   NA MULTIPOLYGON (((-633765.6 8...
5    46 South Dakota    poverty   104511 3107 MULTIPOLYGON (((-633765.6 8...
6    46 South Dakota     income    72421  805 MULTIPOLYGON (((-633765.6 8...

Notice how our variables are all listed under the variable column. For our analyses, it would be fruitful to have the variables as their own columns.

To do this, we will pivot the data long-ways so that each variable is its own column.

# Transpose the data so each variable has its own column}
data_wide <- data %>%
  select(GEOID,
         state = NAME, 
         variable, 
         estimate, 
         geometry) %>%  # Keep relevant columns
  distinct(GEOID, 
           state, 
           variable, 
           estimate, 
           .keep_all = TRUE) %>%  # Remove duplicate rows
  pivot_wider(
    names_from = variable,
    values_from = estimate
  ) %>% 
  mutate(poverty_rate = (poverty / population) * 100)  # Compute Poverty Rate

# Let's look at our new data set:
head(data_wide)
Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2356114 ymin: -782290.7 xmax: 1419556 ymax: 970427.9
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
# A tibble: 6 × 7
  GEOID state                    geometry population poverty income poverty_rate
  <chr> <chr>          <MULTIPOLYGON [m]>      <dbl>   <dbl>  <dbl>        <dbl>
1 35    New Me… (((-1231344 -588235.8, -…    2114768  375381  62125         17.8
2 46    South … (((-633765.6 865552.6, -…     899194  104511  72421         11.6
3 06    Califo… (((-2066923 -203083.1, -…   39242785 4610600  96334         11.7
4 21    Kentuc… (((584560 -88603.17, 585…    4510725  707480  62417         15.7
5 01    Alabama (((760323.7 -749311.5, 7…    5054253  768185  62027         15.2
6 13    Georgia (((1390722 -584139.2, 13…   10822590 1423159  74664         13.1
# Install package`RColorBrewer`
# install.packages('RColorBrewer')

library(RColorBrewer)  # For color scales

# Define bivariate color scale using ColorBrewer
bivariate_palette <- c(
  "Low-Low" = brewer.pal(9, "PuBuGn")[1],  # Light green
  "Low-Medium" = brewer.pal(9, "PuBuGn")[3],
  "Low-High" = brewer.pal(9, "PuBuGn")[5],
  "Medium-Low" = brewer.pal(9, "Purples")[3],
  "Medium-Medium" = brewer.pal(9, "Purples")[5],
  "Medium-High" = brewer.pal(9, "Purples")[7],
  "High-Low" = brewer.pal(9, "Oranges")[3],
  "High-Medium" = brewer.pal(9, "Oranges")[5],
  "High-High" = brewer.pal(9, "Oranges")[7]   # Dark orange
)


# Assign colors to the dataset
data_wide <- data_wide %>%
  mutate(
    income_cat = cut(income, 
                     breaks = quantile(income, 
                                       probs = seq(0, 1, length.out = 4), 
                                       na.rm = TRUE), 
                     labels = c("Low", "Medium", "High"), 
                     include.lowest = TRUE),
    poverty_cat = cut(poverty_rate, 
                      breaks = quantile(poverty_rate, 
                                        probs = seq(0, 1, length.out = 4), 
                                        na.rm = TRUE), 
                      labels = c("Low", "Medium", "High"), 
                      include.lowest = TRUE),
    category = paste(income_cat, 
                     poverty_cat, 
                     sep = "-")  # Create bivariate categories
  )

How This Works

Each state (or region) falls into one of nine possible categories based on:

  • Income level: Low, Medium, or High
  • Poverty rate: Low, Medium, or High

Each category is assigned a unique color using three different ColorBrewer palettes:

  • PuBuGn (Blue-Green shades) → Used for Low Income
  • Purples (Purple shades) → Used for Medium Income
  • Oranges (Orange shades) → Used for High Income

Darker colors indicate higher poverty rates, making the map intuitive:

  • “Low-Low” (Low income, Low poverty) → Lightest Green
  • “Low-High” (Low income, High poverty) → Darker Green
  • “High-Low” (High income, Low poverty) → Light Orange
  • “High-High” (High income, High poverty) → Darkest Orange

Breakdown of Color Assignments

I chose the colors using the link for the Brewer Color Palettes.

Category (Income-Poverty) Color Palette Used Shade
Low-Low (Low income, Low poverty) PuBuGn (Blue-Green) Lightest Green ([1])
Low-Medium (Low income, Medium poverty) PuBuGn Medium Green ([3])
Low-High (Low income, High poverty) PuBuGn Darker Green ([5])
Medium-Low (Medium income, Low poverty) Purples Light Purple ([3])
Medium-Medium (Medium income, Medium poverty) Purples Medium Purple ([5])
Medium-High (Medium income, High poverty) Purples Darker Purple ([7])
High-Low (High income, Low poverty) Oranges Light Orange ([3])
High-Medium (High income, Medium poverty) Oranges Medium Orange ([5])
High-High (High income, High poverty) Oranges Darkest Orange ([7])
# Create plot

ggplot(data_wide) +
  geom_sf(aes(fill = category), 
          color = "white", 
          size = 0.2) +
  scale_fill_manual(values = bivariate_palette, 
                    name = "Income-Poverty Scale") +  # Add legend
  theme_void() +
  labs(
    title = "Income & Poverty Rate by U.S. State (2023)",
    subtitle = "Bivariate Choropleth Map",
    caption = "Source: U.S. Census Bureau ACS 2023"
  )

Using Census data with other data sources

Census data can be used alongside other political data sources (such as ANES, election data, or even your own survey or qualitative data)!

Let’s use the income-poverty data and compare it to red vs blue states.

First let’s make a data frame with the states and their respective 2024 vote

# Using data from Politico on 2024 Presidential election

#red_states ~ "#fb5c44"
#blue_states ~ "#336fba"

state_colors <- c("Idaho" = "#fb5c44",
                "Montana" = "#fb5c44",
                "North Dakota" = "#fb5c44",
                "South Dakota" = "#fb5c44",
                "Wyoming" = "#fb5c44",
                "Utah" = "#fb5c44",
                "Nebraska" = "#fb5c44",
                "Kansas" = "#fb5c44",
                "Oklahoma" = "#fb5c44",
                "Texas" = "#fb5c44",
                "Louisiana" = "#fb5c44",
                "Mississippi" = "#fb5c44",
                "Alabama" = "#fb5c44",
                "Iowa" = "#fb5c44",
                "Missouri" = "#fb5c44",
                "Arkansas" = "#fb5c44",
                "Tennessee" = "#fb5c44",
                "Kentucky" = "#fb5c44",
                "Indiana" = "#fb5c44",
                "Ohio" = "#fb5c44",
                "Nevada" = "#fb5c44",
                "Arizona" = "#fb5c44",
                "Pennsylvania" = "#fb5c44",
                "Georgia" = "#fb5c44",
                "Michigan" = "#fb5c44",
                "Wisconsin" = "#fb5c44",
                "West Virginia" = "#fb5c44",
                "North Carolina" = "#fb5c44",
                "South Carolina" = "#fb5c44",
                "Florida" = "#fb5c44",
                "Alaska" = "#fb5c44",
                "Washington" = "#336fba",
                 "Oregon" = "#336fba",
                 "California" = "#336fba",
                 "Colorado" = "#336fba",
                 "New Mexico" = "#336fba",
                 "Minnesota" = "#336fba",
                 "Massachusetts" = "#336fba",
                 "Illinois" = "#336fba",
                 "Virginia" = "#336fba",
                 "Maryland" = "#336fba",
                 "District Of Columbia" = "#336fba",
                 "Connecticut" = "#336fba",
                 "Rhode Island" = "#336fba",
                 "Delaware" = "#336fba",
                 "Maine" = "#336fba",
                 "New York" = "#336fba",
                 "New Jersey" = "#336fba",
                 "Hawaii" = "#336fba",
                 "New Hampshire" = "#336fba",
                 "Vermont" = "#336fba"
                 )

color_vote <- as.data.frame(state_colors)

color_vote <- tibble::rownames_to_column(color_vote, "state")

Next, let’s modify the data set to sort by top poverty rate and income, respectively, and make two different plots

# Let's plot poverty rate and income
poverty_colors <- 
  # let's join the data sets while also sorting by top poverty rate
  data_wide %>% 
  inner_join(color_vote, by = "state") %>% 
  filter(state != "Puerto Rico") %>% 
  mutate_if(is.numeric, round, 2) %>% 
  top_n(10, poverty_rate) %>% 
  # Now we can plot
  ggplot(aes(x = reorder(state, poverty_rate), 
               y = poverty_rate,
             fill = state_colors)) +
  geom_col() +
  scale_fill_manual(
    name = "2024 Presidential Election Result",
    labels = c("Blue", "Red"),
    values = c("#336fba","#fb5c44"),
    guide = "none"
  ) +
  coord_flip() +
  geom_text(aes(label = poverty_rate), 
            color = "white",
            hjust = 1.1,
            size = 3) +
  theme_bw()  +
  labs(
    y = "Poverty Rate",
    title = 'Top 10 states by poverty rate\n& 2024 Presidential vote'
  ) +
  theme(
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank())

income_colors <-  
  # let's join the data sets while also sorting by top income
  data_wide %>% 
  inner_join(color_vote, by = "state") %>% 
  filter(state != "Puerto Rico") %>% 
  mutate_if(is.numeric, round, 2) %>% 
  top_n(10, income) %>% 
  # Now we can plot
  ggplot(aes(x = reorder(state, income), 
               y = income,
             fill = state_colors)) +
  geom_col() +
  scale_fill_manual(
    name = "2024 Presidential Election Result",
    labels = c("Blue", "Red"),
    values = c("#336fba","#fb5c44"),
    guide = "none"
  ) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip() +
  geom_text(aes(label = scales::comma(income)), 
            color = "white",
            hjust = 1.1,
            size = 3) +
  theme_bw()  +
  labs(
    y = "Median Household Income",
    title = 'Top 10 states by income\n& 2024 Presidential vote',
    caption = "\nU.S. Census Bureau 2023 ACS 5-Year Population Estimates \n Politico 2024 Election Mapping."
  ) +
  theme(
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank())

Finally, let’s view these plots side by side and see if there are interesting conclusions we can make about the interactions between poverty/income and 2024 vote choice.

# Install patchwork to view these side by side
# install.packages('patchwork')
library(patchwork)

poverty_colors + income_colors

YOUR TURN 🙌 : Using the color_vote data from the 2024 election, find another variable you are interested in to visualize along with vote choice.

Footnotes

  1. Source: Analyzing US Census Data: Methods, Maps, and Models in R by Kyle Walker↩︎