Missing Data

POLI_SCI 403: Probability and Statistics

Agenda

  • Identification

  • Missing data: Stable outcomes framework

  • Approaches to missing outcome data

  • Data example (if enough time)

  • Multiple imputation (lab)

Identification

So far

Statistical inference: From the observed to the unobserved

Needs the i.i.d. assumption to connect point estimates to population quantities

Alternative approach: We want to tell a story about what we observe that holds regardless of what remains unobserved

Examples?

This is the domain of nonparametric identification

Identification

  • A “model” is identifiable if it is theoretically possible to learn the true values of its parameters

  • Example: If we can credibly approximate \(E(X)\) with the sample mean, then we can say \(E(X)\) is identified

  • Validity: An estimator is valid if we believe that it measures what it claims to measure

  • Validity implies low bias and consistency

  • But remember these properties depend on the data!

Point-identification

A statistical functional of a random vector is point-identified by a set of assumptions if there exists one and only one value of that statistical functional that is logically compatible with the CDF of the observable vector given the assumptions.

Missing Data

Should we worry?

Missing data problem

| Unit | $Y_i^*$ | $R_i$ | $Y_i$ |
|------|---------|-------|-------|
| 1    | 1       | 1     | 1     |
| 2    | -99     | 0     | 0     |
| 3    | 1       | 1     | 1     |
| 4    | 0       | 1     | 0     |
| 5    | 1       | 1     | 1     |
| 6    | -99     | 0     | 0     |

Stable outcomes framework

  • \(Y_i\): True outcome (unobserved)

  • \(R_i\): Recorded (1) or missing (0)

  • \(Y_i^*\): Observed outcome

  • \(Y_i^* = Y_i R_i + (-99) (1-R_i)\)

  • \(E[Y_i] = 3/6\); complete-case mean \(E[Y_i^* \mid R_i = 1] = 3/4\)

Challenge: \(E[Y_i]\) is not point-identified
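A minimal base R sketch of the toy table above (vector names are mine; in real data the true `y` is of course unavailable):

```r
# Toy data from the table: true outcome y (unknowable in practice),
# response indicator r, and observed outcome y_star = y*r + (-99)*(1 - r)
y <- c(1, 0, 1, 0, 1, 0)
r <- c(1, 0, 1, 1, 1, 0)
y_star <- y * r + (-99) * (1 - r)

mean(y)               # E[Y_i] = 3/6 = 0.5
mean(y_star[r == 1])  # complete-case mean = 3/4 = 0.75
```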

What can we do?

  1. Make minimal assumptions

  2. Make heroic assumptions

Minimal assumption: \(Y_i\) is bounded

| Unit | $Y_i^*$ | $R_i$ |
|------|---------|-------|
| 1    | 1       | 1     |
| 2    | -99     | 0     |
| 3    | 1       | 1     |
| 4    | 0       | 1     |
| 5    | 1       | 1     |
| 6    | -99     | 0     |

Plug-in worst case scenario to find lower and upper bound

Minimal assumption: \(Y_i\) is bounded

| Unit | $Y_i^*$ | $R_i$ | $Y_i^L$ | $Y_i^U$ |
|------|---------|-------|---------|---------|
| 1    | 1       | 1     | 1       | 1       |
| 2    | -99     | 0     | 0       | 1       |
| 3    | 1       | 1     | 1       | 1       |
| 4    | 0       | 1     | 0       | 0       |
| 5    | 1       | 1     | 1       | 1       |
| 6    | -99     | 0     | 0       | 1       |

Plug-in worst case scenario to find lower and upper bound

Minimal assumption: \(Y_i\) is bounded

| Unit | $Y_i^*$ | $R_i$ | $Y_i^L$ | $Y_i^U$ |
|------|---------|-------|---------|---------|
| 1    | 1       | 1     | 1       | 1       |
| 2    | -99     | 0     | 0       | 1       |
| 3    | 1       | 1     | 1       | 1       |
| 4    | 0       | 1     | 0       | 0       |
| 5    | 1       | 1     | 1       | 1       |
| 6    | -99     | 0     | 0       | 1       |

Sharp bounds: \(\widehat E[Y_i] \in \left[\frac{3}{6}, \frac{5}{6}\right]\)

More formally

You can write the plug-in estimators as weighted averages, where \(a\) and \(b\) are the lower and upper bounds of the support of \(Y_i\):

\[ \textbf{Lower bound: } \frac{1}{n}\sum_{i=1}^n (Y_i^*R_i + a (1-R_i)) \]

\[ \textbf{Upper bound: } \frac{1}{n}\sum_{i=1}^n (Y_i^*R_i + b (1-R_i)) \]

Meaning you do not need to know which observations are missing, just how many!
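As a sketch, the same plug-in bounds in base R on the toy data (with assumed support bounds a = 0, b = 1):

```r
# Plug-in bounds as weighted averages; a and b are the assumed
# lower and upper bounds of the support of Y_i
r      <- c(1, 0, 1, 1, 1, 0)
y_star <- c(1, -99, 1, 0, 1, -99)
a <- 0
b <- 1

# y_star * r zeroes out the -99 placeholders, so only the count
# of missing observations matters, not which ones they are
lower <- mean(y_star * r + a * (1 - r))  # 3/6
upper <- mean(y_star * r + b * (1 - r))  # 5/6
```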

Why sharp bounds?

They are sharp because we cannot improve them without additional assumptions that do not follow from the data

So we say \(E[Y_i]\) is partially identified under these assumptions

The width of the bounds is proportional to the amount of missing data

And sometimes you can use auxiliary information to refine bounds

Challenge: Social scientists don’t know what to do with bounds

More heroic assumptions

Baseline: Missing not at random (MNAR/NMAR)

\[Y_i \not\!\perp\!\!\!\perp R_i\]

MCAR: Missing completely at random

\[ Y_i \perp \!\!\! \perp R_i \]

MAR: Missing at random after conditioning on covariates (aka ignorability)

\[ Y_i \perp \!\!\! \perp R_i | \mathbf{X}_i \]

MCAR

| Unit | $Y_i^*$ | $R_i$ |
|------|---------|-------|
| 1    | 1       | 1     |
| 2    | -99     | 0     |
| 3    | 1       | 1     |
| 4    | 0       | 1     |
| 5    | 1       | 1     |
| 6    | -99     | 0     |

We assume the distribution of outcomes is the same for observed and unobserved

MCAR

| Unit | $Y_i^*$ | $R_i$ |
|------|---------|-------|
| 1    | 1       | 1     |
| 2    | -99     | 0     |
| 3    | 1       | 1     |
| 4    | 0       | 1     |
| 5    | 1       | 1     |
| 6    | -99     | 0     |

Which is equivalent to imputing missing obs with the sample mean

MCAR

| Unit | $Y_i^*$ | $R_i$ | $Y_i$ |
|------|---------|-------|-------|
| 1    | 1       | 1     | 1     |
| 2    | -99     | 0     | $\widehat E[Y_i] = 3/4$ |
| 3    | 1       | 1     | 1     |
| 4    | 0       | 1     | 0     |
| 5    | 1       | 1     | 1     |
| 6    | -99     | 0     | $\widehat E[Y_i] = 3/4$ |

Which is equivalent to imputing missing obs with the sample mean

MAR

| Unit | $Y_i^*$ | $R_i$ |
|------|---------|-------|
| 1    | 1       | 1     |
| 2    | -99     | 0     |
| 3    | 1       | 1     |
| 4    | 0       | 1     |
| 5    | 1       | 1     |
| 6    | -99     | 0     |

We assume conditional independence

MAR

| Unit | $Y_i^*$ | $R_i$ |
|------|---------|-------|
| 1    | 1       | 1     |
| 2    | -99     | 0     |
| 3    | 1       | 1     |
| 4    | 0       | 1     |
| 5    | 1       | 1     |
| 6    | -99     | 0     |

Imagine we now observe covariate \(X_i\)

MAR

| Unit | $Y_i^*$ | $R_i$ | $X_i$ |
|------|---------|-------|-------|
| 1    | 1       | 1     | 0     |
| 2    | -99     | 0     | 0     |
| 3    | 1       | 1     | 0     |
| 4    | 0       | 1     | 0     |
| 5    | 1       | 1     | 1     |
| 6    | -99     | 0     | 1     |

Imagine we now observe covariate \(X_i\)

MAR

| Unit | $Y_i^*$ | $R_i$ | $X_i$ |
|------|---------|-------|-------|
| 1    | 1       | 1     | 0     |
| 2    | -99     | 0     | 0     |
| 3    | 1       | 1     | 0     |
| 4    | 0       | 1     | 0     |
| 5    | 1       | 1     | 1     |
| 6    | -99     | 0     | 1     |

MAR lets us impute the sample mean of units with the same covariate values

MAR

| Unit | $Y_i^*$ | $R_i$ | $X_i$ | $Y_i$ |
|------|---------|-------|-------|-------|
| 1    | 1       | 1     | 0     | 1     |
| 2    | -99     | 0     | 0     | $\widehat E[Y_i \mid X_i=0] = 2/3$ |
| 3    | 1       | 1     | 0     | 1     |
| 4    | 0       | 1     | 0     | 0     |
| 5    | 1       | 1     | 1     | 1     |
| 6    | -99     | 0     | 1     | $\widehat E[Y_i \mid X_i=1] = 1$ |

We can impute the sample mean of units with the same covariate values
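A base R sketch of this conditional mean imputation on the toy table:

```r
# MAR imputation: replace each missing outcome with the mean of the
# respondents that share its covariate value
y_star <- c(1, -99, 1, 0, 1, -99)
r      <- c(1, 0, 1, 1, 1, 0)
x      <- c(0, 0, 0, 0, 1, 1)

y_imp <- y_star
for (v in unique(x)) {
  y_imp[r == 0 & x == v] <- mean(y_star[r == 1 & x == v])
}
y_imp   # unit 2 gets 2/3, unit 6 gets 1
```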

MAR with multiple covariates

| Unit | $Y_i^*$ | $R_i$ |
|------|---------|-------|
| 1    | 1       | 1     |
| 2    | -99     | 0     |
| 3    | 1       | 1     |
| 4    | 0       | 1     |
| 5    | 1       | 1     |
| 6    | -99     | 0     |

Imagine we observe covariates \(X_{1i}\) and \(X_{2i}\)

MAR with multiple covariates

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ |
|------|---------|-------|----------|----------|
| 1    | 1       | 1     | 0        | 3        |
| 2    | -99     | 0     | 0        | 7        |
| 3    | 1       | 1     | 0        | 9        |
| 4    | 0       | 1     | 0        | 5        |
| 5    | 1       | 1     | 1        | 4        |
| 6    | -99     | 0     | 1        | 3        |

Imagine we observe covariates \(X_{1i}\) and \(X_{2i}\)

Options

  1. Regression estimation

  2. Hot deck imputation

  3. Inverse-probability weighting (IPW)

Regression estimation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ |
|------|---------|-------|----------|----------|
| 1    | 1       | 1     | 0        | 3        |
| 2    | -99     | 0     | 0        | 7        |
| 3    | 1       | 1     | 0        | 9        |
| 4    | 0       | 1     | 0        | 5        |
| 5    | 1       | 1     | 1        | 4        |
| 6    | -99     | 0     | 1        | 3        |

Estimate \(\widehat Y_i^* = \widehat \beta_0 + \widehat \beta_1 X_{1i} + \widehat \beta_2 X_{2i}\) with observed data

Regression estimation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ |
|------|---------|-------|----------|----------|
| 1    | 1       | 1     | 0        | 3        |
| 2    | -99     | 0     | 0        | 7        |
| 3    | 1       | 1     | 0        | 9        |
| 4    | 0       | 1     | 0        | 5        |
| 5    | 1       | 1     | 1        | 4        |
| 6    | -99     | 0     | 1        | 3        |

Use coefficients to impute missing values

Regression estimation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $Y_i$ |
|------|---------|-------|----------|----------|-------|
| 1    | 1       | 1     | 0        | 3        | 1     |
| 2    | -99     | 0     | 0        | 7        | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 7$ |
| 3    | 1       | 1     | 0        | 9        | 1     |
| 4    | 0       | 1     | 0        | 5        | 0     |
| 5    | 1       | 1     | 1        | 4        | 1     |
| 6    | -99     | 0     | 1        | 3        | $\widehat \beta_0 + \widehat \beta_1 \cdot 1 + \widehat \beta_2 \cdot 3$ |

Which yields the same \(\widehat E[Y_i]\) as

Regression estimation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $Y_i$ |
|------|---------|-------|----------|----------|-------|
| 1    | 1       | 1     | 0        | 3        | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 3$ |
| 2    | -99     | 0     | 0        | 7        | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 7$ |
| 3    | 1       | 1     | 0        | 9        | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 9$ |
| 4    | 0       | 1     | 0        | 5        | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 5$ |
| 5    | 1       | 1     | 1        | 4        | $\widehat \beta_0 + \widehat \beta_1 \cdot 1 + \widehat \beta_2 \cdot 4$ |
| 6    | -99     | 0     | 1        | 3        | $\widehat \beta_0 + \widehat \beta_1 \cdot 1 + \widehat \beta_2 \cdot 3$ |

What if we could condition on a function of covariates instead?

Propensity score

Response propensity function

\[ p_R(\mathbf{x}) = \Pr[R_i = 1 \mid \mathbf{X}_i = \mathbf{x}] \]

  • \(p_R(\mathbf{X}_i)\) is the propensity score of unit \(i\)

  • Conditional probability of response given the covariates

  • Useful since you cannot condition on many covariates nonparametrically

  • Can estimate consistently via maximum likelihood (usually a logit)

  • Still need to assume specification is correct
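For instance, with a logit (a hypothetical simulated dataset, since the six-row toy table is too small to estimate a glm):

```r
# Simulate covariates and a response indicator whose probability
# depends on them, then estimate the response propensity by logit ML
set.seed(403)
n  <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)
r  <- rbinom(n, 1, plogis(-0.5 + x1 + 0.5 * x2))

pscore_model <- glm(r ~ x1 + x2, family = binomial)
pscore <- predict(pscore_model, type = "response")  # estimated p_R(X_i)
range(pscore)  # strictly inside (0, 1)
```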

Hot deck imputation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ |
|------|---------|-------|----------|----------|
| 1    | 2       | 1     | 0        | 3        |
| 2    | -99     | 0     | 0        | 7        |
| 3    | 3       | 1     | 0        | 9        |
| 4    | 10      | 1     | 0        | 5        |
| 5    | 12      | 1     | 1        | 4        |
| 6    | -99     | 0     | 1        | 3        |

Add propensity score

Hot deck imputation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ |
|------|---------|-------|----------|----------|---------------------|
| 1    | 2       | 1     | 0        | 3        | 0.33                |
| 2    | -99     | 0     | 0        | 7        | 0.14                |
| 3    | 3       | 1     | 0        | 9        | 0.73                |
| 4    | 10      | 1     | 0        | 5        | 0.35                |
| 5    | 12      | 1     | 1        | 4        | 0.78                |
| 6    | -99     | 0     | 1        | 3        | 0.70                |

Add propensity score

Hot deck imputation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ |
|------|---------|-------|----------|----------|---------------------|
| 1    | 2       | 1     | 0        | 3        | 0.33                |
| 2    | -99     | 0     | 0        | 7        | 0.14                |
| 3    | 3       | 1     | 0        | 9        | 0.73                |
| 4    | 10      | 1     | 0        | 5        | 0.35                |
| 5    | 12      | 1     | 1        | 4        | 0.78                |
| 6    | -99     | 0     | 1        | 3        | 0.70                |

Impute values with nearest propensity score

Hot deck imputation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ | $Y_i$ |
|------|---------|-------|----------|----------|---------------------|-------|
| 1    | 2       | 1     | 0        | 3        | 0.33                | 2     |
| 2    | -99     | 0     | 0        | 7        | 0.14                | $\widehat E[Y_i \mid p_R(X_i) = 0.14]$ |
| 3    | 3       | 1     | 0        | 9        | 0.73                | 3     |
| 4    | 10      | 1     | 0        | 5        | 0.35                | 10    |
| 5    | 12      | 1     | 1        | 4        | 0.78                | 12    |
| 6    | -99     | 0     | 1        | 3        | 0.70                | $\widehat E[Y_i \mid p_R(X_i) = 0.70]$ |

Impute values with nearest propensity score

Hot deck imputation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ | $Y_i$ |
|------|---------|-------|----------|----------|---------------------|-------|
| 1    | 2       | 1     | 0        | 3        | 0.33                | 2     |
| 2    | -99     | 0     | 0        | 7        | 0.14                | 2     |
| 3    | 3       | 1     | 0        | 9        | 0.73                | 3     |
| 4    | 10      | 1     | 0        | 5        | 0.35                | 10    |
| 5    | 12      | 1     | 1        | 4        | 0.78                | 12    |
| 6    | -99     | 0     | 1        | 3        | 0.70                | 3     |

Impute values with nearest propensity score

Hot deck what?

Hot deck imputation

| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ | $Y_i$ |
|------|---------|-------|----------|----------|---------------------|-------|
| 1    | 2       | 1     | 0        | 3        | 0.33                | 2     |
| 2    | -99     | 0     | 0        | 7        | 0.14                | 2     |
| 3    | 3       | 1     | 0        | 9        | 0.73                | 3     |
| 4    | 10      | 1     | 0        | 5        | 0.35                | 10    |
| 5    | 12      | 1     | 1        | 4        | 0.78                | 12    |
| 6    | -99     | 0     | 1        | 3        | 0.70                | 3     |

Not a good idea. Discards a lot of information!
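For completeness, a base R sketch of nearest-propensity hot deck on the toy table (propensity scores as given on the slide):

```r
# Hot deck: each nonrespondent borrows the observed outcome of the
# respondent ("donor") with the closest propensity score
y_star <- c(2, -99, 3, 10, 12, -99)
r      <- c(1, 0, 1, 1, 1, 0)
pscore <- c(0.33, 0.14, 0.73, 0.35, 0.78, 0.70)

y_imp  <- y_star
donors <- which(r == 1)
for (i in which(r == 0)) {
  nearest  <- donors[which.min(abs(pscore[donors] - pscore[i]))]
  y_imp[i] <- y_star[nearest]
}
y_imp   # unit 2 borrows 2 (from unit 1), unit 6 borrows 3 (from unit 3)
```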

Inverse probability weighting (IPW)

  • The problem is not missing data

  • But an unrepresentative sample

  • If MAR holds, we can use propensity score to weight observations

  • Inverse probability weight \(\frac{1}{p_R(\mathbf{X}_i)}\)

IPW estimator

Horvitz-Thompson estimator

\[ \widehat E_{IPW}[Y_i] = \frac{1}{n} \sum_{i=1}^n \frac{Y_i^* R_i}{\widehat p_R(\mathbf{X}_i)} \]

  • Units with low propensity scores are underrepresented among respondents \(\rightarrow\) assign greater weight

  • Units with high propensity scores \(\rightarrow\) lower weight
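On the hot deck toy table, the Horvitz-Thompson plug-in looks like this (propensity scores as given on the earlier slide):

```r
# Horvitz-Thompson / IPW estimator: respondents only, each weighted
# by the inverse of its propensity score, divided by the full n
y_star <- c(2, -99, 3, 10, 12, -99)
r      <- c(1, 0, 1, 1, 1, 0)
pscore <- c(0.33, 0.14, 0.73, 0.35, 0.78, 0.70)

ht <- mean(y_star * r / pscore)   # y_star * r zeroes out the -99s
ht

# The Hajek variant normalizes by the sum of weights instead of n
weighted.mean(y_star[r == 1], w = 1 / pscore[r == 1])
```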

Recap

Options under MAR

  1. Regression estimation

  2. Hot deck imputation

  3. Inverse-probability weighting (IPW)

Recap

Options under MAR

  1. Regression estimation

  2. Hot deck imputation

  3. Inverse-probability weighting (IPW)

  4. Doubly robust estimation

Doubly robust estimation

  • Regression specification may be incorrect

  • Propensity score specification may be incorrect

  • Combine the two in an estimator that is consistent when either specification is correct

\[ \begin{align} & \widehat E_{DR}[Y_i] = \\ &\frac{1}{n} \sum_{i=1}^n \widehat E[Y_i^*|R_i = 1, \mathbf{X}_i] + \frac{1}{n} \sum_{i=1}^n \frac{R_i(Y_i^* - \widehat E[Y_i^* | R_i = 1, \mathbf{X}_i])}{\widehat{p}_R(\mathbf{X}_i)} \end{align} \]

Doubly robust estimation

  • Regression specification may be incorrect

  • Propensity score specification may be incorrect

  • Combine the two in an estimator that is consistent when either specification is correct

\[ \widehat E_{DR}[Y_i] = \text{Regression estimator} + \text{IPW correction} \]
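A simulated sketch of the AIPW formula above (hypothetical data; note that the code slide later in the deck implements DR differently, via an IPW-weighted regression):

```r
# Doubly robust (AIPW) estimate of E[Y]: regression prediction for
# everyone, plus an IPW-weighted correction from the respondents
set.seed(403)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)            # true E[Y] = 1
r <- rbinom(n, 1, plogis(0.5 + x))   # MAR: response depends only on x
y_obs <- ifelse(r == 1, y, 0)        # zero placeholder; killed by r below

# Outcome regression fit on respondents, predicted for all units
mu_hat <- predict(lm(y ~ x, subset = r == 1), newdata = data.frame(x = x))
# Response propensity fit by logit
p_hat  <- predict(glm(r ~ x, family = binomial), type = "response")

dr <- mean(mu_hat) + mean(r * (y_obs - mu_hat) / p_hat)
dr   # close to the true E[Y] = 1
```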

With data

library(gssr)
library(tidyverse)  # for %>%, select, mutate, etc.

gss22 = gss_get_yr(2022)

gss = gss22 %>% 
  select(vote20) %>% 
  mutate(vote = ifelse(vote20 == 1, 1, 0))

gss %>% 
  group_by(vote) %>% 
  tally()
# A tibble: 3 × 2
   vote     n
  <dbl> <int>
1     0   998
2     1  2878
3    NA   273

Bounds as weighted average

Get proportion missing

pr_response = mean(!is.na(gss$vote))

pr_response
[1] 0.934201

Lower bound:

mean(gss$vote, na.rm = TRUE) * pr_response + 0 * (1 - pr_response)
[1] 0.6936611

Upper bound:

mean(gss$vote, na.rm = TRUE) * pr_response + 1 * (1-pr_response)
[1] 0.7594601

Bounds by imputation

gss %>% 
  select(vote) %>% 
  mutate(
    vote_lo = ifelse(is.na(vote), 0, vote),
    vote_hi = ifelse(is.na(vote), 1, vote)) %>% 
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    lower_bound = mean(vote_lo),
    upper_bound = mean(vote_hi)
  )
# A tibble: 1 × 3
  complete_cases lower_bound upper_bound
           <dbl>       <dbl>       <dbl>
1          0.743       0.694       0.759

MCAR by imputation

vote_mean = mean(gss$vote, na.rm = TRUE)

gss %>% 
  select(vote) %>% 
  mutate(
    vote_mcar = ifelse(is.na(vote), vote_mean, vote)) %>% 
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    mcar = mean(vote_mcar)
  )
# A tibble: 1 × 2
  complete_cases  mcar
           <dbl> <dbl>
1          0.743 0.743

Regression imputation (MAR)

Make a new data with two more variables

gss_wide = gss22 %>% 
  select(vote20, sex, partyid) %>% 
  mutate(vote = ifelse(vote20 == 1, 1, 0),
         female = ifelse(sex == 2, 1, 0),
         dem = ifelse(partyid <= 2, 1, 0)) %>% 
  select(vote, female, dem)

head(gss_wide, 3)
# A tibble: 3 × 3
   vote female   dem
  <dbl>  <dbl> <dbl>
1     1      1     1
2     1      0     0
3     1      1     0

Regression imputation (MAR)

outcome_model = lm(vote ~ female + dem,
                   data = gss_wide)

gss_wide$vote_pred = predict(outcome_model, 
                    # newdata needed: by default predict.lm returns fitted
                    # values only for the complete cases used in fitting
                    newdata = gss_wide)

head(gss_wide)
# A tibble: 6 × 4
   vote female   dem vote_pred
  <dbl>  <dbl> <dbl>     <dbl>
1     1      1     1     0.842
2     1      0     0     0.663
3     1      1     0     0.669
4     1      1     1     0.842
5     1      0     0     0.663
6     1      0     1     0.836

Regression imputation (MAR)

gss_wide %>% 
  mutate(
    vote_reg = ifelse(is.na(vote), vote_pred, vote)
  ) %>% 
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    # Some still missing have missing data in the covariates
    mar_reg_1 = mean(vote_reg, na.rm = TRUE),
    mar_reg_2 = mean(vote_pred, na.rm = TRUE)
  )
# A tibble: 1 × 3
  complete_cases mar_reg_1 mar_reg_2
           <dbl>     <dbl>     <dbl>
1          0.743     0.740     0.741

IPW (MAR)

# Make a binary indicator of missingness
gss_wide = gss_wide %>% 
  mutate(
    response = ifelse(!is.na(vote), 1, 0)
  ) %>% 
  filter(!is.na(female) & !is.na(dem))

pscore_model = glm(
  response ~ female + dem,
  family = binomial,
  data = gss_wide
)

# Create variable with probabilities
gss_wide$pscore = predict(pscore_model, type = "response")

# Convert to IPW
gss_wide$ipw = 1/gss_wide$pscore

IPW (MAR)

gss_wide %>% 
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    mar_ipw = weighted.mean(vote, w = ipw, na.rm = TRUE)
    )
# A tibble: 1 × 2
  complete_cases mar_ipw
           <dbl>   <dbl>
1          0.744   0.741

Doubly robust (MAR)

doubly_robust = lm(vote ~ female + dem,
                   data = gss_wide,
                   weights = ipw)

gss_wide$vote_dr = predict(doubly_robust,
                           newdata = gss_wide)

gss_wide %>% 
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    mar_dr = mean(vote_dr)
  )
# A tibble: 1 × 2
  complete_cases mar_dr
           <dbl>  <dbl>
1          0.744  0.741

Wrapping up

  • Missing data means the i.i.d. assumption does not hold, even with a random/representative sample

  • But i.i.d. is assumed out of convenience, not accuracy

  • Many researchers are unbothered by missing data because it often makes little practical difference

  • But sometimes ignoring it creates problems for inference or for the application of other methods

  • You want to (at least) show that ignoring missing data is not a big deal