POLI_SCI 403: Probability and Statistics
Identification
Missing data: Stable outcomes framework
Approaches to missing outcome data
Data example (if enough time)
Multiple imputation (lab)
Statistical inference: From observed to unobserved
Needs i.i.d. assumption to connect point estimates to population quantities
Alternative approach: We want to tell a story about what we observe that holds true regardless of what remains unobserved
Examples?
This is the domain of nonparametric identification
A “model” is identifiable if it is theoretically possible to learn the true values of its parameters
Example: If we can credibly approximate \(E(X)\) with the sample mean, then we can say \(E(X)\) is identified
Validity: An estimator is valid if we believe that it measures what it claims to measure
Validity implies low bias and consistency
But remember these properties depend on the data!
A statistical functional of a random vector is point-identified by a set of assumptions if there exists one and only one value of that statistical functional that is logically compatible with the CDF of the unobservable vector given the assumptions.
Should we worry?
| Unit | $Y_i^*$ | $R_i$ | $Y_i$ |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | -99 | 0 | 0 |
| 3 | 1 | 1 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
| 6 | -99 | 0 | 0 |
Stable outcomes framework
\(Y_i\): True outcome (observed only when \(R_i = 1\))
\(R_i\): Recorded (1) or missing (0)
\(Y_i^*\): Observed outcome
\(Y_i^* = Y_i R_i + (-99) (1-R_i)\)
\(E[Y_i] = 3/6\); complete-case mean \(E[Y_i^* \mid R_i = 1] = 3/4\)
Challenge: \(E[Y_i]\) is not point-identified
Make minimal assumptions
Make heroic assumptions
| Unit | $Y_i^*$ | $R_i$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | -99 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 1 |
| 6 | -99 | 0 |
Plug-in worst case scenario to find lower and upper bound
| Unit | $Y_i^*$ | $R_i$ | $Y_i^L$ | $Y_i^U$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 2 | -99 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 1 | 1 | 1 | 1 |
| 6 | -99 | 0 | 0 | 1 |
Sharp bounds: \(\widehat E[Y_i] \in [\frac{3}{6}, \frac{5}{6}]\)
You can write the plug-in bounds as weighted averages, where \(a\) and \(b\) are the smallest and largest possible outcome values (here 0 and 1)
\[ \textbf{Lower bound: } \frac{1}{n}\sum_{i=1}^n (Y_i^*R_i + a (1-R_i)) \]
\[ \textbf{Upper bound: } \frac{1}{n}\sum_{i=1}^n (Y_i^*R_i + b (1-R_i)) \]
Meaning you do not need to know which observations are missing, just how many!
They are sharp because we cannot narrow them without additional assumptions that the data alone cannot justify
So we say \(E[Y_i]\) is partially identified under these assumptions
The width of the bounds is proportional to the amount of missing data
And sometimes you can use auxiliary information to refine bounds
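The worst-case bounds above can be computed directly; a minimal Python sketch using the six-unit table (the `-99` code marks missing outcomes):

```python
# Worst-case (plug-in) bounds for E[Y] with a missing binary outcome.
# Values taken from the six-unit table; -99 codes a missing outcome.
y_star = [1, -99, 1, 0, 1, -99]   # observed outcome (with missingness code)
r      = [1,   0, 1, 1, 1,   0]   # 1 = recorded, 0 = missing

n = len(y_star)
a, b = 0, 1  # smallest and largest possible outcome values

# Plug in the worst case for every missing unit
lower = sum(y * ri + a * (1 - ri) for y, ri in zip(y_star, r)) / n
upper = sum(y * ri + b * (1 - ri) for y, ri in zip(y_star, r)) / n

print(lower, upper)  # 0.5 and 5/6 ≈ 0.833
```

Note that multiplying by \(R_i\) zeroes out the \(-99\) codes, so the missingness code itself never enters the estimate.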
Challenge: Social scientists don’t know what to do with bounds
Baseline: Missing not at random (MNAR/NMAR)
\[Y_i \not\!\perp\!\!\!\perp R_i\]
MCAR: Missing completely at random
\[ Y_i \perp \!\!\! \perp R_i \]
MAR: Missing at random after conditioning on covariates (aka ignorability)
\[ Y_i \perp \!\!\! \perp R_i | \mathbf{X}_i \]
| Unit | $Y_i^*$ | $R_i$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | -99 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 1 |
| 6 | -99 | 0 |
We assume the distribution of outcomes is the same for observed and unobserved
Which is equivalent to imputing missing obs with the sample mean
| Unit | $Y_i^*$ | $R_i$ | $Y_i$ |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | -99 | 0 | $\widehat E[Y_i] = 3/4$ |
| 3 | 1 | 1 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
| 6 | -99 | 0 | $\widehat E[Y_i] = 3/4$ |
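The equivalence is easy to verify numerically; a Python sketch with the table's values:

```python
# Under MCAR, imputing the complete-case mean leaves the estimate unchanged.
y_star = [1, -99, 1, 0, 1, -99]   # observed outcome (-99 = missing)
r      = [1,   0, 1, 1, 1,   0]   # response indicator

observed = [y for y, ri in zip(y_star, r) if ri == 1]
cc_mean = sum(observed) / len(observed)   # complete-case mean = 3/4

# Fill in each missing unit with the complete-case mean
imputed = [y if ri == 1 else cc_mean for y, ri in zip(y_star, r)]
mcar_est = sum(imputed) / len(imputed)

print(cc_mean, mcar_est)  # 0.75 0.75
```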
We assume conditional independence
| Unit | $Y_i^*$ | $R_i$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | -99 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 1 |
| 6 | -99 | 0 |
Imagine we now observe covariate \(X_i\)
| Unit | $Y_i^*$ | $R_i$ | $X_i$ |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| 2 | -99 | 0 | 0 |
| 3 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
| 6 | -99 | 0 | 1 |
MAR lets us impute the sample mean of units with the same covariate values
| Unit | $Y_i^*$ | $R_i$ | $X_i$ | $Y_i$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 |
| 2 | -99 | 0 | 0 | $\widehat E[Y_i|X_i=0] = 2/3$ |
| 3 | 1 | 1 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 1 | 1 | 1 | 1 |
| 6 | -99 | 0 | 1 | $\widehat E[Y_i|X_i=1] = 1$ |
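The stratum-by-stratum imputation above can be sketched in Python with the table's values:

```python
# MAR: impute the complete-case mean within strata of the covariate X.
y_star = [1, -99, 1, 0, 1, -99]   # observed outcome (-99 = missing)
r      = [1,   0, 1, 1, 1,   0]   # response indicator
x      = [0,   0, 0, 0, 1,   1]   # binary covariate

def stratum_mean(value):
    """Complete-case mean among units with X_i == value."""
    cell = [y for y, ri, xi in zip(y_star, r, x) if ri == 1 and xi == value]
    return sum(cell) / len(cell)

# Missing units get the mean of observed units with the same covariate value
imputed = [y if ri == 1 else stratum_mean(xi)
           for y, ri, xi in zip(y_star, r, x)]
mar_est = sum(imputed) / len(imputed)

print(mar_est)  # (1 + 2/3 + 1 + 0 + 1 + 1) / 6 = 7/9 ≈ 0.778
```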
Imagine we observe covariates \(X_{1i}\) and \(X_{2i}\)
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 3 |
| 2 | -99 | 0 | 0 | 7 |
| 3 | 1 | 1 | 0 | 9 |
| 4 | 0 | 1 | 0 | 5 |
| 5 | 1 | 1 | 1 | 4 |
| 6 | -99 | 0 | 1 | 3 |
Regression estimation
Hot deck imputation
Inverse-probability weighting (IPW)
Estimate \(\widehat Y_i^* = \widehat \beta_0 + \widehat \beta_1 X_{1i} + \widehat \beta_2 X_{2i}\) with observed data
Use coefficients to impute missing values
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $Y_i$ |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 3 | 1 |
| 2 | -99 | 0 | 0 | 7 | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 7$ |
| 3 | 1 | 1 | 0 | 9 | 1 |
| 4 | 0 | 1 | 0 | 5 | 0 |
| 5 | 1 | 1 | 1 | 4 | 1 |
| 6 | -99 | 0 | 1 | 3 | $\widehat \beta_0 + \widehat \beta_1 \cdot 1 + \widehat \beta_2 \cdot 3$ |
Which yields the same \(\widehat E[Y_i]\) as
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $Y_i$ |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 3 | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 3$ |
| 2 | -99 | 0 | 0 | 7 | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 7$ |
| 3 | 1 | 1 | 0 | 9 | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 9$ |
| 4 | 0 | 1 | 0 | 5 | $\widehat \beta_0 + \widehat \beta_1 \cdot 0 + \widehat \beta_2 \cdot 5$ |
| 5 | 1 | 1 | 1 | 4 | $\widehat \beta_0 + \widehat \beta_1 \cdot 1 + \widehat \beta_2 \cdot 4$ |
| 6 | -99 | 0 | 1 | 3 | $\widehat \beta_0 + \widehat \beta_1 \cdot 1 + \widehat \beta_2 \cdot 3$ |
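The equivalence claimed above follows because OLS residuals sum to zero over the estimation sample; a minimal sketch with the table's values (Python with numpy assumed):

```python
# Regression imputation via OLS, and the equivalence between
# (a) imputing only the missing units and (b) averaging fitted values
# for everyone: both give the same mean.
import numpy as np

y_star = np.array([1, -99, 1, 0, 1, -99], dtype=float)  # -99 = missing
r      = np.array([1,   0, 1, 1, 1,   0])
x1     = np.array([0,   0, 0, 0, 1,   1], dtype=float)
x2     = np.array([3,   7, 9, 5, 4,   3], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2])  # intercept, X1, X2
obs = r == 1

# Fit the outcome regression on recorded units only
beta, *_ = np.linalg.lstsq(X[obs], y_star[obs], rcond=None)

fitted  = X @ beta                       # predictions for all six units
imputed = np.where(obs, y_star, fitted)  # fill in the missing units only

print(imputed.mean(), fitted.mean())  # identical
```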
What if we could condition on a function of covariates instead?
Response propensity function
\[ p_R(\mathbf{x}) = \Pr[R_i = 1 \mid \mathbf{X}_i = \mathbf{x}] \]
\(p_R(\mathbf{X}_i)\) is the propensity score of unit \(i\)
Conditional probability of response given the covariates
Useful since you cannot condition on many covariates nonparametrically
Can estimate consistently via maximum likelihood (usually a logit)
Still need to assume specification is correct
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ |
|---|---|---|---|---|
| 1 | 2 | 1 | 0 | 3 |
| 2 | -99 | 0 | 0 | 7 |
| 3 | 3 | 1 | 0 | 9 |
| 4 | 10 | 1 | 0 | 5 |
| 5 | 12 | 1 | 1 | 4 |
| 6 | -99 | 0 | 1 | 3 |
Add propensity score
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ |
|---|---|---|---|---|---|
| 1 | 2 | 1 | 0 | 3 | 0.33 |
| 2 | -99 | 0 | 0 | 7 | 0.14 |
| 3 | 3 | 1 | 0 | 9 | 0.73 |
| 4 | 10 | 1 | 0 | 5 | 0.35 |
| 5 | 12 | 1 | 1 | 4 | 0.78 |
| 6 | -99 | 0 | 1 | 3 | 0.70 |
Impute values with nearest propensity score
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ | $Y_i$ |
|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 0 | 3 | 0.33 | 2 |
| 2 | -99 | 0 | 0 | 7 | 0.14 | $\widehat E[Y_i|p_R(X_i) = 0.14]$ |
| 3 | 3 | 1 | 0 | 9 | 0.73 | 3 |
| 4 | 10 | 1 | 0 | 5 | 0.35 | 10 |
| 5 | 12 | 1 | 1 | 4 | 0.78 | 12 |
| 6 | -99 | 0 | 1 | 3 | 0.70 | $\widehat E[Y_i|p_R(X_i) = 0.70]$ |
Which resolves to:
| Unit | $Y_i^*$ | $R_i$ | $X_{1i}$ | $X_{2i}$ | $p_R(\mathbf{X}_i)$ | $Y_i$ |
|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 0 | 3 | 0.33 | 2 |
| 2 | -99 | 0 | 0 | 7 | 0.14 | 2 |
| 3 | 3 | 1 | 0 | 9 | 0.73 | 3 |
| 4 | 10 | 1 | 0 | 5 | 0.35 | 10 |
| 5 | 12 | 1 | 1 | 4 | 0.78 | 12 |
| 6 | -99 | 0 | 1 | 3 | 0.70 | 3 |
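The nearest-pscore step above is easy to sketch in Python (values from the table):

```python
# Hot deck: impute each missing unit with the observed outcome of the
# unit whose propensity score is nearest.
y_star = [2, -99, 3, 10, 12, -99]           # observed outcome (-99 = missing)
r      = [1,   0, 1,  1,  1,   0]           # response indicator
pscore = [0.33, 0.14, 0.73, 0.35, 0.78, 0.70]

# Donor pool: (propensity score, outcome) of recorded units
donors = [(p, y) for y, ri, p in zip(y_star, r, pscore) if ri == 1]

def nearest_donor(p):
    """Observed outcome of the donor with the closest propensity score."""
    return min(donors, key=lambda d: abs(d[0] - p))[1]

imputed = [y if ri == 1 else nearest_donor(p)
           for y, ri, p in zip(y_star, r, pscore)]
est = sum(imputed) / len(imputed)

print(imputed, est)  # [2, 2, 3, 10, 12, 3] 5.33...
```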
Not a good idea. Discards a lot of information!
The problem is not missing data
But an unrepresentative sample
If MAR holds, we can use propensity score to weight observations
Inverse probability weight \(\frac{1}{p_R(\mathbf{X}_i)}\)
Horvitz-Thompson estimator
\[ \widehat E_{IPW}[Y_i] = \frac{1}{n} \sum_{i=1}^n \frac{Y_i^* R_i}{\widehat p_R(\mathbf{X}_i)} \]
Units with low propensity scores are underrepresented among respondents \(\rightarrow\) assign greater weight
Units with high propensity scores \(\rightarrow\) assign lower weight
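The Horvitz-Thompson estimator above, applied to the toy table's propensity scores:

```python
# IPW / Horvitz-Thompson estimate of E[Y] using the table's scores.
y_star = [2, -99, 3, 10, 12, -99]            # observed outcome (-99 = missing)
r      = [1,   0, 1,  1,  1,   0]            # response indicator
pscore = [0.33, 0.14, 0.73, 0.35, 0.78, 0.70]

n = len(y_star)
# Missing units contribute zero (R_i = 0); recorded units are
# weighted up by the inverse of their response probability.
ipw_est = sum(y * ri / p for y, ri, p in zip(y_star, r, pscore)) / n

print(round(ipw_est, 2))  # 9.02
```

Note how unit 4, with a low score (0.35), gets a large weight: it stands in for the kinds of units that tend to go missing.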
Options under MAR
Regression estimation
Hot deck imputation
Inverse-probability weighting (IPW)
Doubly robust estimation
Regression specification may be incorrect
Propensity score specification may be incorrect
Combining the two in a model that yields consistent estimates when either is correct
\[ \begin{align} & \widehat E_{DR}[Y_i] = \\ &\frac{1}{n} \sum_{i=1}^n \widehat E[Y_i^*|R_i = 1, \mathbf{X}_i] + \frac{1}{n} \sum_{i=1}^n \frac{R_i(Y_i^* - \widehat E[Y_i^* | R_i = 1, \mathbf{X}_i])}{\widehat{p}_R(\mathbf{X}_i)} \end{align} \]
\[ \widehat E_{DR}[Y_i] = \text{Regression estimator} + \text{IPW correction} \]
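Putting the two pieces together on the same toy table (a Python sketch; numpy assumed, OLS as the outcome model and the table's propensity scores as given):

```python
# Doubly robust estimate: outcome regression plus an IPW correction
# applied to its residuals.
import numpy as np

y_star = np.array([2, -99, 3, 10, 12, -99], dtype=float)  # -99 = missing
r      = np.array([1,   0, 1,  1,  1,   0])
x1     = np.array([0,   0, 0,  0,  1,   1], dtype=float)
x2     = np.array([3,   7, 9,  5,  4,   3], dtype=float)
pscore = np.array([0.33, 0.14, 0.73, 0.35, 0.78, 0.70])

X = np.column_stack([np.ones_like(x1), x1, x2])
obs = r == 1

# Outcome model on recorded units; mu_hat = E-hat[Y* | R_i = 1, X_i]
beta, *_ = np.linalg.lstsq(X[obs], y_star[obs], rcond=None)
mu_hat = X @ beta

# DR = mean regression prediction + mean inverse-weighted residual;
# multiplying by r zeroes out the residuals of missing units.
dr_est = mu_hat.mean() + np.mean(r * (y_star - mu_hat) / pscore)

print(dr_est)
```

If the outcome model were exactly right, the correction term would average to zero; if instead the propensity scores were exactly right, the correction would undo the regression's bias. That is the "either one correct suffices" logic.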
Get the complete-case mean and worst-case bounds

```r
gss %>%
  select(vote) %>%
  mutate(
    vote_lo = ifelse(is.na(vote), 0, vote),
    vote_hi = ifelse(is.na(vote), 1, vote)
  ) %>%
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    lower_bound = mean(vote_lo),
    upper_bound = mean(vote_hi)
  )
```

```
# A tibble: 1 × 3
  complete_cases lower_bound upper_bound
           <dbl>       <dbl>       <dbl>
1          0.743       0.694       0.759
```
Fit the outcome model and add predictions for every unit

```r
outcome_model = lm(vote ~ female + dem, data = gss_wide)

# Pass the full data so we get predictions for every unit
# (predict.lm otherwise defaults to the estimation sample)
gss_wide$vote_pred = predict(outcome_model, newdata = gss_wide)

head(gss_wide)
```

```
# A tibble: 6 × 4
   vote female   dem vote_pred
  <dbl>  <dbl> <dbl>     <dbl>
1     1      1     1     0.842
2     1      0     0     0.663
3     1      1     0     0.669
4     1      1     1     0.842
5     1      0     0     0.663
6     1      0     1     0.836
```
```r
gss_wide %>%
  mutate(
    vote_reg = ifelse(is.na(vote), vote_pred, vote)
  ) %>%
  summarize(
    complete_cases = mean(vote, na.rm = TRUE),
    # Some units stay missing because their covariates are missing too
    mar_reg_1 = mean(vote_reg, na.rm = TRUE),
    mar_reg_2 = mean(vote_pred, na.rm = TRUE)
  )
```

```
# A tibble: 1 × 3
  complete_cases mar_reg_1 mar_reg_2
           <dbl>     <dbl>     <dbl>
1          0.743     0.740     0.741
```
```r
# Make a binary indicator of missingness
gss_wide = gss_wide %>%
  mutate(
    response = ifelse(!is.na(vote), 1, 0)
  ) %>%
  filter(!is.na(female) & !is.na(dem))

pscore_model = glm(
  response ~ female + dem,
  family = binomial,
  data = gss_wide
)

# Create variable with predicted response probabilities
gss_wide$pscore = predict(pscore_model, type = "response")

# Convert to inverse probability weights
gss_wide$ipw = 1 / gss_wide$pscore
```

Missing data means the i.i.d. assumption does not hold even with a random/representative sample
But i.i.d. is assumed out of convenience, not accuracy
Many are unbothered by missing data because it does not make a difference
But sometimes ignoring it creates problems for inference or for the application of other methods
You want to (at least) show that ignoring missing data is not a big deal