Uncertainty

POLI_SCI 403: Probability and Statistics

Agenda

  • Confidence intervals

  • Hypothesis testing

  • Lab

So far

  • Random variables to think about statistical properties before collecting data

  • i.i.d. sample to enable inference from estimators to estimands

  • Statistical properties of (point) estimators

This week: Convey uncertainty around estimates

Remember

There are two kinds of variance estimators

Sample variance: \(\widehat V[X]\) (…of random variable \(X\))

Sampling variance: \(V[\overline X]\) (…of an estimator)

We usually report sample variance (or SD) to describe our data

We use sampling variance to convey uncertainty around the estimates we produce
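A toy sketch of the distinction in R (the vector x and the seed here are hypothetical, not from our data):

set.seed(123)
x = rnorm(100)       # toy sample
var(x)               # sample variance: describes the spread of the data
var(x) / length(x)   # estimated sampling variance of the sample mean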

Data

Back to GSS

library(gssr)
library(tidyverse) # pipe, select(), mutate(), drop_na()

gss22 = gss_get_yr(2022)

gss = gss22 %>% 
  select(vote20) %>% 
  mutate(vote = ifelse(vote20 == 1, 1, 0)) %>% 
  drop_na()
gss
# A tibble: 3,876 × 2
   vote20            vote
   <dbl+lbl>        <dbl>
 1 1 [voted]            1
 2 1 [voted]            1
 3 1 [voted]            1
 4 1 [voted]            1
 5 1 [voted]            1
 6 1 [voted]            1
 7 2 [did not vote]     0
 8 1 [voted]            1
 9 1 [voted]            1
10 1 [voted]            1
# ℹ 3,866 more rows

Pretend the whole sample is the population

Make a function to get the sample mean

sample_mean = function(data, n){
  data %>% 
    sample_n(size = n) %>% 
    summarize(
      mean = mean(vote)
    )
}

Check

set.seed(1234)
sample_mean(gss, 100)
# A tibble: 1 × 1
   mean
  <dbl>
1   0.8

Repeat many times

library(mosaic) # do() comes from the mosaic package

set.seed(1234)
vote_df = do(1000) * sample_mean(gss, 100)
vote_df
     mean
1    0.74
2    0.75
3    0.76
4    0.72
5    0.75
6    0.69
7    0.79
8    0.76
9    0.75
10   0.69
 [ 989 rows omitted ]
1000 0.67

This gives a resampling distribution

ggplot(vote_df) +
  aes(x = mean) +
  geom_density(linewidth = 2,
               color = "purple") +
  labs(x = "Estimate",
       y = "Density")

Then we estimate the variance of the resampling distribution

vote_df %>% 
  summarize(
    variance = var(mean),
    std.error = sd(mean)
  )
    variance  std.error
1 0.00182398 0.04270808

Good?

Hold on

No one repeats the study many times!

If you have resources for 100 participants 1,000 times…

…you have resources for 100,000 participants one time!

What we do instead

Leverage asymptotic properties (CLT) to find a shortcut to calculate uncertainty around our estimate without having to redo the whole study many times

This is called calculating standard errors (confidence intervals, p-values) via analytic derivation

They are (only) asymptotically valid if the i.i.d. assumption holds

This implies the CLT will “kick in” with a large enough sample
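For the sample mean, the shortcut is \(\widehat{\text{SE}} = \widehat \sigma[X] / \sqrt{n}\), computable from a single sample. A sketch using one draw of 100 from our pretend population:

set.seed(1234)
gss %>% 
  sample_n(size = 100) %>% 
  summarize(
    mean = mean(vote),
    std.error = sd(vote) / sqrt(n())  # analytic SE from ONE sample
  )

This should land close to the simulated standard error from before, without redoing the study 1,000 times.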

Confidence intervals

Steps

Choose \(\alpha \in (0,1)\)

Confidence level is \(100 \times (1-\alpha)\)%

Choose estimand \(\theta\) and estimator \(\widehat \theta\)

Steps

Then we get normal approximation-based confidence intervals

\[ CI_{1-\alpha}(\theta) = (\widehat \theta - z_{(1-\frac{\alpha}{2})} \sqrt{\widehat V [\widehat{\theta}]}, \widehat \theta + z_{(1-\frac{\alpha}{2})} \sqrt{\widehat V [\widehat{\theta}]}) \]

where \(z_*\) denotes the quantile of the standard normal distribution \(N(0,1)\)

Steps

Then we get normal approximation-based confidence intervals

\[ CI_{1-\alpha}(\theta) = (\widehat \theta - z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\theta}], \widehat \theta + z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\theta}]) \]

where \(z_*\) denotes the quantile of the standard normal distribution \(N(0,1)\)

Why the standard normal?

The idea is that by asymptotic normality

\[ \sqrt{n} (\widehat \theta - \theta) \xrightarrow{d} N(0, \phi^2) \]

which is annoying because the variance \(\phi^2\) is unknown

But if i.i.d. holds, standardizing by the estimated standard error gives a transformation \(Z = \frac{\widehat \theta - \theta}{\sqrt{\widehat V[\widehat \theta]}}\) with a known limiting distribution

\[ Z \xrightarrow{d} N(0,1) \]

Why the standard normal?

| Confidence level | \(\alpha\) | \(z\) |
|------------------|------------|-------|
| 90%              | 0.10       | 1.64  |
| 95%              | 0.05       | 1.96  |
| 99%              | 0.01       | 2.58  |
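These quantiles come straight from the standard normal CDF, e.g. in R:

qnorm(1 - c(0.10, 0.05, 0.01) / 2)
[1] 1.644854 1.959964 2.575829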

Why 95%?

CIs for the sample mean

\[ CI_{1-\alpha}(\theta) = (\widehat \theta - z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\theta}], \widehat \theta + z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\theta}]) \]

CIs for the sample mean

\[ CI_{1-\alpha}(\mu) = (\widehat \mu - z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\mu}], \widehat \mu + z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\mu}]) \]

CIs for the sample mean

\[ CI_{0.95}(\mu) = (\widehat \mu - z_{(0.975)} \sigma [\widehat{\mu}], \widehat \mu + z_{(0.975)} \sigma [\widehat{\mu}]) \]

CIs for the sample mean

\[ CI_{0.95}(\mu) = (\widehat \mu - 1.96 \times \sigma [\widehat{\mu}], \widehat \mu + 1.96 \times \sigma [\widehat{\mu}]) \]

CIs for the sample mean

\[ CI_{0.95}(\mu) = (\overline X - 1.96 \times \widehat \sigma [\overline{X}], \overline X + 1.96 \times \widehat \sigma [\overline{X}]) \]

CIs for the sample mean

\[ CI_{0.95}(\mu) = (\overline X - 1.96 \times \text{SE}, \overline X + 1.96 \times \text{SE}) \]

CIs for the sample mean

\[ CI_{0.95}(\mu) = \overline X \pm 1.96 \times \text{SE} \]

CIs for the sample mean

So we go from this

\[ CI_{1-\alpha}(\theta) = (\widehat \theta - z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\theta}], \widehat \theta + z_{(1-\frac{\alpha}{2})} \sigma [\widehat{\theta}]) \]

To this

\[ CI_{0.95}(\mu) = \overline X \pm 1.96 \times \text{SE} \]

CIs for the sample mean

\[ CI_{0.95}(\mu) = \overline X \pm 1.96 \times \text{SE} \]

How do we interpret?

Informally: With 95% probability,

CIs for the sample mean

\[ CI_{0.95}(\mu) = \overline X \pm 1.96 \times \text{SE} \]

How do we interpret?

Informally: With 95% probability, this interval contains \(E[X]\)

Formally:

\[ Pr[\theta \in CI_{(1-\alpha)}] \geq 1-\alpha \]

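A by-hand version in R, using the normal quantile (the t quantile is nearly identical at this n):

xbar = mean(gss$vote)
se = sd(gss$vote) / sqrt(nrow(gss))
c(xbar - 1.96 * se, xbar + 1.96 * se)  # should closely match t.test() below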

In R

Base R

t.test(gss$vote)

    One Sample t-test

data:  gss$vote
t = 105.71, df = 3875, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.7287468 0.7562894
sample estimates:
mean of x 
0.7425181 
t.test(gss$vote)$conf.int
[1] 0.7287468 0.7562894
attr(,"conf.level")
[1] 0.95

In R

Base R

t.test(gss$vote)

    One Sample t-test

data:  gss$vote
t = 105.71, df = 3875, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.7287468 0.7562894
sample estimates:
mean of x 
0.7425181 
t.test(gss$vote)$conf.int[1]
[1] 0.7287468

In R

Base R

t.test(gss$vote)

    One Sample t-test

data:  gss$vote
t = 105.71, df = 3875, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.7287468 0.7562894
sample estimates:
mean of x 
0.7425181 
t.test(gss$vote)$conf.int[2]
[1] 0.7562894

In R

Tidyverse

library(infer)

gss %>% 
  t_test(response = vote)
# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1      106.  3875       0 two.sided      0.743    0.729    0.756

In R

Tidyverse

library(infer)

gss %>% 
  t_test(response = vote) %>% 
  select(estimate, lower_ci, upper_ci)
# A tibble: 1 × 3
  estimate lower_ci upper_ci
     <dbl>    <dbl>    <dbl>
1    0.743    0.729    0.756

Tidyverse is easier to plot

conf_df = gss %>% 
  t_test(response = vote)

ggplot(conf_df) +
  aes(x = 1, y = estimate) +
  geom_point(size = 3) +
  geom_linerange(aes(x = 1, 
                     ymin = lower_ci,
                     ymax = upper_ci),
                 linewidth = 1) +
  
  # hide the x axis
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Asymptotic validity

Confidence intervals are valid asymptotically

Meaning

\[ \lim_{n \rightarrow \infty} Pr[\theta \in CI_{(1-\alpha)}] \geq 1- \alpha \]

They are only guaranteed coverage \((1-\alpha)\) when \(n\) is very large

Asymptotic validity

Confidence intervals are valid asymptotically

Distinction:

  1. Nominal coverage: Intended coverage probability \((1 - \alpha)\)
  2. Actual coverage: Empirical coverage probability

How large?
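One way to find out is to simulate actual coverage directly at a given n (a sketch, reusing our pretend population):

mu = mean(gss$vote)  # the "population" mean under our pretense

set.seed(1234)
covered = replicate(1000, {
  x = sample(gss$vote, size = 100)
  ci = t.test(x)$conf.int
  ci[1] <= mu & mu <= ci[2]
})
mean(covered)  # actual coverage; should be near the nominal 0.95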

Hypothesis Testing

The lady tasting tea

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup

How do you evaluate this?

An experiment

  • Suppose we have eight milk tea cups

  • 4 milk first, 4 tea first

  • We arrange them in random order

  • Lady knows there are 4 of each, but not which ones

Results

| Lady's Guesses | True: Tea First | True: Milk First |
|----------------|-----------------|------------------|
| Tea First      | 3               | 1                |
| Milk First     | 1               | 3                |
  • Lady gets it right \(6/8\) times

  • What can we conclude?

Problem

  • What does “being able to discriminate” look like?

  • WE DON’T KNOW!

  • We do know what a person without the ability to discriminate milk/tea order looks like

  • This is our null hypothesis (\(H_0\))

  • Which lets us make probability statements about this hypothetical world of no effect

A person with no ability

| Count | Possible combinations              | Total                |
|-------|------------------------------------|----------------------|
| 0     | xxxx                               | \(1 \times 1 = 1\)   |
| 1     | xxxo, xxox, xoxx, oxxx             | \(4 \times 4 = 16\)  |
| 2     | xxoo, xoxo, xoox, oxox, ooxx, oxxo | \(6 \times 6 = 36\)  |
| 3     | xooo, oxoo, ooxo, ooox             | \(4 \times 4 = 16\)  |
| 4     | oooo                               | \(1 \times 1 = 1\)   |
  • This is a symmetrical problem!

A person with no ability

| Count | Possible combinations              | Total                |
|-------|------------------------------------|----------------------|
| 0     | xxxx                               | \(1 \times 1 = 1\)   |
| 1     | xxxo, xxox, xoxx, oxxx             | \(4 \times 4 = 16\)  |
| 2     | xxoo, xoxo, xoox, oxox, ooxx, oxxo | \(6 \times 6 = 36\)  |
| 3     | xooo, oxoo, ooxo, ooox             | \(4 \times 4 = 16\)  |
| 4     | oooo                               | \(1 \times 1 = 1\)   |

A person with no ability

| Count | Possible combinations              | Total                |
|-------|------------------------------------|----------------------|
| 0     | xxxx                               | \(1 \times 1 = 1\)   |
| 1     | xxxo, xxox, xoxx, oxxx             | \(4 \times 4 = 16\)  |
| 2     | xxoo, xoxo, xoox, oxox, ooxx, oxxo | \(6 \times 6 = 36\)  |
| 3     | xooo, oxoo, ooxo, ooox             | \(4 \times 4 = 16\)  |
| 4     | oooo                               | \(1 \times 1 = 1\)   |
  • A person guessing at random gets \(6/8\) cups right with probability \(\frac{16}{70} \approx 0.23\)

A person with no ability

| Count | Possible combinations              | Total                |
|-------|------------------------------------|----------------------|
| 0     | xxxx                               | \(1 \times 1 = 1\)   |
| 1     | xxxo, xxox, xoxx, oxxx             | \(4 \times 4 = 16\)  |
| 2     | xxoo, xoxo, xoox, oxox, ooxx, oxxo | \(6 \times 6 = 36\)  |
| 3     | xooo, oxoo, ooxo, ooox             | \(4 \times 4 = 16\)  |
| 4     | oooo                               | \(1 \times 1 = 1\)   |
  • And at least \(6/8\) cups with \(\frac{16 + 1}{70} \approx 0.24\)
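These counts follow a hypergeometric distribution, so we can verify them in R:

choose(4, 0:4)^2           # 1 16 36 16 1; they sum to choose(8, 4) = 70
sum(dhyper(3:4, 4, 4, 4))  # Pr(at least 6/8 right) = 17/70 ≈ 0.243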

Another way to look at it

| Count | Correct | Combinations | Probability |
|-------|---------|--------------|-------------|
| 0     | 0/8     | 1/70         | 0.01        |
| 1     | 2/8     | 16/70        | 0.23        |
| 2     | 4/8     | 36/70        | 0.51        |
| 3     | 6/8     | 16/70        | 0.23        |
| 4     | 8/8     | 1/70         | 0.01        |

Random guesser: gets 0, 2, 4, 6, or 8 cups right, with the probabilities above

Simulate 1,000 guessers to approximate this probability distribution

In the simulation, a random guesser gets at least \(6/8\) cups right \(\frac{(199+12)}{1000} \approx 0.21\) of the time
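A sketch of that simulation (the exact counts of 199 and 12 depend on the seed, which the slide does not report):

set.seed(1)  # hypothetical seed
correct = sample(c(0, 2, 4, 6, 8), size = 1000, replace = TRUE,
                 prob = c(1, 16, 36, 16, 1) / 70)
mean(correct >= 6)  # share of random guessers with at least 6/8 right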

p-values

If the lady is not able to discriminate milk-tea order, the probability of observing \(6/8\) correct guesses or better is \(0.24\)

p-value: Probability of observing a result equal to or more extreme than what was originally observed

p-values

If the lady is not able to discriminate milk-tea order, the probability of observing \(6/8\) correct guesses or better is \(0.24\)

p-value: Probability of observing a result equal to or more extreme than what was originally observed when the null hypothesis is true

  • Smaller p-values give more evidence against the null

  • Implying the observed value is less likely to have emerged by chance

More formally

Lower one-tailed

\[ p = \Pr_{\theta_0} \left[\widehat \theta \leq \widehat \theta^*\right] \]

Upper one-tailed

\[ p = \Pr_{\theta_0}\left[\widehat \theta \geq \widehat \theta^*\right] \]

More formally

Two-tailed

\[ p = \Pr_{\theta_0}\left[|\widehat \theta - \theta_0| \geq |\widehat \theta^* - \theta_0|\right] \]

Rules of thumb

  • A convention in the social sciences is to claim that something with \(p < 0.05\) is statistically significant

  • Meaning we have enough evidence to reject the null

  • Committing to a significance level \(\alpha\) implies accepting that sometimes we will get \(p < 0.05\) by chance

  • This is a false positive result

Types of error

| Decision              | \(H_0\) true (unobserved)           | \(H_0\) not true (unobserved)        |
|-----------------------|-------------------------------------|--------------------------------------|
| Don't reject \(H_0\)  | True negative                       | False negative (type II error)       |
| Reject \(H_0\)        | False positive (type I error)       | True positive                        |

Hypothesis testing

  • We just computed p-values via permutation testing

  • Congenial with agnostic statistics because we do not need to assume anything beyond how the data was collected

We can also apply the CLT to calculate p-values via normal approximation

Normal-approximation p-values

\(\mathbf{t}\)-statistic

\[ t = \frac{\widehat \theta^* - \theta_0}{\sqrt{\widehat V[\widehat \theta]}} \]

\[ t = \frac{\text{observed} - \text{null}}{\text{standard error}} \]

Normal-approximation p-values

Lower one-tailed:

\[ p = \Phi \left( \frac{\widehat \theta^* - \theta_0}{\sqrt{\widehat V[\widehat \theta]}} \right) = \Phi(t) \]

Upper one-tailed: \(p = 1- \Phi(t)\)

Two-tailed: \(p = 2 \left(1-\Phi(|t|)\right)\)
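By hand in R, for the sample mean with null \(\mu_0 = 0\) (a sketch; here the p-value underflows to 0):

xbar = mean(gss$vote)
se = sd(gss$vote) / sqrt(nrow(gss))
t_stat = (xbar - 0) / se      # observed minus null, over the standard error
2 * (1 - pnorm(abs(t_stat)))  # two-tailed normal-approximation p-value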

In R

Base R

t.test(gss$vote)

    One Sample t-test

data:  gss$vote
t = 105.71, df = 3875, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.7287468 0.7562894
sample estimates:
mean of x 
0.7425181 
t.test(gss$vote)$statistic
     t 
105.71 
t.test(gss$vote)$p.value
[1] 0

In R

Tidyverse

library(infer)

gss %>% 
  t_test(response = vote)
# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1      106.  3875       0 two.sided      0.743    0.729    0.756

 

gss %>% 
  t_test(response = vote) %>% 
  select(estimate, statistic, p_value)
# A tibble: 1 × 3
  estimate statistic p_value
     <dbl>     <dbl>   <dbl>
1    0.743      106.       0

Wrapping up

  • Normal approximation enables estimation AND inference

  • Confidence intervals (standard errors) and p-values follow from CLT but use different logic

  • Report whichever makes more sense for the application

  • More complicated methods will require you to adjust/correct your standard errors or p-values (e.g. clustered standard errors)

Fun stuff: CIs as inverted hypothesis tests

The straightforward null hypothesis is that \(\mu = 0\)

But you can test against whatever null you want

gss %>% t_test(response = vote) %>% select(estimate, statistic, p_value)
# A tibble: 1 × 3
  estimate statistic p_value
     <dbl>     <dbl>   <dbl>
1    0.743      106.       0
gss %>% t_test(response = vote, mu = 0.5) %>% select(estimate, statistic, p_value)
# A tibble: 1 × 3
  estimate statistic   p_value
     <dbl>     <dbl>     <dbl>
1    0.743      34.5 5.37e-228

Fun stuff: CIs as inverted hypothesis tests

The straightforward null hypothesis is that \(\mu = 0\)

But you can test against whatever null you want

So we can commit to a significance level \(\alpha = 0.05\) and then look at a wide range of hypotheses.

Fun stuff: CIs as inverted hypothesis tests

alpha = 0.05

hyps = seq(0, 1, by = 0.001) # 1,001 hypotheses
inverted_ci = hyps %>% 
  map(~t_test(
    x = gss,
    response = vote,
    conf.level = (1-alpha),
    mu = .
  )) %>% 
  bind_rows() %>% 
  select(p_value) %>% 
  mutate(mu = hyps)
inverted_ci
# A tibble: 1,001 × 2
   p_value    mu
     <dbl> <dbl>
 1       0 0    
 2       0 0.001
 3       0 0.002
 4       0 0.003
 5       0 0.004
 6       0 0.005
 7       0 0.006
 8       0 0.007
 9       0 0.008
10       0 0.009
# ℹ 991 more rows

Find the bounds

The 95% CI is the set of null hypotheses we fail to reject at \(\alpha = 0.05\)

inverted_ci %>% 
  filter(p_value > alpha) %>% 
  summarize(
    lower_ci = min(mu),
    upper_ci = max(mu)
  )
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.729    0.756

Compare with t_test

gss %>% t_test(response = vote) %>% select(lower_ci, upper_ci)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.729    0.756