
POLI_SCI 403: Probability and Statistics
Bivariate regression
Multivariate regression
Inference (lab)
Regression with matrix algebra (if we have time)
Ingredients for statistical inference
Estimand \(\theta\)
Estimator \(\widehat \theta\)
Data \(X\)
Estimate \(\overline X\)
Quantify uncertainty (confidence intervals, p-values)
\(X \rightarrow \widehat{\theta} \rightarrow \overline{X} \xrightarrow{\text{hopefully!}} \theta\)
\[ \text{MSE} = E[(X-\color{purple}c)^2] \]
Alternative formula
\[ \text{MSE} = V[X] + (E[X] - \color{purple}c)^2 \]
Setting \(c = E[X]\) minimizes MSE
\[ \text{MSE} = V[X] + (E[X] - \color{purple}{E[X]})^2 \]
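A quick numerical check (a sketch with made-up values, using the sample analog of the MSE):

```r
# Made-up values; the average squared distance is smallest at the mean
x = c(2, 4, 7, 9)
mse = function(c) mean((x - c)^2)
mse(mean(x))  # 7.25 (the variance of x with an n denominator)
mse(5)        # 7.50
mse(8)        # 13.50
```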
This is all about one variable at a time!
We usually want to tell more nuanced stories:
Relationships between two variables can be written as a linear equation
\[ y = a + bx \]
| \(y\) | Outcome variable |
| \(x\) | Explanatory variable |
| \(b\) | Slope (\(\frac{\text{rise}}{\text{run}}\)) |
| \(a\) | y-intercept |
This is the smallest number of parameters to draw a line
We can think of intercept and slope as estimands
The same line in population notation, with random variables:
\[ Y = \alpha + \beta X \]
| \(Y\) | Outcome variable |
| \(X\) | Explanatory variable |
| \(\beta\) | Slope (\(\frac{\text{rise}}{\text{run}}\)) |
| \(\alpha\) | y-intercept |
CEF \(E[Y|X]\) minimizes MSE of \(Y\) given \(X\)
If we restrict ourselves to a linear functional form \(Y = a + bX\)
then the following minimize MSE of \(Y\) given \(X\):
\[ a = E[Y] - \frac{\text{Cov}[X,Y]}{V[X]}E[X] \qquad b = \frac{\text{Cov}[X,Y]}{V[X]} \]
So it’s really all about \(\beta\)
Informally, we use regression coefficients (slopes) to determine whether two variables are correlated
Technically, they are related but on a different scale
Regression coefficient: \(\beta = \frac{\text{Cov}[X,Y]}{V[X]}\) \(\Rightarrow\) in units of Y (happiness)
(Pearson’s) correlation: \(\rho = \frac{\text{Cov}[X,Y]}{SD[X] SD[Y]}\) \(\Rightarrow\) [-1, 1] scale
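A small R sketch with made-up data, just to show that the two quantities differ only by scale (none of these numbers come from the slides):

```r
# Made-up data: x could be cookies, y could be happiness
x = c(1, 2, 3, 4, 5)
y = c(1.2, 1.3, 1.7, 1.6, 2.0)

beta = cov(x, y) / var(x)            # slope: units of y per one-unit change in x
rho  = cov(x, y) / (sd(x) * sd(y))   # correlation: lives on a [-1, 1] scale
c(beta = beta, rho = rho, check = cor(x, y))  # rho matches cor()
```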
Before: Assume there is a true parameter that we do not observe (e.g. population mean, ATE)
Now: Assume there is a true line that best describes the relationship between X and Y
There is a best linear predictor that we want to estimate
But how do we find it!?
The best linear predictor is the line that minimizes the distance of each observation to the line
That distance is known as the residual, or error
\[ e_i = (y_i - \widehat y_i) \]
Substituting the fitted line \(\widehat y_i = b_0 + b_1 x_{1i}\):
\[ e_i = (y_i - (b_0 + b_1 x_{1i})) \]
We want to find a vector of coefficients (\(\widehat \beta_0\), \(\widehat \beta_1\)) that minimizes the sum of squared residuals
\[ SSR = \sum_{i=1}^n e_i^2 \]
We could try many lines until we find the smallest SSR
Or use a method called Ordinary Least Squares (OLS)
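A brute-force sketch of "trying many lines" (made-up data), followed by the exact OLS answer from `lm()`:

```r
# Made-up data
x = c(1, 2, 3, 4, 5)
y = c(1.2, 1.3, 1.7, 1.6, 2.0)

ssr = function(b0, b1) sum((y - (b0 + b1 * x))^2)  # sum of squared residuals

# Try many candidate lines and keep the one with the smallest SSR
grid = expand.grid(b0 = seq(0, 2, by = 0.01), b1 = seq(-1, 1, by = 0.01))
grid$ssr = mapply(ssr, grid$b0, grid$b1)
grid[which.min(grid$ssr), ]

coef(lm(y ~ x))  # OLS gives the exact minimizer directly
```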
Estimand
\(\alpha = E[Y] - \frac{\text{Cov}[X,Y]}{V[X]}E[X] \qquad \beta = \frac{\text{Cov}[X,Y]}{V[X]}\)
Estimator
\(\widehat\alpha = \overline Y - \frac{\overline{XY} - \overline{X} \cdot \overline{Y}}{\overline{X^2} - \overline{X}^2} \overline{X} \qquad \widehat{\beta} = \frac{\overline{XY} - \overline{X} \cdot \overline{Y}}{\overline{X^2} - \overline{X}^2}\)
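A sketch of the plug-in estimator with made-up data, checked against `lm()`:

```r
# Made-up data
x = c(1, 2, 3, 4, 5)
y = c(1.2, 1.3, 1.7, 1.6, 2.0)

beta_hat  = (mean(x * y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
alpha_hat = mean(y) - beta_hat * mean(x)

c(alpha_hat = alpha_hat, beta_hat = beta_hat)
coef(lm(y ~ x))  # same numbers
```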
\[ \widehat{y} = \widehat \beta_0 + \widehat \beta_1 x_1 \]
\[ \widehat{\text{happiness}} = \widehat\beta_0 + \widehat\beta_1 \cdot \text{cookies} \]
| | happiness |
|---|---|
| (Intercept) | 1.100* |
| | (0.470) |
| cookies | 0.164+ |
| | (0.076) |
| Num.Obs. | 10 |
| R2 | 0.368 |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
\(\widehat{\text{happiness}} = 1.10 + 0.16 \cdot \text{cookies}\)

On average, one additional cookie increases happiness by 0.16 points
Is 0.16 happiness points per cookie a lot?
We cannot tell without a point of reference
But correlation comes with a reference scale of its own:
| Absolute magnitude | Effect |
|---|---|
| 0.1 | Small |
| 0.3 | Moderate |
| 0.5 | Large |
The 0.16 slope corresponds to \(\rho = 0.61\)
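The slope and the correlation are linked by a change of scale, so the 0.61 is just the 0.16 rescaled by the (unreported here) standard deviations:

\[ \rho = \frac{\text{Cov}[X,Y]}{SD[X]\,SD[Y]} = \frac{\text{Cov}[X,Y]}{V[X]} \cdot \frac{SD[X]}{SD[Y]} = \beta \cdot \frac{SD[\text{cookies}]}{SD[\text{happiness}]} \]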
What if we wanted to include more variables?
Of course we can!
But why would we want to do that?
Level 1: Description of conditional means
Level 2: Statistical inference (needs CIs or p-values)
Level 3: Causal inference (needs assumptions)
My take: You only care about level 2 because you have (at least) a proto-causal story
Meaning there is a focal X and Y relationship we want to measure, but it is “contaminated” by some confounder
We need to “adjust” or “control” for the confounder when estimating the effect of X on Y (more in 3 weeks)
Rewrite with an arbitrary number of variables
Function \(\widehat{g} = \widehat \beta_0 + \widehat \beta_1 X_{[1]} + \widehat \beta_2 X_{[2]} + \ldots + \widehat \beta_K X_{[K]}\)
Such that
\[ \mathbf{\widehat \beta} = (\widehat\beta_0, \widehat\beta_1, \ldots, \widehat\beta_K) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K + 1}}{\text{argmin}} \frac{1}{n} \sum_{i = 1}^n e_i^2 \]
Where \(e_i = (Y_i - (b_0 + b_1 X_{[1]i} + b_2 X_{[2]i} + \ldots + b_K X_{[K]i}))\)
```r
nes16 = timeseries_2016 %>%
  select(
    V162079,  # Feeling thermometer for TRUMP [POST]
    V162230x, # Better if man works and woman takes care of home [POST]
    V162255,  # Is Barack Obama Muslim (yes/no) [POST]
    V161267,  # Respondent Age [PRE]
    V161270,  # Highest level of Education (Years) [PRE]
    V161158x  # Party ID [PRE]
  )
```

```
# A tibble: 4,270 × 6
V162079 V162230x V162255 V161267 V161270 V161158x
<hvn_lbll> <hvn_lbll> <hvn_lbll> <hvn_lbll> <hvn_lbll> <hvn_lbll>
1 85 4 2 29 9 7
2 60 4 1 26 13 6
3 70 1 1 23 9 3
4 60 1 2 58 9 5
5 15 4 1 38 9 3
6 65 4 -8 60 14 5
7 50 4 -8 58 5 1
8 85 1 1 56 9 4
9 70 4 2 45 9 5
10 60 4 -8 30 10 3
# ℹ 4,260 more rows
```
```r
nes16 = nes16 %>%
  mutate(
    # Trump feeling thermometer: treat negative codes and 998 as missing
    ft_trump_post = ifelse(V162079 < 0 | V162079 == 998, NA, V162079),
    # Collapse the 7-point scale: 1-3 -> 1, 4-7 -> 0, negative codes -> NA
    women_at_home = case_when(V162230x < 0 ~ NA,
                              V162230x <= 3 ~ 1,
                              V162230x <= 7 ~ 0),
    # 1 if respondent says Obama is Muslim (all other codes, incl. nonresponse, -> 0)
    obamamuslim = ifelse(V162255 == 1, 1, 0),
    # Age in years; negative codes -> NA
    age = ifelse(V161267 < 0, NA, V161267),
    # Age rescaled so that 0 corresponds to 18 years old
    age0 = age - 18,
    # 1 if 9+ years of education, 0 if 8 or fewer; 90+ and negative codes -> NA
    educ_hs = case_when(V161270 < 0 ~ NA,
                        V161270 >= 90 ~ NA,
                        V161270 >= 9 ~ 1,
                        V161270 <= 8 ~ 0),
    # 1 if party ID is 5-7, 0 if 1-4; negative codes -> NA
    republican = case_when(V161158x < 0 ~ NA,
                           V161158x <= 4 ~ 0,
                           V161158x <= 7 ~ 1)
  )
```

```
# A tibble: 4,270 × 7
ft_trump_post women_at_home obamamuslim age age0 educ_hs republican
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 85 0 0 29 11 1 1
2 60 0 1 26 8 1 1
3 70 1 1 23 5 1 0
4 60 1 0 58 40 1 1
5 15 0 1 38 20 1 0
6 65 0 0 60 42 1 1
7 50 0 0 58 40 0 0
8 85 1 1 56 38 1 0
9 70 0 0 45 27 1 1
10 60 0 0 30 12 1 0
# ℹ 4,260 more rows
```
\(\widehat{\texttt{ft_trump_post}} = \widehat\beta_0 + \widehat\beta_1 \cdot \texttt{obamamuslim}\)
\(\widehat{\texttt{ft_trump_post}} = 32.17 + 35.04 \cdot \texttt{obamamuslim}\)
| term | estimate | std.error | p.value |
|---|---|---|---|
| (Intercept) | 32.17 | 0.61 | 0 |
| obamamuslim | 35.04 | 1.15 | 0 |
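A sketch of how a table like this can be produced (the original code is not shown; `broom::tidy()` is one way to get these columns):

```r
# Bivariate fit; tidy() returns term, estimate, std.error, statistic, p.value
fit1 = lm(ft_trump_post ~ obamamuslim, data = nes16)
broom::tidy(fit1)
```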
What is the average feeling thermometer for someone who does not believe Obama is a Muslim?
What is the average feeling thermometer for someone who does believe Obama is a Muslim?
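Reading both off the fitted line: \(32.17\) for someone who does not believe it (obamamuslim = 0), and \(32.17 + 35.04 = 67.21\) for someone who does (obamamuslim = 1).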
Compare with
| Absolute magnitude | Effect |
|---|---|
| 0.1 | Small |
| 0.3 | Moderate |
| 0.5 | Large |
| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| (Intercept) | 32.168* | 32.763* | 23.090* | 20.367* | 21.338* |
| | (0.611) | (2.104) | (1.574) | (0.581) | (2.157) |
| obamamuslim | 35.037* | 34.852* | 34.715* | 22.742* | 22.225* |
| | (1.147) | (1.154) | (1.162) | (0.996) | (1.016) |
| educ_hs | | -0.557 | | | -6.216* |
| | | (2.135) | | | (1.788) |
| age | | | 0.185* | | 0.101* |
| | | | (0.030) | | (0.025) |
| republican | | | | 37.533* | 37.539* |
| | | | | (0.913) | (0.933) |
| Num.Obs. | 3632 | 3601 | 3536 | 3616 | 3496 |
| R2 | 0.204 | 0.203 | 0.214 | 0.459 | 0.462 |
| * p < 0.05 | | | | | |
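A sketch of how the five columns can be fit and displayed side by side (object and function choices here are illustrative, not the original code):

```r
library(modelsummary)

# One bivariate model, models adding each control, then all controls at once
m1 = lm(ft_trump_post ~ obamamuslim, data = nes16)
m2 = lm(ft_trump_post ~ obamamuslim + educ_hs, data = nes16)
m3 = lm(ft_trump_post ~ obamamuslim + age, data = nes16)
m4 = lm(ft_trump_post ~ obamamuslim + republican, data = nes16)
m5 = lm(ft_trump_post ~ obamamuslim + educ_hs + age + republican, data = nes16)

# Estimates with standard errors in parentheses, starred at p < 0.05
modelsummary(list(m1, m2, m3, m4, m5), stars = c("*" = 0.05))
```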
We are (presumably) reducing bias, but increasing variance (standard errors)
Plugging the coefficients into the equations:
\(\widehat{\texttt{ft_trump_post}} = 32.17 + 35.04 \cdot \texttt{obamamuslim}\)
\(\widehat{\texttt{ft_trump_post}} = 32.76 + 34.85 \cdot \texttt{obamamuslim} -0.56 \cdot \texttt{educ_hs}\)
\(\widehat{\texttt{ft_trump_post}} = 23.09 + 34.72 \cdot \texttt{obamamuslim} + 0.19 \cdot \texttt{age}\)
\(\widehat{\texttt{ft_trump_post}} = 20.37 + 22.74 \cdot \texttt{obamamuslim} + 37.53 \cdot \texttt{republican}\)
\(\widehat{\texttt{ft_trump_post}} = 21.34 + 22.23 \cdot \texttt{obamamuslim} -6.22 \cdot \texttt{educ_hs} +\\ 0.10 \cdot \texttt{age} + 37.54 \cdot \texttt{republican}\)
Coefficients now need to be interpreted as marginal means or marginal slopes
These only make sense if you think at least one variable is a focal point
What if we believed the effect of obamamuslim varies depending on attitudes about gender roles?
\[ \begin{align*} y = & \beta_0 + \beta_1 \mathtt{obamamuslim} + \beta_2 \mathtt{\text{women_at_home}} + \\ & \beta_3 \mathtt{obamamuslim} \times \mathtt{\text{women_at_home}} + \varepsilon \end{align*} \]
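In R, the interaction can be fit with something like this sketch (the original code is not shown in this section):

```r
# The * operator expands to both main effects plus their product term
fit_int = lm(ft_trump_post ~ obamamuslim * women_at_home, data = nes16)
summary(fit_int)
```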
The linear model setup is flexible
\[\widehat{y} = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2 + \ldots + \widehat \beta_K x_K\]
You can technically put whatever you want in a regression as long as \(\text{observations} > \text{variables}\)
But for statistical or causal inference, anything with more than 2-3 control variables doesn’t make much sense


More dimensions \(\rightarrow\) more likely to see:
Extrapolation: Fitting the line beyond the actual range of the data
Interpolation: Filling in gaps within the actual range
Illusion of learning from empty space!
AM call this overfitting
Regression as the go-to method to estimate conditional means
Best Linear Unbiased Estimator (BLUE)
Multivariate regression only makes sense with a marginal means interpretation
Which requires a (proto-)causal story
And stops making sense with too many controls
Outcome: \(n \times 1\) column vector
\[Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}\]
Explanatory variables: \(n \times (k+1)\) matrix
\[X= \begin{bmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ \vdots & \vdots & \dots & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{bmatrix}\]
\(x_{ij}\) is the \(i\)-th observation of the \(j\)-th explanatory variable.
Let’s say we have 173 observations (n = 173) and 2 explanatory variables (k = 2)
Equation: \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\)
Matrix form: \[\begin{aligned} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{173} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{173,1} & x_{173,2} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}\end{aligned} \]
\[\hat{\beta} = (X'X)^{-1}X'Y\]
“X prime X inverse, X prime Y”
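The printed vectors and matrix below appear to be the built-in `mtcars` data (`mpg` as the outcome; a constant, `hp`, and `wt` as the columns of \(X\)). Assuming that, a sketch of the calculation:

```r
# Build Y (n x 1) and X (n x (k+1)), then apply the OLS formula
Y = matrix(mtcars$mpg)
X = cbind(constant = 1, mtcars$hp, mtcars$wt)

beta_hat = solve(t(X) %*% X) %*% t(X) %*% Y  # "X prime X inverse, X prime Y"
beta_hat

coef(lm(mpg ~ hp + wt, data = mtcars))  # compare
```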
```
     [,1]
[1,] 21.0
[2,] 21.0
[3,] 22.8
[4,] 21.4
[5,] 18.7
[6,] 18.1
[7,] 14.3
[8,] 24.4
[9,] 22.8
[10,] 19.2
[11,] 17.8
[12,] 16.4
[13,] 17.3
[14,] 15.2
[15,] 10.4
[16,] 10.4
[17,] 14.7
[18,] 32.4
[19,] 30.4
[20,] 33.9
[21,] 21.5
[22,] 15.5
[23,] 15.2
[24,] 13.3
[25,] 19.2
[26,] 27.3
[27,] 26.0
[28,] 30.4
[29,] 15.8
[30,] 19.7
[31,] 15.0
[32,] 21.4
```
```
      constant
[1,] 1 110 2.620
[2,] 1 110 2.875
[3,] 1 93 2.320
[4,] 1 110 3.215
[5,] 1 175 3.440
[6,] 1 105 3.460
[7,] 1 245 3.570
[8,] 1 62 3.190
[9,] 1 95 3.150
[10,] 1 123 3.440
[11,] 1 123 3.440
[12,] 1 180 4.070
[13,] 1 180 3.730
[14,] 1 180 3.780
[15,] 1 205 5.250
[16,] 1 215 5.424
[17,] 1 230 5.345
[18,] 1 66 2.200
[19,] 1 52 1.615
[20,] 1 65 1.835
[21,] 1 97 2.465
[22,] 1 150 3.520
[23,] 1 150 3.435
[24,] 1 245 3.840
[25,] 1 175 3.845
[26,] 1 66 1.935
[27,] 1 91 2.140
[28,] 1 113 1.513
[29,] 1 264 3.170
[30,] 1 175 2.770
[31,] 1 335 3.570
[32,] 1 109 2.780
```
Compare: