POLI_SCI 403: Probability and Statistics
In your own words, what is a random variable?
In either case (discrete or continuous), they are just functions:
\[ f(x) = \Pr[X = x] \]
The important part is that we can operate on them, meaning we can derive single-number summaries that describe them
Discrete:
\[ E[X] = \sum_x x f(x) \]
Continuous:
\[ E[X] = \int_{-\infty}^{+\infty} xf(x)dx \]
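To make the discrete formula concrete, here is a minimal sketch in Python; the fair die is an arbitrary illustration, not an example from the course:

```python
import numpy as np

# Hypothetical example: a fair six-sided die, so f(x) = 1/6 for x = 1, ..., 6
x = np.arange(1, 7)      # support of X
f = np.full(6, 1 / 6)    # probability mass function f(x)

# E[X] = sum over x of x * f(x)
print(np.sum(x * f))     # 3.5

# A large simulated sample has a mean close to E[X]
rng = np.random.default_rng(403)
draws = rng.choice(x, size=100_000, p=f)
print(draws.mean())      # roughly 3.5
```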
We said a function applied to a random variable yields a random variable
But an operator like the expectation applied to a random variable can be treated as a constant
So we can think of them as (theoretical) one-number summaries of a random variable
Expectations of summations are summations of expectations (ESSE)
If \(a\) is a constant, then \(E[a] = a\)
Extension: \(E[aX] = a E[X]\)
ESSE: \(E[X_1 + \ldots + X_p] = E[X_1] + \ldots + E[X_p]\)
Expectations of weighted summations are summations of weighted expectations (EWSSWE)
EWSSWE: \(E[b_1X_1 + \ldots + b_pX_p] = b_1E[X_1] + \ldots + b_pE[X_p]\)
Let \(X\) and \(Y\) be random variables; then \(\forall a, b, c \in \mathbb{R}\)
\[ E[aX + bY + c] = a E[X] + bE[Y] + c \]
Implication: Expected value is a linear operator
Meaning we can rearrange terms for easier calculation
For other operators, linearity does not hold unless you are explicitly told otherwise
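A minimal simulation sketch of linearity; the particular distributions and constants below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(403)
n = 1_000_000

X = rng.normal(loc=2.0, scale=1.0, size=n)   # E[X] = 2
Y = rng.exponential(scale=3.0, size=n)       # E[Y] = 3
a, b, c = 2.0, -1.5, 4.0

# E[aX + bY + c], approximated by a sample mean
lhs = np.mean(a * X + b * Y + c)

# a E[X] + b E[Y] + c, using the sample means of X and Y
rhs = a * X.mean() + b * Y.mean() + c

# The two agree (up to floating point): linearity holds term by term
print(lhs, rhs)
```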
We can describe a random variable by its moments
\(\mu'_j = E[X^j]\) is the \(j\)th raw moment
So \(E[X] = E[X^1] = \mu'_1\) is the first raw moment
Raw moments describe “location”
Sometimes we also want to characterize “spread”, independently of the expected value
\[ \mu_j = E[(X - E[X])^j] \]
is the \(j\)th central moment
Because they convey spread centered at the expected value
The first central moment is
\[ E[X-E[X]] = E[X] - E[X] = 0 \]
Which is not very useful
The second central moment is
\[ V[X] = E[(X - E[X])^2] \]
Which is called variance
Rearranging gives a more convenient form:
\[ V[X] = E[X^2] - E[X]^2 \]
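A quick numerical check of the two variance formulas, sketched with an arbitrary Poisson example:

```python
import numpy as np

rng = np.random.default_rng(403)
X = rng.poisson(lam=4, size=1_000_000)   # V[X] = 4 for a Poisson(4)

# Definition: V[X] = E[(X - E[X])^2]
v_def = np.mean((X - X.mean()) ** 2)

# Rearranged form: V[X] = E[X^2] - E[X]^2
v_alt = np.mean(X ** 2) - X.mean() ** 2

print(v_def, v_alt)    # both close to 4
print(np.sqrt(v_def))  # standard deviation sigma[X], close to 2
```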
Other central moments are conceptually distinct but not as important for social science applications (more in the lab)
Variance is expressed in squared units, but taking the square root gives a measure in the units of \(X\)
\[\sigma[X] = \sqrt{V[X]}\]
Which is called the standard deviation
Sometimes what we want to convey is the relationship between two random variables.
We can generalize the formula for variance to the bivariate case, which gives the covariance
\[ \text{Cov}[X,Y] = E[(X-E[X])(Y-E[Y])] \]
Alternative formula
\[ \text{Cov}[X,Y] = E[XY] - E[X]E[Y] \]
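The same kind of check works for the two covariance formulas; the construction of \(X\) and \(Y\) below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(403)
n = 1_000_000

X = rng.normal(size=n)
Y = 0.5 * X + rng.normal(size=n)   # Cov[X, Y] = 0.5 by construction

# Definition: Cov[X,Y] = E[(X - E[X])(Y - E[Y])]
cov_def = np.mean((X - X.mean()) * (Y - Y.mean()))

# Alternative formula: Cov[X,Y] = E[XY] - E[X]E[Y]
cov_alt = np.mean(X * Y) - X.mean() * Y.mean()

print(cov_def, cov_alt)   # both close to 0.5
```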
Correlation is rescaled covariance (ranging from \(-1\) to \(1\))
\[ \rho[X,Y] = \frac{\text{Cov}[X,Y]}{\sigma[X] \sigma[Y]} \]
\(\rho\) is Pearson’s correlation
Spearman’s (rank correlation, where \(R[\cdot]\) denotes ranks)
\[ r_s = \rho[R[X], R[Y]] = \frac{\text{Cov}[R[X],R[Y]]}{\sigma[R[X]] \sigma[R[Y]]} \]
Kendall’s
\[ \tau = \frac{(\text{# concordant pairs}) - (\text{# discordant pairs})} {(\text{# total pairs})} \]
concordant: \(x_i > x_j\) and \(y_i > y_j\) OR \(x_i < x_j\) and \(y_i < y_j\)
discordant otherwise
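All three coefficients are implemented in scipy.stats; here is a sketch on simulated data, where the monotone-but-nonlinear relationship is an arbitrary choice for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(403)
X = rng.normal(size=5_000)
Y = np.exp(X) + rng.normal(scale=0.1, size=5_000)   # monotone but nonlinear in X

pearson, _ = stats.pearsonr(X, Y)     # linear association
spearman, _ = stats.spearmanr(X, Y)   # correlation of the ranks R[X], R[Y]
kendall, _ = stats.kendalltau(X, Y)   # concordant vs. discordant pairs

# Spearman and Kendall are near 1 (the relationship is monotone);
# Pearson is lower because the relationship is not linear
print(pearson, spearman, kendall)
```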
Variance rule
\[V[X+Y] = V[X] + \color{purple}{2\text{Cov}[X,Y]} + V[Y]\]
The \(\color{purple}{2\text{Cov}[X,Y]}\) term only drops out when \(\color{purple}{\text{Cov}[X,Y]} = 0\) (see the sketch below)
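A simulation sketch of the variance rule, with negatively correlated variables chosen so the covariance term is visible:

```python
import numpy as np

rng = np.random.default_rng(403)
n = 1_000_000

X = rng.normal(size=n)
Y = -0.8 * X + rng.normal(size=n)   # Cov[X, Y] < 0 by construction

lhs = np.var(X + Y)
rhs = np.var(X) + 2 * np.cov(X, Y, ddof=0)[0, 1] + np.var(Y)

# Equal, and both smaller than V[X] + V[Y]: the covariance term matters
print(lhs, rhs, np.var(X) + np.var(Y))
```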
Meaning?
What does it mean for \(X\) and \(Y\) to be independent?
Knowing the outcome of one random variable provides no information about the probability of any outcome for the other.
\(\rho [X,Y] = 0\) does not imply independence!
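A standard counterexample, sketched in Python: take \(X\) symmetric around zero and \(Y = X^2\). \(Y\) is completely determined by \(X\), yet their correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(403)
X = rng.normal(size=1_000_000)   # symmetric around 0
Y = X ** 2                       # a deterministic function of X, so not independent

# Cov[X, Y] = E[X^3] - E[X] E[X^2] = 0 when X is symmetric around 0
print(np.corrcoef(X, Y)[0, 1])   # approximately 0, despite total dependence
```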

Central moments are centered around the expected value
Example:
\[ V[X] = E[(X - E[X])^2] \]
But we can technically center them on anything we want to
\[ E[(X-\color{purple}c)^2] \]
This is called the Mean Squared Error around \(\color{purple}{c}\)
\[ MSE = E[(X-\color{purple}c)^2] \]
Plugging in \(\color{purple}{c = E[X]}\) minimizes the MSE (cf. Theorems 2.1.23 and 2.1.24)
That’s because you can rearrange to
\[ E[(X-\color{purple}c)^2] = V[X] + (E[X]-\color{purple}c)^2 \]
Substituting \(\color{purple}{c = E[X]}\):
\[ E[(X-\color{purple}{E[X]})^2] = V[X] + (E[X]-\color{purple}{E[X]})^2 \]
\[ = V[X] \]
We can say that \(E[X]\) is the best predictor of \(X\) because it minimizes the MSE
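A small numerical sketch of that claim, again with an arbitrary Poisson example: the sample MSE over a grid of candidate values for \(c\) is smallest near \(E[X]\), and there it equals \(V[X]\).

```python
import numpy as np

rng = np.random.default_rng(403)
X = rng.poisson(lam=4, size=1_000_000)   # E[X] = 4, V[X] = 4

def mse(c):
    """Sample analogue of E[(X - c)^2]."""
    return np.mean((X - c) ** 2)

grid = np.linspace(2, 6, 41)
best = grid[np.argmin([mse(c) for c in grid])]
print(best)                     # close to 4 = E[X]
print(mse(X.mean()), X.var())   # MSE at c = E[X] equals V[X]
```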
This also extends to conditional expectation functions (CEF)
CEF \(E[Y|X]\) minimizes MSE of \(Y\) given \(X\)
Wait hold on
Discrete:
\[ E[Y|X=x] = \sum_y y f_{Y|X}(y|x) \]
Continuous:
\[ E[Y|X=x] = \int_{-\infty}^{+\infty}yf_{Y|X}(y|x)dy \]
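For the discrete case, the CEF can be computed directly from a joint distribution; the joint PMF below is hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical joint PMF f(x, y): rows are x in {0, 1}, columns are y in {0, 1, 2}
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.15, 0.30]])
y_vals = np.array([0, 1, 2])

f_x = joint.sum(axis=1)        # marginal PMF of X
cond = joint / f_x[:, None]    # f_{Y|X}(y|x), one row per value of x

# E[Y|X=x] = sum over y of y * f_{Y|X}(y|x)
cef = cond @ y_vals
print(cef)                     # [1.0, 1.25]: one number for each value of x
```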
Note that we can also have
\[ G_Y(X) = E[Y|X=X] = E[Y|X] \]
To denote the random variable that results from applying the CEF (which is technically many functions) to \(X\)
We can just move on with calling \(E[Y|X]\) the CEF
Linearity (same as with unconditional expectations)
Law of Iterated Expectations: \(E[Y] = E[E[Y|X]]\)
Law of Total Variance:
\[ V[Y] = \underbrace{E[V[Y|X]]}_\text{Avg variability within X} + \underbrace{V[E[Y|X]]}_\text{Variability across X} \]
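A simulation sketch of both laws; the grouped setup below, with \(X\) indexing groups, is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(403)
n = 1_000_000

X = rng.integers(0, 3, size=n)                   # X takes values 0, 1, 2
Y = 2.0 * X + rng.normal(scale=1.0 + X, size=n)  # E[Y|X] = 2X, V[Y|X] = (1 + X)^2

weights   = np.array([(X == x).mean()  for x in range(3)])  # Pr[X = x]
means     = np.array([Y[X == x].mean() for x in range(3)])  # E[Y|X = x]
variances = np.array([Y[X == x].var()  for x in range(3)])  # V[Y|X = x]

# Law of Iterated Expectations: E[Y] = E[E[Y|X]]
print(Y.mean(), np.sum(weights * means))

# Law of Total Variance: V[Y] = E[V[Y|X]] + V[E[Y|X]]
within  = np.sum(weights * variances)                # avg variability within X
between = np.sum(weights * (means - Y.mean()) ** 2)  # variability across X
print(Y.var(), within + between)
```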
CEF \(E[Y|X]\) minimizes MSE of \(Y\) given \(X\)
If we restrict ourselves to a linear functional form \(Y = a + bX\)
Then the following minimizes MSE of \(Y\) given \(X\):
\[ b = \frac{\text{Cov}[X,Y]}{V[X]}, \qquad a = E[Y] - b\,E[X] \]
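A sketch checking those coefficients against a least-squares fit on simulated data; the nonlinear CEF is an arbitrary choice, meant to show that the result is the best linear approximation:

```python
import numpy as np

rng = np.random.default_rng(403)
X = rng.uniform(0, 2, size=200_000)
Y = np.sin(X) + rng.normal(scale=0.3, size=200_000)   # CEF is nonlinear in X

# Best linear predictor coefficients from the moment formulas
b = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
a = Y.mean() - b * X.mean()

# A least-squares fit of Y on X recovers the same line
b_ols, a_ols = np.polyfit(X, Y, deg=1)
print((a, b), (a_ols, b_ols))   # the two pairs coincide up to floating point
```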
Do you need to memorize these properties?
No, but a lot of the theory behind fancy methods relies on playing with the properties of expected values/moments of random variables
Down the line: Averages (and average-like things) have statistical properties that make them good one-number summaries (most of the time)
You need a strong case to not use means or conditional means