What is %>%?
A clear environment = a clear mind!
How to create a chunk?
How to run a line of code?
Generating data is a powerful skill used to model different situations, allowing for statistical inference. For example, based on distributions we can predict the probability of a specific value occurring.
Simple things first. For example, suppose you have the task of generating a random number between 0 and 1. How would a computer do it? Random number generation is a slightly counterintuitive process. Basically, the computer draws numbers from a uniform distribution. Let’s try it out.
Run this chunk a couple of times:
runif(10)
[1] 0.88061036 0.13990718 0.69175141 0.13979542 0.46188173 0.55763305
[7] 0.76615103 0.65960307 0.26947303 0.06477498
If you’d like to control the random generation process, you can use the set.seed() function. Try this out:
set.seed(123)
runif(10)
set.seed(123)
runif(10)
[1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
[8] 0.8924190 0.5514350 0.4566147
[1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
[8] 0.8924190 0.5514350 0.4566147
You can adjust the boundaries however you would like. Let’s simulate a million observations.
set.seed(123)
uniform_data <- runif(n = 1000000, min = -3, max = 3)
Now, visualize what we got. It looks like a rectangle! In other words, every number is equally likely to appear. And that’s what the uniform distribution is.
library(tidyverse)
ggplot(data.frame(uniform_data), aes(x = uniform_data)) +
geom_histogram(binwidth = 0.25, boundary = 0, closed = "right") +
scale_x_continuous(breaks = seq(-5, 5, 1), limits = c(-5, 5))
Let’s consider a less theoretical example. You’d like to assign a student to next week’s presentation.
<- c("student_1", "student_2", "student_3", "student_4", "student_5", "student_6") students_list
To do this, you can use the sample() function. You would like each student to appear with the same probability as any other student.
sample(students_list, size = 1)
[1] "student_5"
How fair is this random assignment? You can test it using R! To do this, we need to collect data. In our example, we can generate the results easily. Take a look at how we modified sample(): the replace = T argument puts the student back after they are drawn.
<- data.frame("student" = sample(students_list, size = 10000, replace = T))
database head(database)
student
1 student_4
2 student_4
3 student_3
4 student_2
5 student_2
6 student_4
Now, check out the graph! Without adjustment, sample() essentially follows a uniform distribution.
ggplot(database) +
geom_histogram(aes(x = student), stat = "count")
However, most quantities in the real world tend to have higher probabilities around certain values, almost like gravity around a specific point. For instance, income in the United States is not uniformly distributed: a handful of people are extremely rich, many are very poor, and most are clustered around an average.
One such distribution is the “bell curve”, or normal distribution. Let’s generate normally distributed data and visualize it.
set.seed(1234)
normal_data <- rnorm(1000)

normal_plot <- ggplot(data.frame(normal_data), aes(x = normal_data)) +
  geom_histogram(color = "white")
normal_plot
Now, let’s calculate and plot the average of our data. It’s centered at 0! This is the first parameter of the normal distribution.
normal_plot + geom_vline(xintercept = mean(normal_data), color = "red")
But there is another parameter of the normal distribution: the standard deviation (sd, or \(\sigma\)). In simple terms, it measures the spread of the data around the average. For example, let’s highlight ±1 sd from the average. It generally covers 68% of observations.
normal_plot +
  geom_vline(xintercept = mean(normal_data) + sd(normal_data), color = "blue") +
  geom_vline(xintercept = mean(normal_data) - sd(normal_data), color = "blue")
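To check that the 68% figure roughly holds in our simulated data, we can count directly. This is a quick check on normal_data, not part of the plot above:

mean(abs(normal_data - mean(normal_data)) <= sd(normal_data))  # share of observations within 1 sd of the mean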
In statistics we are often interested in finding the “location” of 95% of observations. They lie roughly within ±2 sd.
normal_plot +
  geom_vline(xintercept = mean(normal_data) + 2*sd(normal_data), color = "blue") +
  geom_vline(xintercept = mean(normal_data) - 2*sd(normal_data), color = "blue")
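If you want the exact multiplier instead of the rule-of-thumb 2, R’s q-functions (quantiles, the inverse of the CDF we will meet shortly) provide it. This is an optional aside, again using our simulated normal_data:

qnorm(0.975)  # about 1.96: the cutoff that leaves 2.5% in each tail
mean(abs(normal_data - mean(normal_data)) <= 2 * sd(normal_data))  # empirical share within 2 sd, close to 0.95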
Mathematicians, as usual, came up with a formula to represent this kind of distribution abstractly. I don’t want to scare you, so we won’t consider the formula here. However, it allows us to draw the graph below. This is a generalized version of the normal distribution, called the probability density function (PDF).
ggplot() +
stat_function(fun = dnorm) +
geom_vline(aes(xintercept = 1, color = "1 SD")) +
geom_vline(aes(xintercept = -1, color = "1 SD")) +
geom_vline(aes(xintercept = 2, color = "2 SD")) +
geom_vline(aes(xintercept = -2, color = "2 SD")) +
labs(title = "Normal Distribution (mean = 0, sd = 1)",
subtitle = "Probability Density Function",
color = "",
y = "Probability") +
scale_x_continuous(breaks = seq(-5, 5, by = 1), limits = c(-5, 5))
Using this, we can, say, calculate the probability of drawing “0” from a normal distribution with \(\mu = 0\) and \(\sigma = 1\)
dnorm(0, mean = 0, sd = 1)
[1] 0.3989423
Let’s check it on the normal_data we generated previously. I suggest changing binwidth to make the calculation straightforward. Does it remind you of the PDF?
Do you remember what the binwidth argument does?
normal_plot +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(-5, 5, by = 1), limits = c(-5, 5))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Slightly less than 40% of observations fall into the “0” bin. It matches our expectations! In other words, knowing the average and the standard deviation of a normal distribution, we can predict how many observations will take a given value.
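To verify this without reading the plot, we can count the simulated observations that fall within a width-1 bin centered at 0. A quick check:

mean(normal_data > -0.5 & normal_data <= 0.5)  # share of observations in the bin around 0, a bit under 0.40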
What if we are interested in how many observations are 0 or less? For this we can use the cumulative distribution function (CDF).
Any guesses at this point? Take a look at the PDF!
The cumulative distribution function looks like this: a kind of S shape.
ggplot() +
geom_function(fun = pnorm) +
xlim(-5,5) +
labs(title = "Normal Distribution (mean = 0, sd = 1)",
subtitle = "Cumulative Distribution Function",
y = "Probability")
It is slightly harder to calculate more precise values. For example, what share of observations is 1 or less? About 84%!
pnorm(1, mean = 0, sd = 1)
[1] 0.8413447
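We can compare this theoretical 84% with our simulated normal_data. A quick empirical check:

mean(normal_data <= 1)  # share of simulated observations that are 1 or less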
On the graph, it would look like this:
ggplot(NULL, aes(c(-5,5))) +
geom_area(stat = "function", fun = dnorm, xlim = c(-1.35, 5)) +
geom_area(stat = "function", fun = dnorm, fill="sky blue", xlim = c(-5, 1)) +
labs(title = "Normal Distribution (mean = 0, sd = 1)",
x = "",
y = "Probability") +
scale_x_continuous(breaks = seq(-5, 5, by = 1), limits = c(-5, 5))
Conversely, to know how many observations are 1 or more, we can use a trick:
1 - pnorm(1, mean = 0, sd = 1)
[1] 0.1586553
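Alternatively, pnorm() has a lower.tail argument that performs the same subtraction for you:

pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)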
Often you’ll want to generate a column that only has two values: yes/no, treated/untreated, before/after, big/small, red/blue, etc. You’ll also likely want to control the proportions (25% treated, 62% blue, etc.).
For instance, we can ask R to do the following twenty times: flip a fair coin one hundred times, and count the number of tails.
rbinom(n = 20, size = 100, prob = 0.5)
[1] 51 45 54 54 43 57 55 49 44 54 46 52 46 54 47 45 51 52 52 53
With the prob argument, we can implement unfair coins:
rbinom(n = 20, size = 100, prob = 0.9)
[1] 94 90 87 90 94 92 93 84 92 91 93 95 93 94 86 93 90 87 92 89
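If what you need is the two-valued column itself rather than counts, set size = 1 so that every draw is a single flip. A sketch that would produce a 0/1 treatment indicator with roughly 25% treated (the name treated is just for illustration):

treated <- rbinom(n = 20, size = 1, prob = 0.25)  # 0 = untreated, 1 = treated
treated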
For instance, pretend you want to simulate an election. According to the latest polls, one candidate has an 80% chance of winning. You want to randomly choose a winner based on that chance. Here’s how to do that with sample():
set.seed(1234)
<- c("Person 1", "Person 2")
candidates <- data.frame(winner = sample(candidates,
fake_elections size = 1000,
prob = c(0.8, 0.2),
replace = TRUE))
ggplot(fake_elections, aes(x = winner)) +
geom_histogram(stat = "count")
Each distribution can do more than generate random numbers (the r prefix). We can compute the cumulative probability with pbinom(), punif(), and pnorm(), and the density (the value of the PDF) with dbinom(), dunif(), and dnorm().
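As a quick illustration of those prefixes, here is a sketch using the same parameters as the examples above:

pbinom(50, size = 100, prob = 0.5)   # probability of 50 or fewer tails in 100 fair flips
dbinom(50, size = 100, prob = 0.5)   # probability of exactly 50 tails
punif(0, min = -3, max = 3)          # probability of drawing 0 or less from our uniform
dunif(0, min = -3, max = 3)          # density of that uniform at 0 (flat everywhere: 1/6)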
Let’s work with real data a bit. Download the results of the World Happiness Report 2024 from the following URL. Load it into R!
h_report <- read.csv("data/happiness_report_2024.csv")
Visualize the happiness_score variable.
ggplot(h_report, aes(x = ...)) +
geom_histogram(color = "white", binwidth = 1) +
labs(title = "Distribution of happiness_score variable")
Now, let’s sample 80 observations from the dataset. Basically, we are randomly extracting rows from the main dataset. Compare sample() and sample_n().
set.seed(1234)
h_report_sample <- sample_n(h_report, size = 80)
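For comparison, the same thing can be done with base R’s sample() applied to row indices. A sketch (the object name h_report_sample_base is just for illustration):

h_report_sample_base <- h_report[sample(nrow(h_report), size = 80), ]  # base R equivalent of sample_n()
# slice_sample(h_report, n = 80) plays the same role in current dplyr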
Now, compare the graph of the original data with that of its sample.
ggplot(h_report_sample, aes(x = happiness_score)) +
geom_histogram(color = "white", binwidth = 1) +
labs(title = "Distribution of Sample of happiness_score variable")
We can check basic characteristics of those distributions, for example their averages. They are pretty similar!
mean(h_report_sample$happiness_score)
mean(h_report$happiness_score)
[1] 5.487167
[1] 5.527572
Alternatively, we can check the standard deviations. They are quite similar too.
sd(h_report_sample$happiness_score)
sd(h_report$happiness_score)
[1] 1.210437
[1] 1.17069
If we assume that the true average \(\mu\) is equal to 5.53, then our estimate \(\hat{\mu}\), taken from the average of the data \(\bar{X}\), is 5.49. In other words, even with limited access to data we were able to figure out the average and sd of the “real” distribution.
Now, briefly, let’s check how this matches our theoretical assumptions about how the world works. We assume that happiness scores are normally distributed. Thus, even with limited sample knowledge (N = 80) we can attempt to generalize to the whole population (N = 143). Let’s, say, calculate the probability that a given country receives a happiness score of 6.
dnorm(x = 6,
mean = mean(h_report_sample$happiness_score),
sd = sd(h_report_sample$happiness_score))
[1] 0.3012936
Now, let’s check the real data. To make things easier, we add a count of countries to each bin.
ggplot(h_report, aes(x = happiness_score)) +
geom_histogram(color = "white", binwidth = 1) +
stat_bin(binwidth = 1, geom = "text", aes(label = ..count..), vjust = -0.5)
We got \(\approx\) 0.34, while we predicted 0.30. Not bad for such limited access to data!
49/143
[1] 0.3426573
You have access to the diamonds dataset.
data(diamonds)
What are the variables in this dataset?
Using ggplot(), draw a histogram of the variable depth. Diamond depth describes its proportions; the lower, the better.
ggplot(diamonds, aes(x=...))+
geom_histogram()
Does it remind you of a normal distribution?
mean(diamonds$depth)
sd(diamonds$depth)
[1] 61.7494
[1] 1.432621
Draw a probability density function (dnorm()).
ggplot() +
stat_function(fun = ..., args = list(mean = mean(...), sd = sd(...))) +
xlim(40, 80)
Sample 1000 observations.
set.seed(123)
diamonds_sample = sample_n(diamonds, size = 1000)
Compare average and standard deviation. Are they comparable with the original data?
mean(...)
sd(...)
In your free time, feel free to experiment with how few observations you need to get more or less reliable parameters of a normal distribution!
To get a better measure of \(\hat{\mu}\) we can apply bootstrapping. To put it simply, we are going to repeat the sampling process several times, and each time we draw a random sample from the dataset we record its average. First, we create a dataframe that we’ll use to store those averages.
sample_averages <- data.frame() # simply create an empty dataframe
Using loops, you can repeat actions several times. This is something programming is quite helpful for! Explore how the for loop below works.
- for indicates the beginning of the loop
- i stands for the index of an iteration
- in 1:10 specifies the range of values from 1 to 10
for(i in 1:10){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Now, iterate (i.e., repeat) the sampling process!
for(i in 1:100){
  temporary_sample <- sample_n(h_report, size = 80) # sample 80 observations
  temporary_sample_average <- mean(temporary_sample$happiness_score) # calculate the sample average
  sample_averages <- rbind(sample_averages, temporary_sample_average) # add the sample average to the df
}
colnames(sample_averages) = "average"
head(sample_averages)
average
1 5.444772
2 5.782780
3 5.351790
4 5.519500
5 5.477656
6 5.479902
Take a look at the average of the collected averages. Did it get closer to the real parameter?
mean(sample_averages$average)
[1] 5.531569
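For what it’s worth, the same resampling can be written without an explicit loop using base R’s replicate(). A minimal sketch, assuming h_report is loaded as above (the name boot_means is just for illustration):

set.seed(1234)
boot_means <- replicate(100, mean(sample_n(h_report, size = 80)$happiness_score))  # 100 sample averages
mean(boot_means)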
In the h_report dataframe, explore the freedom variable. Check its descriptive statistics.
Draw a histogram of this variable using ggplot. Indicate the average.
ggplot(...) +
geom_histogram(aes(x = ...)) +
geom_vline(xintercept = mean(..., na.rm = T))
Calculate the average.
mean(..., na.rm = T)
Using bootstrapping, get the comparable average.
First, create a dataframe to store the results of the averages.
bootsrap_averages = ...
Finish this for loop so it saves the average of the sampled data. Repeat the loop 10 times. Limit the size of sample_n to 50.
...(i in 1:...){
  temporary_sample <- sample_n(h_report, size = ...)
  temporary_sample_average <- ...
  bootsrap_averages <- rbind(bootsrap_averages, temporary_sample_average)
}
colnames(bootsrap_averages) = "average"
Compare the average of h_report$freedom and bootsrap_averages$average. Are they different?
Repeat the comparison, but change the number of iterations to 100. Did the result improve?
It is a frequent situation when working with data that you want to perform the same several actions on a given dataset. For example, you may want to calculate the average for a specific set of variables and then store it in a separate dataframe describing those variables. In this case, custom functions are incredibly helpful.
Explore the function below. It stores the averages for the happiness_score, life_expectancy and freedom variables.
calculate_average <- function(dataframe){

  average_happiness = mean(dataframe$happiness_score)
  average_expectancy = mean(dataframe$life_expectancy, na.rm = T)
  average_freedom = mean(dataframe$freedom, na.rm = T)

  average_output = data.frame(average_happiness, average_expectancy, average_freedom)
  return(average_output) # This is what the function returns as a result
}
Try it out.
calculate_average(h_report)
average_happiness average_expectancy average_freedom
1 5.527572 0.520843 0.6205882
Now, we can easily calculate those averages anytime we sample from the dataset.
calculate_average(h_report_sample)
average_happiness average_expectancy average_freedom
1 5.487167 0.5148118 0.6285374
Custom functions save a lot of time, especially if you have a non-trivial task. Write a function that calculates the median of the happiness_score, life_expectancy and freedom variables in a given dataframe.
calculate_median <- function(...){

  ...

  median_output = data.frame(...)
  return(median_output) # This is what the function returns as a result
}
And now try it out!
calculate_median(h_report)
Write a function that automates the calculation of average happiness for a selected group of countries from the regions below.
<- c("Kazakhstan", "Uzbekistan", "Armenia")
post_soviet_countries <- c("Venezuela", "Argentina", "Paraguay") latin_america_countries
Use the draft below for your function.
average_happiness = ...(dataframe, countries_list){

  dataframe %>%
    filter(country_name %in% ...) %>%
    summarize(average = mean(...)) %>%
    ...()
}
average_happiness(h_report, post_soviet_countries) # Note the difference in (i) countries list
average_happiness(h_report_sample, latin_america_countries) # and (ii) in dataframes we use
Sometimes it is easier to use the apply family of functions. For example, we can easily calculate an average for selected columns using the apply() function.
h_report %>%
  select(happiness_score, log_gdp, social_support) %>%
  apply(2, mean, na.rm = TRUE)
happiness_score log_gdp social_support
5.527572 1.378779 1.134323
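Base R also has a dedicated shortcut for column means. A sketch of the equivalent call, in case you prefer it to apply():

h_report %>%
  select(happiness_score, log_gdp, social_support) %>%
  colMeans(na.rm = TRUE)  # same result as apply(2, mean, na.rm = TRUE)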
Moreover, we can calculate statistics for observations (i.e., rows), not variables. For example, we can find the minimum value across all variables for a specific country. As you might notice, the second argument of the apply() function specifies whether we want to manipulate rows or columns, 1 or 2 respectively.
h_report %>%
  filter(country_name == "Azerbaijan") %>%
  apply(1, min, na.rm = TRUE)
[1] "0.1115664"
Notice that the result above is printed as a character string: apply() first converts the data to a matrix, and because the row mixes text (country_name) and numbers, everything is coerced to text. However, most importantly, the apply() function is compatible with custom functions. For example, say you want to standardize a variable so it ranges from 0 to 1. You have the following function.
standartize = function(x){
  (x - min(x, na.rm = T)) / (max(x, na.rm = T) - min(x, na.rm = T))
}
Now, you can apply it over a number of variables.
h_report %>%
  select(social_support, freedom) %>%
  apply(2, standartize) %>%
  head()
social_support freedom
[1,] 0.9726830 0.9953240
[2,] 0.9405118 0.9529831
[3,] 1.0000000 0.9480820
[4,] 0.9284272 0.9709709
[5,] 0.9358299 0.7430091
[6,] 0.9044380 0.8398647
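As a side note, the scales package (installed together with ggplot2) already provides rescale(), which performs the same 0-to-1 standardization by default. A sketch of the equivalent, if you prefer not to write your own function:

h_report %>%
  select(social_support, freedom) %>%
  apply(2, scales::rescale) %>%  # rescale() defaults to the range c(0, 1)
  head()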
It is quite often the case in programming that the same operation can be done in multiple ways. Compare the map() function from the tidyverse with the apply() function. What is the difference?
h_report %>%
  select(happiness_score, log_gdp, social_support) %>%
  map(mean, na.rm = T)
$happiness_score
[1] 5.527572
$log_gdp
[1] 1.378779
$social_support
[1] 1.134323
Now, take a look at the code below.
h_report %>%
  select(happiness_score, log_gdp, social_support) %>%
  map_df(mean, na.rm = T)
# A tibble: 1 × 3
happiness_score log_gdp social_support
<dbl> <dbl> <dbl>
1 5.53 1.38 1.13
Furthermore, it is compatible with custom functions, too.
h_report %>%
  select(happiness_score, log_gdp, social_support) %>%
  map_df(standartize) %>%
  head()
# A tibble: 6 × 3
happiness_score log_gdp social_support
<dbl> <dbl> <dbl>
1 1 0.861 0.973
2 0.974 0.891 0.941
3 0.964 0.879 1
4 0.934 0.877 0.928
5 0.934 0.842 0.936
6 0.930 0.888 0.904
Generally, this is an extensive toolkit to learn. To sum up, the apply() functions belong to base R, and map() is an addition from the tidyverse. If you are coding in a tidy way, I suggest sticking to the tidy functions.
Compare apply() and map() to summarize_all() and summarize().
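As a starting point for that comparison, here is a sketch of the summarize() route using across(), the current tidyverse idiom (summarize_all() is its older equivalent):

h_report %>%
  summarize(across(c(happiness_score, log_gdp, social_support), ~ mean(.x, na.rm = TRUE)))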
Finish a function that calculates an average based on two columns, namely social_support and corruption.
average_of_two = function(dataset){
  support_average = mean(dataset$..., na.rm = T)
  corruption_average = mean(..., na.rm = T)

  average = mean(c(...))
  return(average)
}
average_of_two(h_report)
Do you remember the mtcars data? Load it:
cars_information <- mtcars
Now, using apply(), calculate the average of each column.
Now, using map(), calculate the median of each column.
Use the loop draft below to sample 10 cars from the cars dataset 15 times. Each time, apply the map_df() function to calculate the average mpg and wt, and store the result in the sample_statistics dataframe.
sample_statistics = data.frame()

for(i in 1:...){
  temporary_sample <- sample_n(cars_information, size = ...)

  temporary_sample_average <- temporary_sample %>%
    select(...) %>%
    ...(mean)

  sample_statistics <- rbind(sample_statistics, ...)
}
sample_statistics
- Stack Overflow
- Your peers
- The team of this math camp
- Standard deviation, PDF and CDF do not scare me
- I know what sampling is, and on top of this I can use bootstrapping
- I can understand custom functions and loops
- What R, RStudio, Quarto and Markdown are
- Creating and working with objects, vectors and dataframes
- Logical operators
- What descriptive statistics is
- Data wrangling, including mutate(), select(), filter(), summarize(), case_when(), etc.
- Data visualization: now we know what histograms and scatterplots are
- We can load data in R and work with it!
- Georgia State University, Andrew Young School of Policy Studies, Program Evaluation, https://evalsp24.classes.andrewheiss.com/
- UT Austin, Department of Government, Methods Camp, https://methodscamp.github.io/
- Harvard University, Department of Government, Math Prefresher, https://iqss.github.io/prefresher/