Sampling and simulation

Published: September 20, 2024

Before we start

  • What is %>%?

  • A clear environment = a clear mind!

  • How to create a chunk?

  • How to run a line of code?

Download script

Random sampling from distributions

Generating data is a powerful skill: it lets us model different situations and perform statistical inference. For example, based on a distribution we can predict the probability that a specific value occurs.

Uniform distribution

Simple things first. Suppose you have the task of generating a random number between 0 and 1. How would a computer do it? Random number generation is a slightly counterintuitive process. Essentially, the computer draws numbers from a uniform distribution. Let’s try it out.

Run this chunk a couple of times:

runif(10)
 [1] 0.88061036 0.13990718 0.69175141 0.13979542 0.46188173 0.55763305
 [7] 0.76615103 0.65960307 0.26947303 0.06477498

If you’d like to control the random generation process, you can use the set.seed() function. Try this out:

set.seed(123)
runif(10)

set.seed(123)
runif(10)
 [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
 [8] 0.8924190 0.5514350 0.4566147
 [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
 [8] 0.8924190 0.5514350 0.4566147

You can adjust the boundaries however you like. Let’s simulate a million observations.

set.seed(123)
uniform_data <- runif(n = 1000000, min = -3, max = 3)

Now, let’s visualize what we got. It looks like a rectangle! In other words, every number is equally likely to appear, and that is exactly what the uniform distribution is.

library(tidyverse)

ggplot(data.frame(uniform_data), aes(x = uniform_data)) +
  geom_histogram(binwidth = 0.25, boundary = 0, closed = "right") +
  scale_x_continuous(breaks = seq(-5, 5, 1), limits = c(-5, 5))

Let’s consider a less theoretical example. Suppose you’d like to assign a student to next week’s presentation.

students_list <- c("student_1", "student_2", "student_3", "student_4", "student_5", "student_6")

To do this, you can use the sample() function. You would like each student to appear with the same probability as any other student.

sample(students_list, size = 1)
[1] "student_5"

How fair is this random assignment? You can test it using R! To do this, we need to collect data, and in our example we can generate the results easily. Take a look at how we modified sample(): the replace = T argument puts each student back into the pool after they are drawn.

database <- data.frame("student" = sample(students_list, size = 10000, replace = T))
head(database)
    student
1 student_4
2 student_4
3 student_3
4 student_2
5 student_2
6 student_4

Now, check out the graph! Without any adjustment, sample() draws every student with equal probability; in essence it follows a uniform distribution.

ggplot(database) +
  geom_bar(aes(x = student))
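
As a quick numeric check of the same idea, you can tabulate the draws: each student should come up roughly 1/6 of the time. (An optional sketch.)

# Share of draws per student; each should be close to 1/6
prop.table(table(database$student))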

Normal distribution (Continuous Data)

However, most numbers in the world tend to have higher probabilities around certain values, almost like gravity around a specific point. For instance, income in the United States is not uniformly distributed: a handful of people are extremely rich, many are poor, and most are clustered somewhere around the average.

One such distribution is the “bell curve”, or normal distribution. Let’s generate normally distributed data and visualize it.

set.seed(1234)
normal_data <- rnorm(1000)

normal_plot <- ggplot(data.frame(normal_data), aes(x = normal_data)) +
  geom_histogram(color = "white") 
normal_plot

Now, let’s calculate and plot the average of our data. It’s centered at 0! The average (mean) is the first parameter of the normal distribution.

normal_plot + geom_vline(xintercept = mean(normal_data), color = "red")

But there is another parameter of the normal distribution: the standard deviation (sd or \(\sigma\)). In simple terms, it measures the spread of the data around the average. For example, let’s highlight +- 1 sd around the average. It generally covers about 68% of observations.

normal_plot + 
  geom_vline(xintercept = mean(normal_data) + sd(normal_data), color = "blue") +
  geom_vline(xintercept = mean(normal_data) - sd(normal_data), color = "blue") 

In statistics we are often interested in finding the “location” of 95% of observations. It lies roughly within +- 2 sd.

normal_plot + 
  geom_vline(xintercept = mean(normal_data) + 2*sd(normal_data), color = "blue") +
  geom_vline(xintercept = mean(normal_data) - 2*sd(normal_data), color = "blue") 
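
If you want to verify these rules of thumb on our simulated data, here is an optional check: it computes the share of observations within one and within two standard deviations of the mean.

# Share of observations within 1 sd of the mean (expect roughly 0.68)
mean(abs(normal_data - mean(normal_data)) <= sd(normal_data))

# Share of observations within 2 sd of the mean (expect roughly 0.95)
mean(abs(normal_data - mean(normal_data)) <= 2 * sd(normal_data))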

Probability Density Function

Mathematicians, as usual, came up with a formula to represent this kind of distribution abstractly. I don’t want to scare you, so we won’t go through the formula here. However, it allows us to draw the graph below. This is the generalized version of the normal distribution, called the Probability Density Function (PDF).

ggplot() +
  stat_function(fun = dnorm) +
  geom_vline(aes(xintercept = 1, color = "1 SD")) +
  geom_vline(aes(xintercept = -1, color = "1 SD")) +
  geom_vline(aes(xintercept = 2, color = "2 SD")) +
  geom_vline(aes(xintercept = -2, color = "2 SD")) +
  labs(title = "Normal Distribution (mean = 0, sd = 1)",
       subtitle = "Probability Density Function",
       color = "",
       y = "Probability") +
  scale_x_continuous(breaks = seq(-5, 5, by = 1), limits = c(-5, 5))

Using this, we can, say, evaluate the density at 0 for a normal distribution with \(\mu = 0\) and \(\sigma = 1\). For a bin of width 1, this roughly corresponds to the probability of drawing a value close to 0.

dnorm(0, mean = 0, sd = 1)
[1] 0.3989423

Let’s check it on the normal_data we generated previously. I suggest setting the binwidth to 1 to make the calculation straightforward. Does it remind you of the PDF?

Do you remember what the binwidth argument does?

normal_plot +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(-5, 5, by = 1), limits = c(-5, 5))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Slightly less than 40% of observations fall into the bin around “0”. It matches our expectations! In other words, knowing the average and the standard deviation of a normal distribution, we can predict how many observations will take a given value.
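
You can also compute that share directly, without reading it off the plot. A quick sketch, assuming a bin of width 1 centered at 0:

# Share of observations falling into the width-1 bin centered at 0
mean(normal_data > -0.5 & normal_data <= 0.5)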

Cumulative Distribution Function

What if we are interested in how many observations would be 0 or less? For this we can use the Cumulative Distribution Function (CDF).

Any guesses at this point? Take a look at the PDF!

The cumulative distribution function looks like this: a kind of S shape.

ggplot() +
  geom_function(fun = pnorm) +
  xlim(-5,5) +
  labs(title = "Normal Distribution (mean = 0, sd = 1)",
       subtitle = "Cumulative Distribution Function",
       y = "Probability") 

It is slightly harder to calculate more precise values. For example, what share of observations would be 1 or less? About 84%!

pnorm(1, mean = 0, sd = 1)
[1] 0.8413447

On the graph, it would look like this:

ggplot(NULL, aes(c(-5,5))) +
  geom_area(stat = "function", fun = dnorm,  xlim = c(1, 5)) +
  geom_area(stat = "function", fun = dnorm, fill="sky blue", xlim = c(-5, 1)) +
  labs(title = "Normal Distribution (mean = 0, sd = 1)",
       x = "",
       y = "Probability") +
  scale_x_continuous(breaks = seq(-5, 5, by = 1), limits = c(-5, 5))

Conversely, to find how many observations would be 1 or more, we can use a trick:

1 - pnorm(1, mean = 0, sd = 1)
[1] 0.1586553
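
Equivalently, pnorm() has a lower.tail argument that flips the direction for you:

# Same as 1 - pnorm(1): the probability of observing 1 or more
pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)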

Binomial distribution (Discrete Data)

Often you’ll want to generate a column that only has two values: yes/no, treated/untreated, before/after, big/small, red/blue, etc. You’ll also likely want to control the proportions (25% treated, 62% blue, etc.).

For instance, we can ask R to do the following twenty times: flip a fair coin one hundred times, and count the number of tails.

rbinom(n = 20, size = 100, prob = 0.5)
 [1] 51 45 54 54 43 57 55 49 44 54 46 52 46 54 47 45 51 52 52 53

With the prob = argument, we can simulate unfair coins:

rbinom(n = 20, size = 100, prob = 0.9)
 [1] 94 90 87 90 94 92 93 84 92 91 93 95 93 94 86 93 90 87 92 89

For instance, pretend you want to simulate an election. According to the latest polls, one candidate has an 80% chance of winning. You want to randomly choose a winner based on that chance. Here’s how to do it with sample():

set.seed(1234)
candidates <- c("Person 1", "Person 2")
fake_elections <- data.frame(winner = sample(candidates,
                                         size = 1000,
                                         prob = c(0.8, 0.2),
                                         replace = TRUE))


ggplot(fake_elections, aes(x = winner)) +
  geom_bar()

Other distribution functions

Each distribution can do more than generate random numbers (the r prefix). We can compute the cumulative probability with pbinom(), punif(), and pnorm(), and the density (the value of the PDF) with dbinom(), dunif(), and dnorm().
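
For instance, here are a couple of quick illustrations with the binomial family, using the fair coin with one hundred flips from above:

# Density: probability of exactly 50 tails out of 100 fair flips
dbinom(50, size = 100, prob = 0.5)

# Cumulative probability: probability of 45 tails or fewer
pbinom(45, size = 100, prob = 0.5)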

Random Sampling from Data

Let’s work with real data a bit. Download the results of the World Happiness Report 2024 from the following URL. Load it into R!

h_report <- read.csv("data/happiness_report_2024.csv")

Visualize the happiness_score variable.

ggplot(h_report, aes(x = ...)) +
  geom_histogram(color = "white", binwidth = 1) +
  labs(title = "Distribution of happiness_score variable") 

Now, let’s sample 80 observations from the dataset. Basically, we are randomly extracting rows from the main dataset. Compare sample() and sample_n().

set.seed(1234)
h_report_sample <- sample_n(h_report, size = 80)

Now, compare the graph of the original data with that of its sample.

ggplot(h_report_sample, aes(x = happiness_score)) +
  geom_histogram(color = "white", binwidth = 1) +
  labs(title = "Distribution of Sample of happiness_score variable")

We can check basic characteristics of those distributions, for example their averages. They are pretty similar!

mean(h_report_sample$happiness_score)
mean(h_report$happiness_score)
[1] 5.487167
[1] 5.527572

Alternatively, we can check the standard deviations. They are quite similar too.

sd(h_report_sample$happiness_score)
sd(h_report$happiness_score)
[1] 1.210437
[1] 1.17069

If we assume that the true average \(\mu\) is equal to 5.53, then our estimate \(\hat{\mu}\), taken from the sample average \(\bar{X}\), would be 5.49. In other words, even with limited access to the data we were able to recover the average and sd of the “real” distribution quite closely.

Now, briefly, let’s check how this matches our theoretical assumptions about how the world works. We assume that happiness scores are normally distributed. Thus, even with limited sample knowledge (N = 80) we can attempt to generalize to the whole population (N = 143). Let’s, say, calculate the probability that a given country receives a 6 on the happiness score.

dnorm(x = 6,
      mean = mean(h_report_sample$happiness_score),
      sd = sd(h_report_sample$happiness_score))
[1] 0.3012936
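
Strictly speaking, dnorm() returns a density, which only approximates the share of countries in a bin of width 1 around 6. A slightly more careful sketch of the same calculation uses the CDF instead:

# Probability of a score between 5.5 and 6.5 under the assumed normal model
pnorm(6.5, mean = mean(h_report_sample$happiness_score),
      sd = sd(h_report_sample$happiness_score)) -
  pnorm(5.5, mean = mean(h_report_sample$happiness_score),
        sd = sd(h_report_sample$happiness_score))
# The result should be close to the dnorm()-based value above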

Now, let’s check this against the real data. To make things easier, we add the count of countries in each bin to the plot.

ggplot(h_report, aes(x = happiness_score)) +
  geom_histogram(color = "white", binwidth = 1) +
  stat_bin(binwidth = 1, geom = "text", aes(label = after_stat(count)), vjust = -0.5)

We got \(\approx\) 0.34, while we predicted 0.30. Not bad given our limited access to the data!

49/143
[1] 0.3426573
Exercise

You have access to the diamonds dataset.

data(diamonds)

What are the variables in this dataset?

Using ggplot(), draw a histogram of the variable depth. Diamond depth describes its proportions. The less, the better.

ggplot(diamonds, aes(x=...))+
  geom_histogram() 

Does it remind you of the normal distribution?

mean(diamonds$depth)
sd(diamonds$depth)
[1] 61.7494
[1] 1.432621

Draw a probability density function (dnorm()).

ggplot() +
  stat_function(fun = ..., args = list(mean = mean(...), sd = sd(...))) +
  xlim(40, 80)

Sample 1000 observations.

set.seed(123)
diamonds_sample = sample_n(diamonds, size = 1000)

Compare average and standard deviation. Are they comparable with the original data?

mean(...)
sd(...)

In your free time, feel free to experiment with how few observations you need to recover reasonably reliable parameters of a normal distribution!

Bootstrapping

To get a better measure of \(\hat{\mu}\) we can apply bootstrapping. To put it simply, we are going to repeat the sampling process several times, and each time we draw a random sample from the dataset we record its average. First, we create a data frame that we’ll use to store those averages.

sample_averages <- data.frame() # create an empty data frame to store the averages

Using loops, you can repeat actions several times. This is exactly the kind of task programming is helpful for! Explore how the for loop below works.

  • for indicates the beginning of the loop

  • i stands for the index of an iteration

  • in 1:10 specifies the range of values from 1 to 10

for(i in 1:10){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Now, iterate (i.e., repeat) the sampling process!

for(i in 1:100){
  temporary_sample <- sample_n(h_report, size = 80)                   # sample 80 observations
  temporary_sample_average <- mean(temporary_sample$happiness_score)  # calculate the sample average 
  sample_averages <- rbind(sample_averages, temporary_sample_average) # add the sample average to the df
}

colnames(sample_averages) = "average"
head(sample_averages)
   average
1 5.444772
2 5.782780
3 5.351790
4 5.519500
5 5.477656
6 5.479902

Take a look at the average of the collected averages. Did it get closer to the real parameter?

mean(sample_averages$average) 
[1] 5.531569
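
If you are curious, you can also look at how the sample averages themselves are distributed; they cluster tightly around the true mean. An optional sketch (the binwidth is chosen by eye):

# Distribution of the 100 collected sample averages
ggplot(sample_averages, aes(x = average)) +
  geom_histogram(color = "white", binwidth = 0.05)
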
Exercise

In the h_report data frame, explore the freedom variable. Check the descriptive statistics.

Draw a histogram of this variable using ggplot. Indicate the average.

ggplot(...) +
  geom_histogram(aes(x = ...)) +
  geom_vline(xintercept = mean(..., na.rm = T))

Calculate the average.

mean(..., na.rm = T)

Using bootstrapping, get a comparable average.

First, create a dataframe to store the results of the averages.

bootstrap_averages = ...

Finish this for loop so that it saves the average of the sampled data. Repeat the loop 10 times. Limit the size of sample_n to 50.

...(i in 1:...){
  temporary_sample <- sample_n(h_report, size = ...)
  temporary_sample_average <- ...
  bootstrap_averages <- rbind(bootstrap_averages, temporary_sample_average)
}

colnames(bootstrap_averages) = "average"

Compare the average of h_report$freedom and bootstrap_averages$average. Are they different?

Repeat the comparison, but change the number of iterations to 100. Did the result improve?

Automation and Custom functions

When you work with data, you frequently want to perform the same set of actions on a given dataset. For example, you might want to calculate the average for a specific set of variables and then store the results in a separate data frame describing the selection. In this case, custom functions are incredibly helpful.

Explore the function below. It stores the averages of the happiness_score, life_expectancy and freedom variables.

calculate_average <- function(dataframe){
  
  average_happiness = mean(dataframe$happiness_score)
  average_expectancy = mean(dataframe$life_expectancy, na.rm = T)
  average_freedom = mean(dataframe$freedom, na.rm = T)
  
  average_output = data.frame(average_happiness, average_expectancy, average_freedom)
  return(average_output) # This is what the function returns as a result
}

Try it out.

calculate_average(h_report)
  average_happiness average_expectancy average_freedom
1          5.527572           0.520843       0.6205882

Now, we can easily calculate those averages anytime we sample from the dataset.

calculate_average(h_report_sample)
  average_happiness average_expectancy average_freedom
1          5.487167          0.5148118       0.6285374
Exercise

Custom functions save a lot of time, especially if you have a non-trivial task. Write a function that calculates the median of the happiness_score, life_expectancy and freedom variables in a given data frame.

calculate_median <- function(...){
  
  ...
  
  median_output = data.frame(...)
  return(median_output) # This is what the function returns as a result
}

And now try it out!

calculate_median(h_report)

Write a function that automates the calculation of average happiness for a selected group of countries from the regions below.

post_soviet_countries <- c("Kazakhstan", "Uzbekistan", "Armenia")
latin_america_countries <- c("Venezuela", "Argentina", "Paraguay")

Use a draft below for your function.

average_happiness = ...(dataframe, countries_list){
  
  dataframe %>% 
    filter(country_name %in% ...) %>%
    summarize(average = mean(...)) %>%
    ...()
  
}

average_happiness(h_report, post_soviet_countries)            # Note the difference in (i) countries list
average_happiness(h_report_sample, latin_america_countries)   # and (ii) in dataframes we use

Apply function

Sometimes it is easier to use the apply family of functions. For example, we can easily calculate an average for selected columns using the apply() function.

h_report %>%
  select(happiness_score, log_gdp, social_support) %>%
  apply(2, mean, na.rm = TRUE)
happiness_score         log_gdp  social_support 
       5.527572        1.378779        1.134323 

Moreover, we can calculate statistics for observations (i.e., rows) rather than variables. For example, we can find the minimum value across all variables for a specific country. As you might notice, the second argument of apply() specifies whether we operate over rows or columns: 1 for rows, 2 for columns.

h_report %>%
  filter(country_name == "Azerbaijan") %>%
  apply(1, min, na.rm = TRUE)
[1] "0.1115664"

Most importantly, however, apply() is compatible with custom functions. For example, suppose you want to standardize a variable so that it ranges from 0 to 1. You have the following function.

standardize = function(x){
  # rescale x so that its minimum becomes 0 and its maximum becomes 1
  (x - min(x, na.rm = T)) / (max(x, na.rm = T) - min(x, na.rm = T))
}
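
To see what it does, try it on a small made-up vector first: the minimum maps to 0, the maximum to 1, and everything else lands in between.

standardize(c(2, 4, 6, 10))  # returns 0.00 0.25 0.50 1.00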

Now, you can apply it over a number of variables.

h_report %>%
  select(social_support, freedom) %>%
  apply(2, standardize) %>%
  head()
     social_support   freedom
[1,]      0.9726830 0.9953240
[2,]      0.9405118 0.9529831
[3,]      1.0000000 0.9480820
[4,]      0.9284272 0.9709709
[5,]      0.9358299 0.7430091
[6,]      0.9044380 0.8398647

Map function

It is quite often the case in programming that the same operation can be done in multiple ways. Compare the map() function from the tidyverse with the apply() function. What is the difference?

h_report %>%
  select(happiness_score, log_gdp, social_support) %>% 
  map(mean, na.rm = T)
$happiness_score
[1] 5.527572

$log_gdp
[1] 1.378779

$social_support
[1] 1.134323

Notice that map() returns a list, with one element per column. Now, take a look at the code below: map_df() returns a single data frame (a tibble) instead.

h_report %>%
  select(happiness_score, log_gdp, social_support) %>% 
  map_df(mean, na.rm = T)
# A tibble: 1 × 3
  happiness_score log_gdp social_support
            <dbl>   <dbl>          <dbl>
1            5.53    1.38           1.13

Furthermore, it is compatible with custom functions, too.

h_report %>%
  select(happiness_score, log_gdp, social_support) %>% 
  map_df(standardize) %>%
  head()
# A tibble: 6 × 3
  happiness_score log_gdp social_support
            <dbl>   <dbl>          <dbl>
1           1       0.861          0.973
2           0.974   0.891          0.941
3           0.964   0.879          1    
4           0.934   0.877          0.928
5           0.934   0.842          0.936
6           0.930   0.888          0.904

Generally, this is an extensive toolkit to learn. But to sum up: the apply() functions are part of base R, while map() comes from the tidyverse (the purrr package). If you are coding in a tidy way, I suggest sticking to the tidy functions.

Compare apply() and map() to summarize_all() and summarize().
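
For instance, here is a sketch of the same column-wise averages using summarize() with across(), the tidyverse helper that supersedes summarize_all():

h_report %>%
  summarize(across(c(happiness_score, log_gdp, social_support),
                   ~ mean(.x, na.rm = TRUE)))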

Exercise

Finish a function that calculates an average based on two columns, namely social_support and corruption.

average_of_two = function(dataset){
  support_average = mean(dataset$..., na.rm = T)
  corruption_average = mean(..., na.rm = T)
  
  average = mean(c(...))
  return(average)
}

average_of_two(h_report)

Do you remember the mtcars data? Load it:

cars_information <- mtcars

Now, using apply() calculate average for each column.

Now, using map() calculate median for each column.

Use the loop draft below to sample 10 cars from the cars_information dataset 15 times. Each time, apply the map_df() function to calculate the average mpg and wt, and store the result in the sample_statistics data frame.

sample_statistics = data.frame()

for(i in 1:...){
  temporary_sample <- sample_n(cars_information, size = ...)   
  
  temporary_sample_average <- temporary_sample %>% 
    select(...) %>% 
    ...(mean)  
  
  sample_statistics <- rbind(sample_statistics, ...) 
}

sample_statistics

What should you do if you have questions about programming?

  • Stack overflow

  • Your peers

  • The team of this math camp

Check List

Standard deviation, PDF and CDF do not scare me

I know what sampling is, and on top of this I can use bootstrapping

I can understand custom functions and loops

What did we learn over this week in the programming part?

  • What R, RStudio, Quarto and Markdown are

  • Creating and working with objects, vectors and dataframes

  • Logical Operators

  • What descriptive statistics are

  • Data wrangling, including mutate(), select(), filter(), summarize(), case_when(), etc.

  • Data Visualization. Now we know what histograms and scatterplots are

  • We can load data in R and work with it!

Sources

  • Georgia State University, Andrew Young School of Policy Studies, Program Evaluation, https://evalsp24.classes.andrewheiss.com/

  • UT Austin, Department of Government, Methods Camp, https://methodscamp.github.io/

  • Harvard University Department of Government, Math Prefresher, https://iqss.github.io/prefresher/