I have an unbalanced dataset in which people from liberal and conservative backgrounds rate an issue on a 1-7 scale. I would like to see how polarized the issue is.

The sample is heavily skewed towards liberals (70% of the sample). How do I use repeated sampling in R to create a balanced (50-50) sample and calculate its kurtosis?

For example, say I have 50 conservatives in total. How do I repeatedly draw a random sample of 50 liberals out of 150?

A sample dataframe below:

  political_ort   rating  
    liberal         1 
    liberal         6 
    conservative    5   
    conservative    3   
    liberal         7  
    liberal         3 
    liberal         1
Yvonne
  • Does this answer your question? [Sampling from a data.frame while controlling for a proportion \[stratified sampling\]](https://stackoverflow.com/questions/29360799/sampling-from-a-data-frame-while-controlling-for-a-proportion-stratified-sampli) – jared_mamrot Jan 28 '21 at 23:46
  • Thanks! Not really. I'm looking to sample the same number of liberals as conservatives. So if there are 10 conservatives, I would like to sample 10 of the 70 liberals repeatedly. – Yvonne Jan 29 '21 at 01:45

1 Answer

What you're describing is termed 'undersampling'. Here is one method using tidyverse functions:

# Load library
library(tidyverse)

# Create some 'test' (fake) data
# (tibble() replaces the deprecated data_frame(); the column is named
# 'rating' to match the question)
sample_df <- tibble(id_number = 1:100,
                    political_ort = c(rep("liberal", 70),
                                      rep("conservative", 30)),
                    rating = sample(1:7, size = 100, replace = TRUE))

# Take the fake data
undersampled_df <- sample_df %>% 
# Group the data by category (liberal / conservative) to treat them separately
  group_by(political_ort) %>% 
# And randomly sample 30 rows from each category (liberal / conservative)
  slice_sample(n = 30, replace = FALSE) %>%
# Because there are only 30 conservatives in total they are all included
# Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
  ungroup()
# You can see the id_numbers aren't in order anymore indicating the sampling was random

There is also the ROSE package that has a function ("ovun.sample") that can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample
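
The question also asks for repeated draws and a kurtosis value, which a single undersample does not cover. Below is a minimal base-R sketch of one way to do that, not a definitive implementation. It assumes 150 liberals and 50 conservatives as in the question's example, uses made-up ratings, and defines its own hypothetical helper `moment_kurtosis()` (Pearson's moment kurtosis, where a normal distribution gives 3) rather than relying on a package. The seed and the number of repetitions are arbitrary choices.

```r
# Hypothetical helper (not from any package): Pearson's moment kurtosis.
# A normal distribution gives 3; an even two-point split gives the minimum, 1.
moment_kurtosis <- function(x) {
  centred <- x - mean(x)
  mean(centred^4) / mean(centred^2)^2
}

set.seed(42)  # arbitrary seed, only so the draws are reproducible

# Fake data shaped like the question's example: 150 liberals, 50 conservatives
sample_df <- data.frame(
  political_ort = c(rep("liberal", 150), rep("conservative", 50)),
  rating = sample(1:7, size = 200, replace = TRUE)
)

conservatives <- sample_df[sample_df$political_ort == "conservative", ]
liberals <- sample_df[sample_df$political_ort == "liberal", ]

n_reps <- 1000  # number of repeated draws (assumption; adjust as needed)

# Each repetition keeps all conservatives, draws the same number of liberals
# without replacement, and computes the kurtosis of the balanced sample.
kurtosis_draws <- replicate(n_reps, {
  picked <- sample(nrow(liberals), size = nrow(conservatives), replace = FALSE)
  balanced <- rbind(conservatives, liberals[picked, ])
  moment_kurtosis(balanced$rating)
})

# Summarise the distribution of kurtosis values across the balanced samples
mean(kurtosis_draws)
quantile(kurtosis_draws, c(0.025, 0.975))
```

With real data, replace the fake `sample_df` with your own data frame. The mean and quantiles then describe how much the kurtosis varies depending on which liberals happen to be drawn.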

jared_mamrot