I have a dataset of around 200k rows that looks like this:

Report ID | Month | Day | Year | Location ID | comments
1             4       1    2015       200          blah blah blah
2            11       3    2014       100          blah blah blah 
3             4       5    2015       203          blah blah blah
4             8      30    2012       204          blah blah blah
5            11       5    2013       204          blah blah blah
6            11       1    2015       100          blah blah blah  
7            11      10    2013       204          blah blah blah

I need to create a random sample of report IDs that has an even distribution of location IDs, years, and months. I know this wouldn't truly be a random sample, but location ID skews pretty heavily toward some locations, and some months have far more reports than others.

I have tried various sampling and subsetting techniques in R, but they all seem to want to sample the data set as a whole. I've been unable to find a way to ask for, say, 500 report IDs for each location, let alone to then say that within those 500 I want an even distribution of years and months. Any suggestions?

Neil Lunn
J.Gorman

1 Answer

I was able to get there with dplyr, following the lead from the comment left by Mr.Joshuagordon. The idea is to group the data first and then sample a fixed number of rows within each group:

library(dplyr)

# Draw 2 random rows from each cyl group
mtcars %>%
    group_by(cyl) %>%
    sample_n(2)

Related question: "sample rows of subgroups from dataframe with dplyr"
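Applied to the data in the question, the same grouped-sampling idea might look like the sketch below. It is only an illustration under some assumptions: the toy `reports` data frame and its column names (`report_id`, `month`, `year`, `location_id`) are stand-ins for the real data, `per_cell` is a hypothetical quota, and it uses `slice_sample()`, which needs dplyr 1.0.0 or later. Grouping by location, year, and month together and drawing the same number of rows from each cell gives a roughly even spread across all three. Cells with fewer rows than the quota contribute everything they have, so the total can fall short of the target.

```r
library(dplyr)

set.seed(42)  # reproducible toy data and sample

# Toy stand-in for the real 200k-row data; column names are assumptions.
reports <- data.frame(
  report_id   = 1:6000,
  month       = sample(1:12, 6000, replace = TRUE),
  year        = sample(2012:2015, 6000, replace = TRUE),
  location_id = sample(c(100, 200, 203, 204), 6000, replace = TRUE,
                       prob = c(0.6, 0.2, 0.1, 0.1))  # skewed on purpose
)

per_cell <- 5  # hypothetical quota per location/year/month cell

balanced <- reports %>%
  group_by(location_id, year, month) %>%
  slice_sample(n = per_cell) %>%  # at most per_cell rows from each cell
  ungroup()

# Check the balance: number of sampled rows in each cell
cell_counts <- count(balanced, location_id, year, month)
```

To aim for about 500 reports per location, one option is to set `per_cell` to roughly 500 divided by the number of year/month combinations, then inspect `cell_counts` to see which cells came up short.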

Community
J.Gorman