Random samples but grouped by certain values in columns

Question

I have a dataset of around 200k rows that looks like this:

Report ID | Month | Day | Year | Location ID | comments
1         | 4     | 1   | 2015 | 200         | blah blah blah
2         | 11    | 3   | 2014 | 100         | blah blah blah
3         | 4     | 5   | 2015 | 203         | blah blah blah
4         | 8     | 30  | 2012 | 204         | blah blah blah
5         | 11    | 5   | 2013 | 204         | blah blah blah
6         | 11    | 1   | 2015 | 100         | blah blah blah
7         | 11    | 10  | 2013 | 204         | blah blah blah

I need to create a random sample of report IDs that has an even distribution of location IDs, years, and months. I know this wouldn't truly be a random sample, but location ID skews pretty heavily toward some locations, and some months have way more reports than others.

I have tried various sampling and subsetting techniques in R, but they all seem to want to sample the data set as a whole, and I've been unable to find a way to ask for, say, 500 report IDs for each location, let alone to then say that within those 500 I want an even distribution of years and months. Any suggestions?

http://stackoverflow.com/questions/21255366/sample-rows-of-subgroups-from-dataframe-with-dplyr — mr.joshuagordon, Nov 30 '16 at 23:08
Have you seen [this](http://stackoverflow.com/questions/23479512/stratified-random-sampling-from-data-frame-in-r)? — David., Dec 01 '16 at 07:20

score 0 · Accepted Answer · edited May 23 '17 at 12:24

I was able to get there with dplyr, following the lead from the comment left by mr.joshuagordon. Group by the column you want balanced, then sample a fixed number of rows within each group:

library(dplyr)

# Draw 2 random rows from each cyl group of the built-in mtcars data
mtcars %>% 
    group_by(cyl) %>%
    do(sample_n(., 2))

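Reference: "Sample rows of subgroups from dataframe with dplyr" (http://stackoverflow.com/questions/21255366/sample-rows-of-subgroups-from-dataframe-with-dplyr)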
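For the data in the question, the same idea might look roughly like the sketch below. It is an illustration and not part of the original answer. The `reports` data frame, its column names (`report_id`, `location_id`, `year`, `month`), and the `per_cell` target are all made up here. It also assumes dplyr 1.0.0 or later, where `slice_sample()` on a grouped data frame samples within each group and, without replacement, returns the whole group when it has fewer than `n` rows.

```r
library(dplyr)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical stand-in for the real 200k-row data; column names are assumed.
# Locations are drawn with unequal probabilities to mimic the skew described.
reports <- data.frame(
  report_id   = 1:2000,
  month       = sample(1:12, 2000, replace = TRUE),
  year        = sample(2012:2015, 2000, replace = TRUE),
  location_id = sample(c(100, 200, 203, 204), 2000, replace = TRUE,
                       prob = c(0.6, 0.2, 0.1, 0.1))
)

per_cell <- 3  # assumed number of reports wanted per location/year/month cell

# Group on all three columns so each combination contributes at most per_cell rows
balanced <- reports %>%
  group_by(location_id, year, month) %>%
  slice_sample(n = per_cell) %>%
  ungroup()

# Inspect how many sampled reports each location ended up with
count(balanced, location_id)
```

To aim for roughly 500 reports per location, `per_cell` could be set to about 500 divided by the number of year/month combinations (for example, about 10 with 4 years and 12 months). Cells with fewer rows than that will contribute fewer, so the totals per location are only approximately even.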

answered Dec 05 '16 at 16:36

J.Gorman
