
I have data sets with 1 to 70 columns of data and 1 to 5 columns of ID variables. I need to group by the ID variables and then randomly sample the rows of data so that the re-sampled data set is the same length as the original. Below is an example DATA set with the desired RESULT table.

So I need to group_by SITE and DATE and then randomly sample a single row from STUFF:STUFF3. Please note how the RESULT table keeps the values aligned across the columns STUFF:STUFF3. For example, the first two rows in the RESULT table are both 2, 4, 8, which corresponds to row 2 in the DATA table.

I have code that subsets in a for loop, but I would prefer to use dplyr. I hope this is clear. Thanks.

DATA = data.frame(SITE = c("A","A","A","A","B","B","B","C","C"), 
                  DATE = c("1","1","2","2","3","3","3","4","4"), 
                  STUFF = c(1, 2, 30, 40, 100, 200, 300, 5000, 6000),
                  STUFF2 = c(2, 4, 60, 80, 200, 400, 600, 10000, 12000),
                  STUFF3 = c(4, 8, 120, 160, 400, 800, 1200, 20000, 24000))



RESULT = data.frame(SITE = c("A","A","A","A","B","B","B","C","C"), 
                    DATE = c("1","1","2","2","3","3","3","4","4"), 
                    STUFF = c(2, 2, 30, 30, 200, 300, 300, 6000, 5000),
                    STUFF2 = c(4, 4, 60, 60, 400, 600, 600, 12000, 10000),
                    STUFF3 = c(8, 8, 120, 120, 800, 1200, 1200, 24000, 20000))
Vesuccio

1 Answer


Here's a quick data.table solution. You basically want to sample row indices and extract the whole data set according to that sample. You can use .N to determine each group's size and then subset .SD (the per-group Subset of Data) with sample, for example (use set.seed in order to reproduce exact results from the sample calls):

library(data.table)
setDT(DATA)[, .SD[sample(.N)], by = .(SITE, DATE)]
##   SITE DATE STUFF STUFF2 STUFF3
##1:    A    1     1      2      4
##2:    A    1     2      4      8
##3:    A    2    30     60    120
##4:    A    2    40     80    160
##5:    B    3   100    200    400
##6:    B    3   300    600   1200
##7:    B    3   200    400    800
##8:    C    4  6000  12000  24000
##9:    C    4  5000  10000  20000
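
To make the sampled order reproducible, as mentioned above, call set.seed with any fixed value immediately before each sampling step. A minimal sketch using the question's DATA (the seed value 123 is an arbitrary choice):

```r
library(data.table)

DATA <- data.frame(SITE = c("A","A","A","A","B","B","B","C","C"),
                   DATE = c("1","1","2","2","3","3","3","4","4"),
                   STUFF = c(1, 2, 30, 40, 100, 200, 300, 5000, 6000))

set.seed(123)  # fix the RNG state before sampling
first <- as.data.table(DATA)[, .SD[sample(.N)], by = .(SITE, DATE)]

set.seed(123)  # same seed, so the same permutation is drawn
second <- as.data.table(DATA)[, .SD[sample(.N)], by = .(SITE, DATE)]

identical(first, second)  # TRUE
```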

Alternatively, if you insist on using dplyr, you can use n() and slice in order to achieve the same effect:

library(dplyr)
DATA %>%
  group_by(SITE, DATE) %>%
  slice(sample(n()))
# Source: local data table [9 x 5]
# Groups: SITE, DATE
# 
#   SITE DATE STUFF STUFF2 STUFF3
# 1    A    1     2      4      8
# 2    A    1     1      2      4
# 3    A    2    30     60    120
# 4    A    2    40     80    160
# 5    B    3   200    400    800
# 6    B    3   100    200    400
# 7    B    3   300    600   1200
# 8    C    4  6000  12000  24000
# 9    C    4  5000  10000  20000
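
One caveat worth noting: sample(.N) and sample(n()) permute each group's rows without replacement. The question's RESULT table repeats row 2 twice for SITE A / DATE 1, which suggests sampling with replacement (a per-group bootstrap). If that is the intent, pass replace = TRUE; a sketch of the dplyr version:

```r
library(dplyr)

DATA <- data.frame(SITE = c("A","A","A","A","B","B","B","C","C"),
                   DATE = c("1","1","2","2","3","3","3","4","4"),
                   STUFF = c(1, 2, 30, 40, 100, 200, 300, 5000, 6000))

# sample(n, replace = TRUE) draws n indices from 1:n with replacement,
# so rows within a group can repeat, as in the RESULT table
DATA %>%
  group_by(SITE, DATE) %>%
  slice(sample(n(), replace = TRUE))
```

The output still has the same number of rows per group as the input; only the within-group draws change.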
David Arenburg
  • Thanks David! That is the right stuff. I have been using `dplyr` extensively to mine my data sets. Does the `data.table` package have any distinct advantages over `dplyr`? – Vesuccio Mar 12 '15 at 22:55
    Not a simple question to answer in comments... Luckily it was already asked and answered by its two developers, see [here](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) – David Arenburg Mar 12 '15 at 22:56
  • I will look into it. Thanks for the link! Take care. – Vesuccio Mar 12 '15 at 22:58
  • @Vesuccio: For such a small dataset it's not going to matter. You should pick the method that easiest to understand. – IRTFM Mar 12 '15 at 23:38