
This is a combination of two questions ("Repeat the re-sampling function for 1000 times? Using lapply?" and "How do you sample groups in a data.table with a caveat").

The goal is to sample groups in a data.table, but repeat this process "n" times and pull the average for each row-value. For example:

#generate the data
library(data.table)
DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))

#sample the data as done in the second linked question
DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42

Now here is my attempt using the answer given in the first-linked question:

x <- replicate(100,{DT[,.SD[sample(.N,min(.N,3))],by = a]})

This returns a list "x" holding every repetition. The only way I can think of to access the repetitions is this:

# repetition 1 col-a values
x[[1]]
# repetition 1 col-b values
x[[2]]
# repetition 2 col-a values
x[[3]]
# repetition 2 col-b values
x[[4]]

So to get the average for each row, I would have to find the mean of x[[j]] for every j in seq(2,200,2), where 200 is the number of replications times 2.

Is there an easier way of doing this? I have tried using this solution (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r) in this fashion:

y <- DT[,.SD[sample(.N,min(.N,3))],by = a]
y[,list(mean=mean(b)),by=a]
     a mean
 1:  1  550
 2:  2  849
 3:  3  603
 4:  4   77
 5:  5  973
 6:  6  746
 7:  7  919
 8:  8  655
 9:  9  883
10: 10  823
11: 11  533
12: 12  483
13: 13   53
14: 14  827
15: 15  413

But I have yet to be able to do this with the replication process. Any help would be great!

road_to_quantdom

1 Answer


Something like this??

Based on your comments, you want means by group for each replicate, so in this example 15 * 100 means. Here are two ways to do that.

library(data.table)
set.seed(1) # for reproducibility
DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
x <- replicate(100,{DT[,.SD[sample(.N,min(.N,3))],by = a]})

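# odd elements of x hold column a, even elements hold column b
indx <- seq(1,length(x),2)
# method 1: pair each replicate's a with its b and aggregate b by a
result.1 <- mapply(function(a,b)aggregate(b,list(a),mean)$x,x[indx],x[indx+1])
str(result.1)
#  num [1:15, 1:100] 569 201 894 940 657 625 62 204 175 679 ...
# method 2: reuse the a column of the first replicate (x[1]) for every b
result.2 <- sapply(x[indx+1],function(b)aggregate(b,x[1],mean)$x)
identical(result.1,result.2)
# [1] TRUE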

Both methods produce a 15 x 100 matrix of means, with the groups in rows and the replicates in columns. The second approach takes advantage of the fact that the a column is the same for all replicates.
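
If the group labels need to stay attached to the means, one alternative is to keep each replicate as a data.table and aggregate in data.table itself. The following is a minimal sketch, not part of the original answer: it assumes `simplify = FALSE` in `replicate()` so that the result is a plain list of 100 data.tables, and the names `reps`, `long`, `run`, `means`, `mean_b`, and `wide` are illustrative only.

```r
library(data.table)
set.seed(1) # for reproducibility
DT <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))

# simplify = FALSE keeps every replicate as its own data.table
reps <- replicate(100, DT[, .SD[sample(.N, min(.N, 3))], by = a],
                  simplify = FALSE)

# stack the replicates; idcol adds a column "run" with the replicate number
long <- rbindlist(reps, idcol = "run")

# one mean per replicate and group, with a kept as a regular column
means <- long[, .(mean_b = mean(b)), by = .(run, a)]

# optional: reshape to one row per group and one column per replicate
wide <- dcast(means, a ~ run, value.var = "mean_b")
```

Here `means` has one row per replicate and group (100 * 15 rows), and `wide` follows the same groups-in-rows, replicates-in-columns layout as the matrix above, with `a` as a leading column.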

jlhoward
  • yeah it does simplify when "a" is not needed anymore. What if in the context of another problem we would like to keep it as a rowname or something? – road_to_quantdom Dec 10 '14 at 21:34
  • Then use `sapply(seq(2,length(x),2),function(i)mean(x[[i]]))`?? – jlhoward Dec 10 '14 at 21:48
  • the thing is, I don't want to average the ENTIRE list. I want to average by group. And in this example a indicates the group – road_to_quantdom Dec 10 '14 at 23:22
  • So do you want the mean for each replicate for each group?? For a total of 100*15 means in your example?? – jlhoward Dec 11 '14 at 00:13
  • Yes, so an average b value for every unique a value. – road_to_quantdom Dec 11 '14 at 00:17
  • I think what I want to do is using the "group by mean" stackoverflow answer to the list "x". After I replicate the process 100 times, I want to take the average for each unique "a" value. Basically, using this technique: http://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r – road_to_quantdom Dec 19 '14 at 01:01
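
If the aim described in the last comment is a single average of b for each unique a across all replicates, a sketch along these lines may help. This is an assumption about what is being asked, not something confirmed in the thread, and `reps`, `per_rep`, `mean_b`, and `overall` are illustrative names.

```r
library(data.table)
set.seed(1) # for reproducibility
DT <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))

# keep each replicate as a separate data.table (assumes simplify = FALSE)
reps <- replicate(100, DT[, .SD[sample(.N, min(.N, 3))], by = a],
                  simplify = FALSE)

# "group by mean" applied to every replicate, then stacked into one table
per_rep <- rbindlist(lapply(reps, function(d) d[, .(mean_b = mean(b)), by = a]))

# average the per-replicate group means over all 100 replicates
overall <- per_rep[, .(mean_b = mean(mean_b)), by = a]
```

With this example data only group 1 has more than one row, so every other group returns the same b value in each replicate and only the mean for a = 1 actually varies.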