3

This is my first post so please bear with me. Below is a small sample of my data. My actual dataset has over 4,000 individual IDs and each ID can have anywhere from one to two hundred separate dollar amounts assigned to it.

ID   Dollars
001  17000
001  18000
001  23000
002  64000
002  31000
003  96000
003  164000
003  76000

What I'm trying to do is best explained with an example. I want to generate five random samples, with replacement, for each ID. Each sample would have a size of 5, i.e. 5 randomly sampled dollar values. My final result would have 20,000 separate samples (5 samples for each of the 4,000 IDs, each containing 5 dollar amounts randomly selected within that ID). I am doing this in order to compare the distribution of dollars in each sample to the other samples with the same ID.

Right now I'm attempting this with the code below. When I run it, I receive an error that my 'results must be all atomic'. I'm not sure whether I need to add additional steps or what.

x <- function(func)
     {
      func<-(lapply(1:5, function(i)
        sample(data$Dollars, size=5, replace=TRUE)))
     }
     grouped.samples<-ddply(data,.variables="ID",.fun=x)

I’m sorry in advance if the question I posed was unclear; I had difficulty articulating the problem I'm having.

Thanks in advance for your help

YimYames

5 Answers

5

Using data.table:

library(data.table)
dt = as.data.table(your_df)

dt[, Dollars[sample.int(.N, 5, TRUE)], by = ID]
#    ID     V1
# 1:  1  17000
# 2:  1  18000
# 3:  1  18000
# 4:  1  23000
# 5:  1  17000
# 6:  2  31000
# 7:  2  31000
# 8:  2  31000
# 9:  2  31000
#10:  2  64000
#11:  3  96000
#12:  3  96000
#13:  3  76000
#14:  3 164000
#15:  3  76000
eddi
  • This is far and away the fastest solution, esp. with 4000 IDs. – jlhoward Jun 27 '14 at 21:24
  • thanks for your help @eddi. When I run the second part of the code where you use the sample function I receive the following error: 'Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.' – YimYames Jun 27 '14 at 21:35
  • @YimYames that's because of how `sample` behaves when given a single number - see `?sample` and use the `resample` function instead from the examples there; answer modified to do that – eddi Jun 27 '14 at 21:51
  • thanks again for your help @eddi, it's much appreciated. I do, however, have one more question. Based on the sample output you provided above (from your script), is it possible to produce an output that includes columns V2, V3, V4, and V5? Again, each column would be another sample of dollars for each ID. – YimYames Jun 29 '14 at 17:59
  • @YimYames just sample again, e.g. `dt[, list(Dollars[sample.int(.N, 5, T)], Dollars[sample.int(.N, 5, T)]), by = ID]` – eddi Jun 30 '14 at 15:39
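
Following up on the last comment, here is a minimal sketch of how the V1 to V5 layout could be produced without typing the sampling expression five times. It assumes the question's sample data and that data.table is installed; `n_samples`, `sample_size`, and `res` are illustrative names, not part of the original answer.

```r
library(data.table)

set.seed(42)  # only so the sketch is reproducible

# The question's sample data (IDs kept as character to preserve "001" etc.)
dt <- data.table(
  ID      = c("001", "001", "001", "002", "002", "003", "003", "003"),
  Dollars = c(17000, 18000, 23000, 64000, 31000, 96000, 164000, 76000)
)

n_samples   <- 5  # hypothetical name: number of samples (columns) per ID
sample_size <- 5  # hypothetical name: number of draws (rows) per sample

# For each ID, build a list of n_samples vectors; data.table turns each
# list element into a column, so every ID contributes sample_size rows.
res <- dt[, {
  draws <- lapply(seq_len(n_samples), function(i) {
    Dollars[sample.int(.N, sample_size, replace = TRUE)]
  })
  setNames(draws, paste0("V", seq_len(n_samples)))
}, by = ID]

print(res)
```

Sampling row positions with `sample.int(.N, ...)` rather than the values themselves keeps the result correct for IDs that have only one dollar amount.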
4

I thought I'd add a dplyr solution, using sample_n just as in one of the answers to this question.

require(dplyr)
dat1 %>%
    group_by(ID) %>%
    do(sample_n(., 5, replace = TRUE))

EDIT:

After looking at the help for sample_n more, I realized that the sample_n function should work directly within groups (so, without the do). It doesn't currently, which is a known issue.
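
For readers on a newer dplyr (1.0.0 or later), `slice_sample()` supersedes `sample_n()` and respects grouping, so the `do()` wrapper should no longer be needed. A minimal sketch, assuming the question's sample data stored in `dat1`; the name `sampled` is illustrative:

```r
library(dplyr)

set.seed(1)  # only so the sketch is reproducible

# The question's sample data
dat1 <- data.frame(
  ID      = c("001", "001", "001", "002", "002", "003", "003", "003"),
  Dollars = c(17000, 18000, 23000, 64000, 31000, 96000, 164000, 76000)
)

# Draw 5 rows per ID, with replacement, then drop the grouping
sampled <- dat1 %>%
  group_by(ID) %>%
  slice_sample(n = 5, replace = TRUE) %>%
  ungroup()

print(sampled)
```

Because whole rows are sampled, this also behaves sensibly for an ID that has a single dollar amount.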

aosmith
2

I would try something like this:

cbind(rep(unique(d$ID), each=5), 
      unlist(tapply(d$Dollars, d$ID, FUN=sample, size=5, replace=TRUE)))
   [,1]   [,2]
11    1  18000
12    1  17000
13    1  18000
14    1  17000
15    1  17000
21    2  31000
22    2  31000
23    2  64000
24    2  64000
25    2  64000
31    3 164000
32    3  96000
33    3  96000
34    3  76000
35    3  96000
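
One caveat when passing `sample` directly as `FUN`: the question says an ID can have a single dollar amount, and `sample(x, ...)` treats a single number `x >= 1` as the range `1:x` (see `?sample`). The sketch below illustrates the difference; the `resample` helper follows the pattern shown in the examples of `?sample`, and `lone_value`, `wrong`, and `right` are illustrative names.

```r
set.seed(1)  # only so the sketch is reproducible

lone_value <- 17000  # an ID with a single dollar amount

# Pitfall: this draws 5 integers from 1:17000, not five copies of 17000
wrong <- sample(lone_value, 5, replace = TRUE)

# Safer: sample positions, then index into the vector
resample <- function(x, ...) x[sample.int(length(x), ...)]
right <- resample(lone_value, 5, replace = TRUE)

print(wrong)
print(right)
```

Swapping `FUN=sample` for a helper like `resample` in the `tapply` call above should avoid the problem for single-value IDs.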
Thomas
0

Try this.

# create sample dataset...
df <- data.frame(ID = rep(1:400, each = 10), Dollars = 1000 * rpois(4000, 5))

# this does the work: split by ID, sample 5 rows per ID with replacement, recombine
result <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  x[sample.int(nrow(x), 5, replace = TRUE), ]
}))
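
The same split-and-apply idea can be extended to the layout described in the question, i.e. five samples of five values for each ID. This is a sketch in base R under that reading of the question, using the question's sample data; `n_samples`, `sample_size`, and `samples_by_id` are illustrative names.

```r
set.seed(123)  # only so the sketch is reproducible

# The question's sample data
df <- data.frame(
  ID      = c("001", "001", "001", "002", "002", "003", "003", "003"),
  Dollars = c(17000, 18000, 23000, 64000, 31000, 96000, 164000, 76000)
)

n_samples   <- 5  # hypothetical name: samples per ID
sample_size <- 5  # hypothetical name: values per sample

# One matrix per ID: rows are draws, columns are the separate samples
samples_by_id <- lapply(split(df$Dollars, df$ID), function(v) {
  m <- replicate(
    n_samples,
    v[sample.int(length(v), sample_size, replace = TRUE)]
  )
  colnames(m) <- paste0("sample", seq_len(n_samples))
  m
})

print(samples_by_id[["001"]])
```

Keeping one matrix per ID makes it straightforward to compare the columns of a given ID against each other, which is the stated goal.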
jlhoward
0

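Maybe this could be enough, if I have understood the problem (note that the sampling has to be restricted to the rows of each ID):

resample <- function(v, n) v[sample.int(length(v), n, replace = TRUE)]  # safe when an ID has a single value
sapply(unique(data$ID), function(id) resample(data$Dollars[data$ID == id], 5))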