
I've been going in circles trying to solve this problem and am hoping the community can provide some inspiration.

I have a large data.table consisting of customer activity information represented as follows:

library(data.table)
library(dplyr)

DF = as.data.table(NULL)
cust_index = as.data.table(seq(1000,10000,3)) # list of unique customers
colnames(cust_index) = "cust_id"

# create a list of all customer activity - each cust_id represents an active event

for (cust in cust_index$cust_id){
  each_cust = as.data.table(rep(cust, sample(1:17,1, replace=FALSE)))
  DF = bind_rows(DF, each_cust)
  }
rm(each_cust)
colnames(DF) = "cust_id"
setkey(DF, cust_id)

# add dummy data for activity
DF[, A:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
DF[, B:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
DF[, C:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
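
As an aside, the same kind of dummy data can probably be built without the for loop, which gets slow because the table is re-copied on every bind_rows call. A minimal sketch, assuming the same id range and 1 to 17 events per customer; the names DF_fast and n_events are made up for illustration, and the seed is an added assumption for reproducibility:

```r
library(data.table)

set.seed(42)  # assumption: fixed seed, not in the original post

cust_id  <- seq(1000, 10000, 3)                            # unique customers
n_events <- sample(1:17, length(cust_id), replace = TRUE)  # events per customer

# repeat each id once per event, all in one vectorised call
DF_fast <- data.table(cust_id = rep(cust_id, times = n_events))
setkey(DF_fast, cust_id)

# add the dummy 0/1 activity columns by reference
for (col in c("A", "B", "C")) {
  set(DF_fast, j = col, value = sample(c(0, 1), nrow(DF_fast), replace = TRUE))
}
```

The result has the same shape as DF above: one row per event, keyed on cust_id, with columns A, B and C.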

I want to sample at most 4 observations (rows) per customer from DF; customers with fewer than 4 rows should keep all of their rows.

So far I have used a function which samples the observations relative to a single customer:

sample.cust = function(x){
  if (nrow(x)<4) {
    cust_sample = x 
  } else {
    cust_sample = x[sample(nrow(x), 4, replace=FALSE)]  # 4 random rows from all rows of x
  }
  return(cust_sample)
}

... which is called from within a for loop:

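train.sample = data.table()  # accumulator for the sampled rows
for (cust in cust_index$cust_id){
  # train.data plays the role of DF from the example above (keyed on cust_id)
  cust.sample = train.data[.(cust), sample.cust(.SD)]
  train.sample = bind_rows(train.sample, cust.sample)
}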

... however, on the full data set this for loop never seems to terminate.

I've tried all manner of := and set combinations to achieve this, so far without success. Any suggestions would be much appreciated; I imagine the solution will turn out to be rather trivial.

Many thanks, M.

Neil Lunn

1 Answer


A solution was posted as a comment on a now-deleted answer. It indexes DF using data.table's .I symbol, which holds the row numbers in DF of each group's rows:

DF[DF[,sample((.I), min(.N, 4), replace=FALSE), by=cust_id]$V1]

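While this was useful, it mishandles customers that have only one row. In that case .I is a single integer n, and sample(n, 1) draws from 1:n instead of returning n, so a row belonging to a different customer can be selected. Handling that case in a small helper function called from within data.table gives the correct result: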

resamp = function(.N, .I){
  if(.N==1) .I else sample((.I), min(.N, 4))
}

DF[ DF[, resamp(.N, .I), by="cust_id"]$V1]
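
The single-row edge case comes from base R's sample(): when its first argument is a single number n >= 1, it samples from 1:n rather than from the one-element vector. A minimal sketch of that behaviour, using a made-up row index idx (not taken from the data above):

```r
set.seed(1)  # assumption: fixed seed, only so the run is repeatable

idx <- 7L  # hypothetical .I value for a customer with exactly one row

# sample(7L, 1) is treated as sample(1:7, 1), so it can return any of 1..7
draws <- replicate(200, sample(idx, 1))
range(draws)

# indexing into the vector by a sampled position always returns idx itself
safe <- replicate(200, idx[sample.int(length(idx), 1)])
unique(safe)
```

This is why the helper above returns .I unchanged when .N is 1.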
  • Ah, good point. I always forget that edge case. For reference, the $V1 idea comes from here: http://stackoverflow.com/questions/16573995/subset-by-group-with-data-table/16574176#16574176 – Frank Mar 19 '17 at 14:03
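
Another option that should avoid the helper function altogether is to sample positions within each group with sample.int() and use them to index .I, since sample.int(.N, k) always draws from 1:.N. A minimal sketch on a small made-up table (the three customer ids and their row counts are illustrative only; rows and DF_sample are hypothetical names):

```r
library(data.table)

set.seed(1)  # assumption: fixed seed for a repeatable run

# toy data: customers with 1, 3 and 9 rows respectively
DF <- data.table(cust_id = rep(c(1000L, 1003L, 1006L), times = c(1L, 3L, 9L)))
DF[, A := sample(c(0, 1), .N, replace = TRUE)]

# per customer: pick up to 4 positions in 1:.N, then map them to row numbers in DF
rows <- DF[, .I[sample.int(.N, min(.N, 4L))], by = cust_id]$V1

DF_sample <- DF[rows]
DF_sample[, .N, by = cust_id]  # row counts per customer: 1, 3 and 4
```

Because the sampling happens on positions rather than on the row numbers themselves, a one-row group needs no special handling.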