I've been going in circles trying to solve this problem and am hoping the community can provide some inspiration.
I have a large data.table consisting of customer activity information represented as follows:
library(data.table)
library(dplyr)
DF = as.data.table(NULL)
cust_index = as.data.table(seq(1000,10000,3)) # list of unique customers
colnames(cust_index) = "cust_id"
# create a table of all customer activity - each row is one activity event for that cust_id
for (cust in cust_index$cust_id){
each_cust = as.data.table(rep(cust, sample(1:17,1, replace=FALSE)))
DF = bind_rows(DF, each_cust)
}
rm(each_cust)
colnames(DF) = "cust_id"
setkey(DF, cust_id)
# add dummy data for activity
DF[, A:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
DF[, B:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
DF[, C:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
I want to sample a maximum of 4 observations per customer from DF; customers with fewer than 4 observations should keep all of them.
So far I have used a function which samples the observations for a single customer:
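sample.cust = function(x){
  if (nrow(x) <= 4) {
    # 4 or fewer observations: keep them all
    cust_sample = x
  } else {
    # draw 4 distinct row positions from all of this customer's rows
    # (sample(1:4) would only shuffle the first four rows)
    cust_sample = x[sample(nrow(x), 4, replace=FALSE)]
  }
  return(cust_sample)
}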
.. which is called from within a for loop.
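# train.data is the real table, keyed on cust_id, with the same structure as DF above
train.sample = data.table() # start empty, then append each customer's sample
for (cust in cust_index$cust_id){
  cust.sample = train.data[.(cust), sample.cust(.SD)]
  train.sample = bind_rows(train.sample, cust.sample)
}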
.. however the above for loop never terminates.
I've tried all manner of := and set combinations to achieve this without success so far. Any suggestions would be much appreciated for what I imagine will be a rather trivial solution.
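For reference, one loop-free way to express "at most 4 rows per customer" is to let data.table do the grouping itself with by = cust_id and subset .SD inside each group. The following is only a minimal sketch under assumptions: it builds a small stand-in table (the ids vector and the seed are illustrative and not part of the real data), and it reuses the names DF and train.sample from the question.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Small stand-in for the real table: illustrative customers with 1 to 17 rows each
ids <- seq(1000, 1030, 3)
DF <- data.table(cust_id = rep(ids, times = sample(1:17, length(ids), replace = TRUE)))
DF[, A := sample(c(0, 1), .N, replace = TRUE)]

# Within each cust_id group, draw min(.N, 4) distinct row positions
# and use them to subset that group's rows (.SD)
train.sample <- DF[, .SD[sample(.N, min(.N, 4L))], by = cust_id]
```

Here .N is the number of rows in the current group, so customers with fewer than 4 rows keep everything and all others get 4 rows drawn without replacement. On very large tables, sampling row indices with .I and subsetting once may be faster than subsetting .SD per group, though that has not been measured here.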
Many thanks, M.