I am working with the BTYD model to generate forecasts of customers' future transactions. Unfortunately, because the model relies on MCMC methods, I cannot run the forecast on my whole customer base (hundreds of thousands of customers). I therefore have to split the base into many random samples and run the model on each of them to retrieve the forecast.

My idea was to use a loop to do the following:

  1. retrieve a random sample of 10,000 customers from the whole base (a data frame called "data")
  2. store the result in an object called "sample1"
  3. go back to "data", exclude the customers who are in "sample1", and store the result in "data"
  4. draw a new random sample ("sample2") from the reduced "data"
  5. create a new version of "data" excluding all customers included in "sample2" (and "sample1")
  6. ... continue this cycle until the base is exhausted and the N samples together contain the whole base.

(Every ID must be in one sample only).

Unfortunately, my code doesn't work the way I want (I am not very good with loops at the moment).


getwd()

data<-read.csv("MOCK_DATA (1).csv") 
# this is a fake dataset of 1000 rows that contains only 2 columns: 
# customer ID (column name: "id") and a random number (column name "value").
# Every customer ID appears only once in the dataset.

head(data)

set.sample.size<-100
num.cycles<-ceiling(nrow(data)/set.sample.size)

for(i in 1:(num.cycles)) {
 nam <- paste("sample_", i, sep = "")
 assign(nam, data[sample(nrow(data), set.sample.size), ])
 data<-data[!(data$id %in% nam$id),]
}

This code generates the following error: Error in nam$id : $ operator is invalid for atomic vectors

What I expect is to get 10 objects called "sample_1" to "sample_10", each made of 100 random IDs from the original data, with no ID shared between the 10 samples.

Parfait
GNicoletti
  • How should we load `read.csv("MOCK_DATA (1).csv") `? Please make questions [reproducible](https://stackoverflow.com/a/5963610/6574038) on Stack Overflow. – jay.sf Nov 09 '19 at 13:54

3 Answers


Consider randomly re-ordering the entire data frame by ID, then splitting it into blocks with an equal number of rows. The end result is one named list of data frames instead of many separate objects flooding your global environment.

set.seed(11092019)

# RE-ORDER DATA FRAME (SAME LENGTH)
data <- with(data, data[order(sample(id, nrow(data))),])

# BUILD A LIST OF DFs 
set.sample.size <- 100
data$cycles_group <- paste0("sample_", ceiling(1:nrow(data)/set.sample.size))

df_list <- split(data, data$cycles_group)

# RETRIEVE INDIVIDUAL DF BY NAME
df_list$sample_1
df_list$sample_2
df_list$sample_3
# ...

Alternatively, with by() you can split the samples and run each subset through your BTYD modeling process (similar to split + lapply):

results_list <- by(data, data$cycles_group, function(sub_df) {
   # ... do something with sub_df ...
})
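
As a minimal sketch of the split + lapply route mentioned above, the following uses a made-up data frame with the id and value columns described in the question. The function fit_one() is a hypothetical stand-in for whatever model call is actually run on each subset; it is not a BTYD function.

```r
set.seed(1)

# Made-up stand-in for the question's data: 1000 unique ids and a random value
data <- data.frame(id = 1:1000, value = runif(1000))

# Shuffle the rows, then label consecutive blocks of 100 rows
set.sample.size <- 100
data <- data[sample(nrow(data)), ]
data$cycles_group <- paste0("sample_", ceiling(seq_len(nrow(data)) / set.sample.size))
df_list <- split(data, data$cycles_group)

# Hypothetical placeholder for the per-sample model run
fit_one <- function(sub_df) {
  data.frame(n_customers = nrow(sub_df), mean_value = mean(sub_df$value))
}

# Apply the placeholder to every sample and stack the per-sample results
results_list <- lapply(df_list, fit_one)
results <- do.call(rbind, results_list)

# Check that every id lands in exactly one sample
all_ids <- unlist(lapply(df_list, function(d) d$id))
stopifnot(length(all_ids) == nrow(data), !anyDuplicated(all_ids))
```

Keeping the samples in a list means the per-sample results can be combined in one step, which is harder to do with separately named objects.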
Parfait
  • Very clear. It is true, using the list helps keep the global environment much tidier. Both the answers I have received solve the problem. Thank you very much for your help. – GNicoletti Nov 09 '19 at 14:40

Here's a reproducible example using the iris dataset

set.sample.size<-10

num.cycles<-ceiling(nrow(iris)/set.sample.size)


iris$id <- 1:150 


for(i in 1:(num.cycles)) {
  nam <- paste("sample_", i, sep = "")
  assign(nam, iris[sample(nrow(iris), set.sample.size), ])
  iris<-iris[!(iris$id %in% get(nam)$id),]
}

The only issue in your code is that nam$id doesn't make sense: nam is simply a string (the name of the data frame, not the data frame itself). Calling get(nam) retrieves the object with that name, so get(nam)$id works.
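
To illustrate the difference between a name and the object it refers to, here is a small sketch. The data frame sample_1 and its values are made up for the demonstration.

```r
# A small made-up data frame bound to the name "sample_1"
sample_1 <- data.frame(id = c(3, 1, 2))

# nam holds only the name, as a character string
nam <- "sample_1"

# `$` on a character vector raises an error; capture its message instead of stopping
msg <- tryCatch(nam$id, error = function(e) conditionMessage(e))
print(msg)

# get() looks up the object that the string names
ids <- get(nam)$id
print(ids)
```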

stevec
  • get() was exactly what my code was missing. A very interesting alternative to this was given by Parfait, using a list rather than a sequence of distinct objects. Thank you very much for your help. – GNicoletti Nov 09 '19 at 14:42

Here's a compact way to get a list of samples using mtcars as the dataset without using an explicit loop, with the sample size = 8:

n <- nrow(mtcars)
s <- sample(1:n, replace=FALSE)
sampsize <- 8
nsamps <- n / sampsize
m <- matrix(s, nrow = sampsize)
samps <- lapply(1:nsamps, function(x) mtcars[m[, x], ] )

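The vector s is a random permutation of the row numbers, so the rows are selected at random without repetition. Each column of the matrix m holds the row numbers for one sample. Note that this relies on the number of rows being a multiple of the sample size (32 and 8 here); otherwise matrix() recycles values and some rows would appear twice.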
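
If the number of rows is not a multiple of the sample size, one possible variant (a sketch, not part of the approach above) is to group the shuffled row numbers with split() instead of matrix(). The last sample is then simply smaller. The sample size of 10 is chosen here only because it does not divide 32 evenly.

```r
set.seed(42)

n <- nrow(mtcars)   # 32 rows
sampsize <- 10      # does not divide 32 evenly
s <- sample(n)      # random permutation of the row numbers

# Group the shuffled row numbers into consecutive chunks of up to sampsize
groups <- ceiling(seq_along(s) / sampsize)
idx <- split(s, groups)

# One data frame per chunk of row numbers
samps <- lapply(idx, function(i) mtcars[i, ])

# Number of rows in each sample
sizes <- sapply(samps, nrow)
print(sizes)
```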

SteveM