
I am trying to create 10 folds of my data. What I want is a data structure of length 10 (the number of folds) in which each element is an object with two attributes: the training set and the test set at that fold. Below is my R code. I wanted to access, for example, the training set at fold 8 with View(data_pairs[[8]]$training_set), but it did not work. Any help would be appreciated :)

k <- 10    # number of folds

i <- 1:k  


folds <- sample(i, nrow(data), replace = TRUE)

data_pairs <- list()

for (j in i) {
  
  test_ind <- which(folds==j,arr.ind=TRUE)
  
  test <- data[test_ind,]
  train <- data[-test_ind,]
  
  data_pair <- list(training_set = list(train), test_set = list(test))
  
  data_pairs <- append(x = data_pairs, values = data_pair)
  
}
Cem
Mohamed Yehia

2 Answers


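You were very close; you just needed to wrap values in a list() call. Without it, append() splices the two elements of data_pair into data_pairs separately instead of adding data_pair as a single element, so data_pairs[[8]] is not the pair for fold 8.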

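k <- 10    # number of folds

i <- 1:k
# randomly assign each row to one of the k folds (fold sizes will vary)
folds <- sample(i, nrow(mtcars), replace = TRUE)

data_pairs <- list()

for (j in i) {

  # logical mask of the rows in fold j; unlike -which(...), negating it
  # still gives the full data as training set if a fold happens to be empty
  is_test <- folds == j

  test <- mtcars[is_test, ]
  train <- mtcars[!is_test, ]
  data_pair <- list(training_set = train, test_set = test)
  # wrap in list() so the pair is appended as a single element
  data_pairs <- append(x = data_pairs, values = list(data_pair))
  #data_pairs <- c(data_pairs, list(data_pair))
}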
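
To see why the list() wrapper matters, here is a small sketch comparing the two calls. The names pair, without_wrap and with_wrap are made up for this illustration, and the row ranges are arbitrary.

```r
pair <- list(training_set = mtcars[1:5, ], test_set = mtcars[6:10, ])

# Without list(): append() splices in the two elements of pair separately
without_wrap <- append(list(), pair)
length(without_wrap)   # 2
names(without_wrap)    # "training_set" "test_set"

# With list(): pair is added as one element that keeps both attributes
with_wrap <- append(list(), list(pair))
length(with_wrap)      # 1
names(with_wrap[[1]])  # "training_set" "test_set"
```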

If your data is big, I would suggest reading these two posts on more efficient ways to grow a list:

  1. Append an object to a list in R in amortized constant time, O(1)?
  2. Here we go again: append an element to a list in R
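
As a rough illustration of the idea (not taken from those posts), one simple option when the number of folds is known in advance is to allocate the list up front and fill it by index. This is only a sketch on mtcars; set.seed(42) is there just to make the example reproducible.

```r
k <- 10
set.seed(42)  # only for reproducibility of this sketch
folds <- sample(1:k, nrow(mtcars), replace = TRUE)

# Allocate all k slots up front instead of growing the list in the loop
data_pairs <- vector("list", k)

for (j in seq_len(k)) {
  is_test <- folds == j
  data_pairs[[j]] <- list(training_set = mtcars[!is_test, ],
                          test_set     = mtcars[is_test, ])
}

# Training set at fold 8
head(data_pairs[[8]]$training_set)
```

Because each slot is assigned directly, data_pairs[[8]]$training_set works as in the question and nothing is copied on every iteration.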

I would also like to point out that you are not really creating "folds" of your data. sample(i, nrow(data), replace = TRUE) assigns each row to a fold independently, so the fold sizes vary and a fold can even end up empty. In 10-fold cross-validation the data should be separated into 10 roughly equal-sized chunks. You then create 10 train/test data sets, using each fold in turn as the test data and the rest for training.
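
One common way to get roughly equal-sized folds is to repeat the fold labels until there is one per row and then shuffle them. The following is a minimal sketch on mtcars, not the only way to do it; set.seed(42) is only for reproducibility.

```r
k <- 10
n <- nrow(mtcars)
set.seed(42)  # only for reproducibility of this sketch

# Repeat the labels 1..k until there are n of them, then shuffle;
# fold sizes differ by at most one row
folds <- sample(rep(seq_len(k), length.out = n))
table(folds)

# One train/test pair per fold: fold j is the test set, the rest is training
data_pairs <- lapply(seq_len(k), function(j) {
  list(training_set = mtcars[folds != j, ],
       test_set     = mtcars[folds == j, ])
})
```

Every row then appears in exactly one test set, and data_pairs[[8]]$training_set is the training set for fold 8.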

emilliman5

It seems like the package modelr could help you here. In particular I would point you to:

https://modelr.tidyverse.org/reference/resample_partition.html

library(modelr)
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))

mod <- lm(mpg ~ wt, data = ex$train)

rmse(mod, ex$test)
#> [1] 3.229756
rmse(mod, ex$train)
#> [1] 2.88216

Alternatively, crossv_mc(data, n, test = 0.2, id = ".id") produces a data frame of n random train/test partitions.
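
modelr also has crossv_kfold(), which is closer to the 10-fold setup in the question. A minimal sketch, assuming modelr is installed; the names cv, train_8 and test_8 are made up for this example.

```r
library(modelr)

set.seed(1)  # only for reproducibility of this sketch
cv <- crossv_kfold(mtcars, k = 10)

# cv has one row per fold, with list-columns train and test holding
# resample objects; as.data.frame() turns one into a regular data frame
train_8 <- as.data.frame(cv$train[[8]])
test_8  <- as.data.frame(cv$test[[8]])

nrow(train_8) + nrow(test_8)  # equals nrow(mtcars)
```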

BrianLang