I have this data frame in R:

my_data = data.frame(id = 1:100, var = rnorm(100,100,100))

I want to make two lists (my_list_1, my_list_2) that store data from this frame like this:

my_list_1 = list()
my_list_2 = list()

my_list_1[[1]] = my_data[1:10,]
my_list_2[[1]] = my_data[11:15,]
my_list_1[[2]] = my_data[1:15,]
my_list_2[[2]] = my_data[16:20,]
my_list_1[[3]] = my_data[1:20,]
my_list_2[[3]] = my_data[21:25,]
# etc.

This continues until all rows of my_data have been used. As we can see:

  • my_list_1 always starts from the first row of my_data and grows by 5 rows each time
  • my_list_2 starts right after the last row of the matching my_list_1 element and takes only the next 5 rows

I tried to solve the problem like this using the ceiling function in R:

# set the size of the chunks
chunk_size_1 = 10
chunk_size_2 = 5

# loop through the rows of my_data, adding chunks to the lists
for (i in 1:nrow(my_data)) {
  # calculate the current chunk numbers
  chunk_num_1 = ceiling(i / chunk_size_1)
  chunk_num_2 = ceiling((i + (chunk_size_2 - 1)) / chunk_size_2)
  
  # add the current row to the appropriate chunk
  if (!exists(paste0("my_list_1[", chunk_num_1, "]"))) {
    my_list_1[[chunk_num_1]] = my_data[1:i,]
  } else {
    my_list_1[[chunk_num_1]] = my_data[1:(chunk_num_1 * chunk_size_1),]
  }
  
  if (!exists(paste0("my_list_2[", chunk_num_2, "]"))) {
    my_list_2[[chunk_num_2]] = my_data[(i+1):(i+chunk_size_2),]
  } else {
    my_list_2[[chunk_num_2]] = my_data[((chunk_num_2-1) * chunk_size_2 + 1):(chunk_num_2 * chunk_size_2),]
  }
}

# remove NAs from lists (https://stackoverflow.com/questions/25777104/remove-na-from-list-of-lists)
my_list_1 = lapply(my_list_1, function(x) x[!is.na(x)])
my_list_2 = lapply(my_list_2, function(x) x[!is.na(x)])

The code runs, but I am not sure whether it does what I described above.

Can someone please show me how to do this?

Thanks!

Note:

  • The long-term goal of this question is to use these lists for training time series models using the "rolling window validation approach" (R: Understanding Time Series Functions in R). my_list_1 would serve as the training set and my_list_2 would serve as the test set.

  • That is, I would fit a time series model (e.g. arima) to my_list_1[[1]] (e.g. fit1 = auto.arima(my_list_1[[1]]$var)), use this model to predict the data in my_list_2[[1]] (e.g. fc = forecast(fit1, h = 5)), and then record the error (e.g. RMSE, MAE) by comparing the true values to the predicted values.

  • Then, I would fit a time series model (e.g. arima) to my_list_1[[2]], use this model to predict the data in my_list_2[[2]], and record the error (e.g. RMSE, MAE).

  • Once I have gone through all positions in my_list_1 and my_list_2, I would average the errors (i.e. average RMSE and average MAE) over all positions.
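
The procedure in the note above could be sketched roughly as follows. This is only an illustration under stated assumptions: base R's arima() with a fixed AR(1) order stands in for auto.arima (which comes from the forecast package), and the names horizon, train_ends and error_table are hypothetical, not part of the original code.

```r
# Rolling-origin evaluation sketch. The AR(1) model fitted with stats::arima
# is an assumption made to keep the example free of extra packages.
set.seed(1)
my_data <- data.frame(id = 1:100, var = rnorm(100, 100, 100))

horizon <- 5  # rows in each test window
# Last row of each training window: 10, 15, ..., 95
train_ends <- seq(from = 10, to = nrow(my_data) - horizon, by = horizon)

errors <- lapply(train_ends, function(end) {
  train <- my_data$var[seq_len(end)]               # rows 1..end
  test  <- my_data$var[(end + 1):(end + horizon)]  # the next 5 rows
  fit   <- arima(train, order = c(1, 0, 0), method = "ML")
  pred  <- as.numeric(predict(fit, n.ahead = horizon)$pred)
  c(rmse = sqrt(mean((test - pred)^2)),
    mae  = mean(abs(test - pred)))
})

# One row per window; the column means are the averaged errors
error_table <- do.call(rbind, errors)
avg_errors <- colMeans(error_table)
print(avg_errors)
```

Working from row indices rather than pre-built lists avoids storing many overlapping copies of the data; the lists described in the question would give the same windows.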

stats_noob
  • Why do you need this? What do you need it for? There must be a bigger problem to solve and probably better to use functions that solve the problem – Onyambu Mar 17 '23 at 04:17

1 Answer

Here's one way to do that:

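# End row of each training set: 10, 15, ..., 100
x <- seq(from = 10, to = 100, by = 5)
# Start row of each test set: 11, 16, ..., 96 (no rows are left to test after row 100)
y <- x[-length(x)] + 1

# Training sets always start at row 1 (the \(x) lambda syntax needs R >= 4.1)
my_list_1 <- lapply(x, \(x) my_data[seq_len(x), ])
# Test sets are the 5 rows that follow each training set
my_list_2 <- lapply(y, \(x) my_data[seq(from = x, by = 1, length.out = 5), ])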
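
A quick way to check the result is sketched below. It rebuilds the two lists with the same index vectors, written with function() instead of the lambda shorthand so it also runs on older R versions. The names train_last and test_first are hypothetical helpers. One thing the check shows is that my_list_1 ends up one element longer than my_list_2, because the final training set (rows 1 to 100) has no rows left to test on, so it may be worth dropping it before pairing the lists.

```r
set.seed(1)
my_data <- data.frame(id = 1:100, var = rnorm(100, 100, 100))

x <- seq(from = 10, to = 100, by = 5)
y <- x[-length(x)] + 1
my_list_1 <- lapply(x, function(n) my_data[seq_len(n), ])
my_list_2 <- lapply(y, function(s) my_data[seq(from = s, by = 1, length.out = 5), ])

length(my_list_1)  # 19
length(my_list_2)  # 18

# Drop the unmatched last training set so the lists pair up one-to-one
my_list_1 <- my_list_1[seq_along(my_list_2)]

# Each test window should begin on the row right after its training window
train_last <- vapply(my_list_1, function(d) max(d$id), integer(1))
test_first <- vapply(my_list_2, function(d) min(d$id), integer(1))
stopifnot(all(test_first == train_last + 1L))
```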
Mwavu