I have this data frame in R:
my_data = data.frame(id = 1:100, var = rnorm(100,100,100))
I want to make two lists (my_list_1, my_list_2) that store data from this frame like this:
my_list_1 = list()
my_list_2 = list()
my_list_1[[1]] = my_data[1:10,]
my_list_2[[1]] = my_data[11:15,]
my_list_1[[2]] = my_data[1:15,]
my_list_2[[2]] = my_data[16:20,]
my_list_1[[3]] = my_data[1:20,]
my_list_2[[3]] = my_data[21:25,]
# etc.
This continues until all rows of my_data have been exhausted. As we can see:
my_list_1 always starts from the first row of my_data and grows by 5 rows at each step
my_list_2 starts where my_list_1 left off and takes only the next 5 rows
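The pattern described above can be written as plain index arithmetic. This is only a minimal sketch, assuming each test window has 5 rows (as in the example rows 11:15, 16:20, ...) and that a window is dropped when a full test set no longer fits. The names initial_size, step_size and train_ends are made up for illustration, and the seed is there only for reproducibility.

```r
set.seed(1)  # assumption: seed added only so the example is reproducible
my_data <- data.frame(id = 1:100, var = rnorm(100, 100, 100))

initial_size <- 10  # rows in the first training window
step_size    <- 5   # rows added to training per step, and rows per test window

# Last row of each training window: 10, 15, 20, ..., 95
# (stops early enough that a full test window still fits)
train_ends <- seq(initial_size, nrow(my_data) - step_size, by = step_size)

# Training windows always start at row 1 and grow by step_size
my_list_1 <- lapply(train_ends, function(e) my_data[1:e, ])

# Test windows are the step_size rows right after each training window
my_list_2 <- lapply(train_ends, function(e) my_data[(e + 1):(e + step_size), ])

length(my_list_1)        # number of train/test pairs
sapply(my_list_1, nrow)  # training sizes per window
```

With 100 rows this should give 18 pairs, the last one training on rows 1:95 and testing on rows 96:100.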
I tried to solve the problem like this using the ceiling function in R:
# set the size of the chunks
chunk_size_1 = 10
chunk_size_2 = 5
# loop through the rows of my_data, adding chunks to the lists
for (i in 1:nrow(my_data)) {
  # calculate the current chunk numbers
  chunk_num_1 = ceiling(i / chunk_size_1)
  chunk_num_2 = ceiling((i + (chunk_size_2 - 1)) / chunk_size_2)
  # add the current row to the appropriate chunk
  if (!exists(paste0("my_list_1[", chunk_num_1, "]"))) {
    my_list_1[[chunk_num_1]] = my_data[1:i,]
  } else {
    my_list_1[[chunk_num_1]] = my_data[1:(chunk_num_1 * chunk_size_1),]
  }
  if (!exists(paste0("my_list_2[", chunk_num_2, "]"))) {
    my_list_2[[chunk_num_2]] = my_data[(i+1):(i+chunk_size_2),]
  } else {
    my_list_2[[chunk_num_2]] = my_data[((chunk_num_2-1) * chunk_size_2 + 1):(chunk_num_2 * chunk_size_2),]
  }
}
# remove NAs from lists (https://stackoverflow.com/questions/25777104/remove-na-from-list-of-lists)
my_list_1 = lapply(my_list_1, function(x) x[!is.na(x)])
my_list_2 = lapply(my_list_2, function(x) x[!is.na(x)])
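Two pieces of the attempt above may not behave the way they look. As far as I can tell, exists() checks for a variable with that literal name rather than a list element, and indexing a data frame with !is.na(x) flattens it to a plain vector. A small sketch on a hypothetical 3-row data frame (the names flattened and cleaned are made up for illustration):

```r
my_list_1 <- list()
my_list_1[[1]] <- data.frame(id = 1:3, var = c(10, NA, 30))

# exists() looks for a variable literally named "my_list_1[1]",
# not for element 1 of my_list_1
exists("my_list_1[1]")   # FALSE even though the element is there
length(my_list_1) >= 1   # TRUE: one way to test whether element 1 is set

# Indexing a data frame with a logical matrix returns a plain vector
flattened <- my_list_1[[1]][!is.na(my_list_1[[1]])]
is.data.frame(flattened) # FALSE

# complete.cases() drops incomplete rows and keeps the data frame structure
cleaned <- my_list_1[[1]][complete.cases(my_list_1[[1]]), ]
is.data.frame(cleaned)   # TRUE
```

So the if branches in the loop are always taken, and the NA clean-up step turns every list element into an unnamed numeric vector.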
The code seems to run, but I am not sure whether it produces the lists I described.
Can someone please show me how to do this?
Thanks!
Note:
The long-term goal of this question is to use these lists for training time series models using the "rolling window validation approach" (R: Understanding Time Series Functions in R).
my_list_1 would serve as the training set and my_list_2 would serve as the test set.
That is, I would fit a time series model (e.g. arima) to my_list_1[1] (e.g. fit1 = auto.arima(my_list_1[1])), use this model to predict the data from my_list_2[1] (e.g. forecast = forecast(fit1, h = 5)), and then record the error (e.g. RMSE, MAE) by comparing the true values to the predicted values.
Then, I would fit a time series model (e.g. arima) to my_list_1[2], use this model to predict the data from my_list_2[2], and record the error (e.g. RMSE, MAE).
Once I have finished all positions within my_list_1 and my_list_2, I would take the average errors (i.e. average RMSE and average MAE) over all positions.
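To make the intended evaluation loop concrete, here is a rough sketch of it under several assumptions of mine: it uses only the var column as the series, it fits a fixed AR(1) model with base R's arima() as a stand-in for auto.arima() (so it runs without the forecast package), and it forecasts as many steps as each test window has rows (5). The names step_size, train_ends and errors are made up for illustration.

```r
set.seed(1)  # assumption: seed added only so the example is reproducible
my_data <- data.frame(id = 1:100, var = rnorm(100, 100, 100))

step_size  <- 5
train_ends <- seq(10, nrow(my_data) - step_size, by = step_size)

# One row of errors per train/test pair
errors <- t(sapply(train_ends, function(e) {
  train <- my_data$var[1:e]                       # expanding training window
  test  <- my_data$var[(e + 1):(e + step_size)]   # next step_size rows

  # Assumption: a fixed AR(1) model stands in for auto.arima()
  fit  <- arima(train, order = c(1, 0, 0), method = "ML")
  pred <- as.numeric(predict(fit, n.ahead = step_size)$pred)

  c(rmse = sqrt(mean((test - pred)^2)),
    mae  = mean(abs(test - pred)))
}))

colMeans(errors)  # average RMSE and MAE over all windows
```

If the forecast package is available, the arima()/predict() pair could presumably be swapped for auto.arima() and forecast(fit, h = step_size), taking the point forecasts from the $mean component.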