I'm working on a dataset with 5,500,000 rows. Each subject has roughly 15 rows (one piece of information per row), and I want to extract features per subject with a function.
Running the function on the whole dataset at once is too computationally intensive, so I want to split the data into batches (say, 3,000 subjects each) and process them one by one.
My code is:
library(dplyr)

grouped_data = data %>% group_split(group_ID)   # one data frame per subject
data_list = list()
n = length(grouped_data)
for (i in 1:10) {
  print(paste0("Working on dataset_", i))
  # bind the rows for the i-th tenth of the subjects
  temp_file = do.call("rbind", grouped_data[(floor(n / 10 * (i - 1)) + 1):floor(n / 10 * i)])
  data_list[[i]] = FunctionX(temp_file)
}
My intention is to keep the computational load under control by changing the range of i, while saving the results of each batch.
However, I could not even run the first step because it exceeded my memory limit.
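To make "saving the results of each batch" concrete, this is roughly the pattern I had in mind at the end of each iteration of the loop above (just a sketch; the features_batch_ file name and the rm()/gc() calls are only illustrative):

# at the end of each loop iteration, after FunctionX has run on the batch
data_list[[i]] = FunctionX(temp_file)
saveRDS(data_list[[i]], paste0("features_batch_", i, ".rds"))  # keep a copy on disk
rm(temp_file)  # drop the raw batch
gc()           # release memory before the next iteration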
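One idea I have been wondering about is whether I could skip group_split entirely and just subset the data by subject IDs in batches, something like the following (an untested sketch; it assumes data is a data frame with a group_ID column and that FunctionX accepts a data frame containing the rows for several subjects):

library(dplyr)

ids = unique(data$group_ID)
id_batches = split(ids, ceiling(seq_along(ids) / 3000))  # batches of 3,000 subjects

data_list = list()
for (i in seq_along(id_batches)) {
  print(paste0("Working on batch_", i))
  temp_file = data %>% filter(group_ID %in% id_batches[[i]])  # rows for this batch only
  data_list[[i]] = FunctionX(temp_file)
}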
Could anyone please suggest any ideas? Thank you in advance.