
I'm working on a dataset with 5,500,000 rows. Each subject has ~15 rows of information, and I want to extract features per subject with a function.

It is too computationally intensive to run the function on the whole dataset at once, so I want to split the data into batches (say, 3000 subjects each) and run them one by one.

My code is:

# Split the data into one data frame per subject
grouped_data = data %>% group_split(group_ID)

data_list = list()

for (i in 1:10) {
  print(paste0("Working on dataset_", i))

  # Recombine one tenth of the subjects into a single data frame for this batch
  temp_file = do.call("rbind", grouped_data[(floor(length(grouped_data)/10 * (i - 1)) + 1):(floor(length(grouped_data)/10 * i))])

  data_list[[i]] = FunctionX(temp_file)
}

My intention is to control the computational load by changing the range of i, while saving the results of each batch.

However, I could not even run the first step (the group_split() call) due to the memory limit.

Could anyone please suggest an idea? Thank you in advance.

Rik
  • In your first line, you effectively make a copy of `data`. That's not a great start if you're having memory problems. An alternative *might* be to use tidyverse's `group_map()` on the original `data`. `group_map` applies a function (supplied as an argument) to each group of a grouped data frame (see the first sketch after these comments). Other variants, such as `group_walk`, may be more appropriate for your use case. You maximise your chance of getting a useful answer if you provide a minimal reproducible example. [This post](https://stackoverflow.com/help/minimal-reproducible-example) may help. – Limey Jun 01 '22 at 08:10
  • What is the file extension? If you can read the file as a csv or a txt-like format, there are functions that can read such files only up to a given line, which reduces memory use considerably. I do not know whether the same can be done for other file types (see the second sketch after these comments). – Sebek Jun 01 '22 at 08:31
  • Thanks. My file is .feather. I realized that this is a huge area that I have not learnt so far. I will read the links that you kindly provided. – Rik Jun 01 '22 at 23:50
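
Following up on Limey's comment, here is a minimal sketch of the `group_map()` approach. It assumes `FunctionX()` can take the rows of a single subject and return that subject's features (e.g. as a one-row data frame); `group_map()` then applies it group by group without first building a separate `group_split()` copy of `data`:

library(dplyr)

results <- data %>%
  group_by(group_ID) %>%
  group_map(~ FunctionX(.x), .keep = TRUE)   # .x holds one subject's rows

# `results` is a list with one element per subject; combine if
# FunctionX() returns data frames:
features <- bind_rows(results)

If holding all results in memory is still a concern, `group_walk()` could instead write each result to disk as it is computed.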
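
And a hedged sketch along the lines of Sebek's suggestion, adapted to the .feather file mentioned above: the arrow package can open the file lazily, so only one batch of subjects is pulled into memory at a time. The file name is a placeholder; the batch size of 3000 and the assumption that `FunctionX()` accepts a data frame containing many subjects are taken from the question:

library(arrow)
library(dplyr)

# "data.feather" is a placeholder path; nothing is read into RAM yet
ds <- open_dataset("data.feather", format = "feather")

# Collect only the subject IDs, then split them into batches of ~3000
ids <- ds %>% select(group_ID) %>% collect() %>% distinct() %>% pull(group_ID)
id_batches <- split(ids, ceiling(seq_along(ids) / 3000))

data_list <- list()
for (i in seq_along(id_batches)) {
  print(paste0("Working on batch_", i))

  # Only this batch of subjects is read into memory
  temp_file <- ds %>%
    filter(group_ID %in% id_batches[[i]]) %>%
    collect()

  data_list[[i]] <- FunctionX(temp_file)
  # saveRDS(data_list[[i]], paste0("features_batch_", i, ".rds"))  # optional: keep each batch's result on disk
}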

0 Answers