
I'm working with a data.frame of about 2 million rows. I need to group the rows and apply functions to each group, and I was using split.data.frame and modify for that.

Unfortunately, split.data.frame alone breaks the memory limit. I'm working on my company's server, so I can't install a new R version or add any memory.

I think I can multi-thread the modify part, but first the splitting needs to succeed.

What else can I try?

PanikLIji
  • 2M rows is hardly a large data.frame. Or are there hundreds of columns involved? – Wimpel Apr 20 '21 at 11:03
  • Just 50. But it splits into a list of 500,000 data.frames totalling about 4 GB, and RStudio just freezes... – PanikLIji Apr 20 '21 at 11:11
  • You can try to post some sample data, the code you are currently using, and your desired output. This way, we can see whether more efficient approaches are possible. Here (https://stackoverflow.com/a/5963610/6356278) is how to create such a minimal example – Wimpel Apr 20 '21 at 11:25
  • convert your data.frame into a data.table with `as.data.table(df)`. This should lift some of the burden off your memory. You will need to install the data.table package first: `install.packages("data.table")` – NicolasH2 Apr 20 '21 at 11:36
  • @Solarion while `data.table` is very efficient in terms of speed, it is not always the best choice when working with a small amount of memory. It still might be the best approach here, but speed does not always equal optimal memory usage. – Wimpel Apr 20 '21 at 11:40
  • @Wimpel, I've found `data.table` generally to be *better* with memory constraints, simply due to its referential semantics. What makes you think that it could perform worse (under some circumstances)? – r2evans Apr 20 '21 at 11:48
  • Split the grouping variable rather than the data frame, then loop through the individual groups so that only one group is worked on at a time. – G. Grothendieck Apr 20 '21 at 11:52
  • @r2evans, [this answer](https://stackoverflow.com/a/61250376/6356278) comes to mind (but the problem there might just be my coding) – Wimpel Apr 20 '21 at 11:54
  • @Wimpel, interesting, thanks. I would not have thought `copy(.)` would be necessary in that use, and it may be contributing to memory growth of the solution. I don't know for certain, I am also not proficient in `profvis` to be able to model what's going on. Thanks. – r2evans Apr 20 '21 at 12:11
  • While I agree that @G.Grothendieck's suggestion should avoid some memory problems, I don't know if @Wimpel's concern about `data.table`'s memory usage is related to the other linked code or something more global. I work regularly on 2M+ rows, and `data.table`'s grouping seems fairly memory-efficient to me, so I believe its grouping operations are a worthwhile thing to try. If you post a *small* (20x10?) sample of your data with 2-3 such groups, and an example of the functions you need for aggregation, we should be able to recommend something more concrete. – r2evans Apr 20 '21 at 12:28
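A minimal sketch of G. Grothendieck's suggestion: split only the row indices by group, then loop, so that only one group's rows are materialised at a time instead of 500,000 sub-data.frames at once. The column names (`group`, `value`) and the per-group function (`sum`) are invented here, since the question doesn't show the real data:

```r
# Toy data standing in for the real 2M-row data.frame; column names are made up.
df <- data.frame(group = rep(c("a", "b", "c"), each = 4),
                 value = 1:12)

# Split the row indices, not the data.frame itself. A list of integer
# vectors is far smaller than a list of data.frames.
idx <- split(seq_len(nrow(df)), df$group)

results <- vector("list", length(idx))
names(results) <- names(idx)
for (g in names(idx)) {
  chunk <- df[idx[[g]], , drop = FALSE]  # only this group's rows in memory
  results[[g]] <- sum(chunk$value)       # stand-in for the real per-group function
}
```

Because each `chunk` is discarded (and garbage-collectable) before the next iteration, peak memory stays close to the full table plus one group, rather than the full table plus a complete split copy.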
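The `data.table` grouping mentioned in the comments can be sketched like this (again with invented column names; the real aggregation function would replace `sum`):

```r
library(data.table)  # install.packages("data.table") first if needed

# Toy stand-in for the real table; setDT()/as.data.table() would convert
# the existing data.frame instead.
dt <- data.table(group = rep(c("a", "b"), each = 3), value = 1:6)

# Grouped aggregation runs internally, one result row per group,
# without ever building a list of per-group data.frames.
res <- dt[, .(total = sum(value)), by = group]
```

Note that `setDT(df)` converts an existing data.frame by reference, avoiding the extra copy that `as.data.table(df)` makes, which matters under the memory constraints described in the question.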

0 Answers