
I'm working with large datasets in R and I need to find effective strategies to handle them without running out of memory. As the datasets grow in size, I want to ensure that my R scripts and computations can handle the data efficiently.

I have tried loading the entire dataset into memory with functions like read.csv() or data.table::fread(), but this often leads to memory allocation errors. I have also explored techniques such as chunk processing and database connections, but I'm not sure whether they are the best approaches for my specific scenario.
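To make the chunk-processing idea concrete, the kind of approach I have in mind looks roughly like this (a minimal sketch using readr; "big.csv" and the `value` column are placeholders for my real data):

```r
library(readr)

# Read the file in chunks and reduce each chunk to a small summary,
# so only the summaries (not the full data) stay in memory.
chunk_summaries <- read_csv_chunked(
  "big.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(rows = nrow(chunk), total = sum(chunk$value, na.rm = TRUE))
  }),
  chunk_size = 100000
)

# Combine the per-chunk summaries into overall totals.
colSums(chunk_summaries)
```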

FriedRobot
  • There have been several similar questions here in the past. Your question shows no evidence that you have conducted any searching here or on the wider web. Maybe you omitted this information and can easily fill us in on that search. (Thorough searching is expected!) – IRTFM Jun 26 '23 at 05:05
  • I'd suggest reading relevant parts of the [CRAN Task View on High Performance Computing](https://cran.r-project.org/web/views/HighPerformanceComputing.html). Overall, I think this question is far too broad. You say you aren't sure if databases or chunk processing *"are the most optimal approaches for my specific scenario,"* but there's no detail about your scenario in your question. – Gregor Thomas Jun 26 '23 at 05:44
  • But generally R works best with in-memory data. So the general directions to look in are (a) use a compute resource with enough memory for your data (e.g., the cloud) and (b) reduce your data size to fit on your computer (downsample, work in chunks). If you want to diagnose your code to identify areas for improvement, you can try some [memory profiling](https://cran.r-project.org/web/packages/profmem/vignettes/profmem.html). – Gregor Thomas Jun 26 '23 at 05:48
  • https://stackoverflow.com/a/76457265/646761 , https://stackoverflow.com/a/68693819/646761 , ... – margusl Jun 26 '23 at 06:26
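For reference, the memory-profiling suggestion above can be tried with a sketch along these lines (assumes the profmem package is installed and an R build with memory profiling enabled, which is the default for the CRAN binaries):

```r
library(profmem)

# Profile the allocations made while evaluating an expression.
p <- profmem({
  x <- numeric(1e6)   # allocates roughly 8 MB
  y <- sqrt(x)        # allocates roughly another 8 MB
})

p          # one row per allocation, with size and call stack
total(p)   # total bytes allocated inside the expression
```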

1 Answer


The best option will depend on your particular use case, but here are some other ideas (in addition to the remote database and chunk-processing approaches you mentioned):

  • get more RAM, or use a computer that already has more of it (this one goes without saying)
  • use a cloud computing platform (ideally one that specialises in big data); cloud machines typically offer several times the memory of a local workstation
  • sample the dataset and work with a random subset while developing your code
  • convert values into the most space-efficient versions of themselves, e.g. if a double column only contains whole numbers, switch to the integer type; if a date column is encoded as a datetime or as a string, convert it to a Date instead (credit: Gregor Thomas). A short sketch of these last two ideas follows this list.
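Here is that sketch, using data.table ("big.csv" and the column names `id` and `timestamp` are hypothetical placeholders):

```r
library(data.table)

dt <- fread("big.csv")

# Downsample: develop against a 10% random subset of rows.
set.seed(42)
dt_sample <- dt[sample(.N, floor(.N * 0.1))]

# Store whole-number doubles as integers (4 bytes per value instead of 8).
dt[, id := as.integer(id)]

# Store a datetime column that only carries a date as Date rather than POSIXct.
dt[, day := as.Date(timestamp)]

# Check the memory footprint after the conversions.
format(object.size(dt), units = "MB")
```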
Mark
  • Your last point about repeated "John"s isn't very accurate. R has had the `factor` class for decades, which does what you describe under the hood, and has used a [global string pool](https://adv-r.hadley.nz/names-values.html#character-vectors) for `character` vectors for at least 10 years (maybe longer, maybe since the beginning, I'm not really sure). – Gregor Thomas Jun 26 '23 at 05:25
  • (Though obviously there could be other ways of cutting down data size more effectively - using integers instead of numerics, dates instead of datetimes, etc.) – Gregor Thomas Jun 26 '23 at 05:28
  • ah true! I didn't think of that @GregorThomas. Thanks for pointing that out. Updated – Mark Jun 26 '23 at 05:30
  • My point about the global string pool is that it works roughly the same way as factors do, just on a global scale. The differences between strings, factors, and the lookup tables you proposed at first will all be pretty negligible. – Gregor Thomas Jun 26 '23 at 05:41
  • Feel free to edit it to something better @GregorThomas! I was just throwing out ideas – Mark Jun 26 '23 at 05:44
  • Thank you for sharing that btw! I wasn't aware of the global string pool – Mark Jun 26 '23 at 05:44
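
To illustrate the factor / global string pool point from these comments, a rough comparison (exact sizes vary by platform and R version):

```r
# 300,000 strings drawn from three distinct values.
x_chr <- rep(c("John", "Jane", "Alice"), times = 1e5)
x_fct <- factor(x_chr)

object.size(x_chr)  # ~2.4 MB: one 8-byte pointer per element, each string cached once
object.size(x_fct)  # ~1.2 MB: one 4-byte integer code per element plus the levels
```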