
I am trying to load a large CSV file (226M rows by 38 columns) into 64-bit R using the data.table package. The file is about 27 GB on disk, and I am working on a server with 64 GB of RAM. I shut down most everything else and started a fresh R/RStudio session, so when I start the fread only about 2 GB of memory is in use. As the read progresses, I watch memory usage climb to about 45.6 GB, and then I get the dreaded Error: cannot allocate vector of size 1.7 Gb. Yet more than 18 GB still shows as available. Is it possible that within those 18 GB of RAM there isn't a single contiguous block of 1.7 GB? Does it have to do with the committed size (which I admit to not fully understanding), and if so, is there any way to minimize the committed size so that enough space remains for the read to finish?
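For scale, here is a rough worst-case back-of-the-envelope estimate of the fully loaded size, assuming every one of the 38 columns ends up stored as an 8-byte double (an assumption for illustration only; fread may well choose 4-byte integers or character for some columns, which changes the total):

    # Worst-case in-memory size if all 38 columns were 8-byte doubles
    226e6 * 38 * 8 / 2^30
    # roughly 64 GB, i.e. about the machine's entire 64 GB of RAM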

The file comprises the history of a cohort of users, for whom I want to aggregate and summarize certain statistics over time. I have been able to import a subset of the 38 columns using the select argument of fread, so I am not at a complete loss, but it does mean that if I need to work with other variables I will have to pick and choose, and may eventually run into the same error.
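For reference, a minimal sketch of that column-subset approach, using hypothetical column names (`user_id`, `event_date`, `value`) since the real ones aren't shown here:

    library(data.table)

    # Read only the columns needed right now; the names below are
    # placeholders for whichever of the 38 columns are actually required.
    dt <- fread("big_file.csv",
                select = c("user_id", "event_date", "value"))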

Given this setup, are there other ways to get the entire dataset into memory, or will I either need to keep importing subsets or move to a big-data-friendly platform?

Thank you.

Memory Usage Prior to Read

[screenshot: memory usage prior to read]

Memory Usage at Failure

[screenshot: memory usage at failure]

Session Info

R version 3.3.0 Patched (2016-05-11 r70599)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] tools_3.3.0  chron_2.3-47
– Avraham
  • You might take a look at the `iotools` package, which has some features that may help in this case. – lmo Jul 06 '16 at 16:08
  • Check `memory.limit()` first to see how much memory is actually available to R. – Psidom Jul 06 '16 at 16:09
  • @Psidom, I did, and it was the full 64GB. Thanks. – Avraham Jul 06 '16 at 16:15
  • From that link explaining "committed size", it indeed sounds like you simply ran out of actual RAM (RAM that has been promised to R but not yet filled with actual bits). Given that your 27GB file expands to a larger size in memory, I'm guessing you have a lot of small integers in there? One hack would be to read those columns in as strings, and then only convert them to integers as necessary in whatever operations you do. – eddi Jul 06 '16 at 16:49
  • Have you tried "chunking" the `fread()`? That is, looping over the file, reading, say, 50,000 lines at a time and calling `gc()` at the end of each chunk read? The `nrows` and `skip` parameters should help here (a sketch of this approach appears after these comments). – Jason Oct 06 '16 at 23:16
  • I know this issue, not with fread, but with other functions. I don't know if this also applies to a Windows machine, but on Unix you can start your R session from the console specifying --max-mem-size and --max-vsize. Here is some more information on that: http://answers.stat.ucla.edu/groups/answers/wiki/0e7c9/Increasing_Memory_in_R.html –  Oct 11 '16 at 10:23
  • @eddi You're probably right. As simple as it may be, please post your comment as an answer so I can give you credit :). – Avraham Jan 06 '17 at 02:36
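A minimal sketch of the chunked-read idea from the comments above, assuming a placeholder file name, an approximate row count, and a placeholder `user_id` column for the per-chunk summary:

    library(data.table)

    n_total    <- 226e6          # approximate total number of data rows
    chunk_size <- 5e6
    n_chunks   <- ceiling(n_total / chunk_size)
    header     <- names(fread("big_file.csv", nrows = 1))

    pieces <- vector("list", n_chunks)
    for (i in seq_len(n_chunks)) {
      chunk <- fread("big_file.csv",
                     skip   = 1 + (i - 1) * chunk_size,  # +1 skips the header row
                     nrows  = chunk_size,
                     header = FALSE)
      setnames(chunk, header)
      # Reduce each chunk to a small summary before keeping it
      pieces[[i]] <- chunk[, .N, by = user_id]
      rm(chunk); gc()            # release the chunk, as suggested in the comments
    }
    counts <- rbindlist(pieces)[, .(N = sum(N)), by = user_id]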

1 Answer


You're running out of memory because some kinds of data take up less space as plain text than they do in memory (the opposite can also be true). The classic example is single-digit integers (0-9): each occupies a single byte in a text file but 4 bytes as an integer in R's memory (conversely, numbers with more than four digits occupy less memory as R integers than as the corresponding text characters).

One workaround for this is to read those columns in as character instead, which keeps their memory footprint roughly the same as in the file, and to convert them to integers only when doing numeric operations on them. The trade-off will naturally be speed.
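A hedged illustration of that workaround, using placeholder file and column names and fread's named-list form of `colClasses` to force a character read:

    library(data.table)

    # Force the troublesome small-integer columns to be read as character;
    # "flag" and "user_id" are placeholder column names.
    dt <- fread("big_file.csv",
                colClasses = list(character = c("flag", "user_id")))

    # Convert on the fly only when a numeric operation actually needs it:
    dt[, .(mean_flag = mean(as.integer(flag))), by = user_id]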

– eddi