I am trying to load a large CSV file (226M rows by 38 columns) into 64-bit R using the data.table package. The file is about 27 GB on disk, and I am doing this on a server with 64 GB of RAM. I shut down nearly everything else and started a fresh R/RStudio session, so only about 2 GB of memory are in use when I start the fread. As the read progresses, I see memory usage climb to about 45.6 GB, and then I get the dreaded Error: cannot allocate vector of size 1.7 Gb, even though more than 18 GB remain available. Is it possible that within those 18 GB of RAM there isn't a single contiguous block of 1.7 GB? Does it have to do with the committed size (which I admit I don't fully understand), and if so, is there any way to minimize the committed size so that enough space remains to complete the read?
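For reference, here is a minimal sketch of what I am running; the file name is a hypothetical stand-in for my actual path. memory.limit() (Windows only) confirms R can address the full 64 GB, and verbose = TRUE makes fread print its progress and allocation details up to the point of failure.

library(data.table)

# Windows only: confirm R's address-space limit covers the full 64 GB of RAM
memory.limit()

# Hypothetical file name; verbose = TRUE prints fread's progress and allocation details
DT <- fread("users.csv", verbose = TRUE)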
The file contains the history of a cohort of users, for whom I want to aggregate and summarize certain statistics over time. I've been able to import a subset of the 38 columns using the select argument of fread, so I'm not at a complete loss, but it does mean that if I need to work with other variables I'll have to pick and choose, and I may eventually run into the same error.
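This is roughly what the column-subset workaround looks like; the column names below are hypothetical stand-ins for the ones I actually need.

library(data.table)

# Hypothetical column names; only these columns are parsed and allocated
cols <- c("user_id", "event_date", "event_type", "value")
DT_sub <- fread("users.csv", select = cols)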
Given this setup, are there other ways to get the entire dataset into memory, or will I need to either keep importing subsets or move to a big-data-friendly platform?
Thank you.
[Screenshot: memory usage prior to read]
[Screenshot: memory usage at failure]
Session Info
R version 3.3.0 Patched (2016-05-11 r70599)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6
loaded via a namespace (and not attached):
[1] tools_3.3.0 chron_2.3-47