
I'm running R on Linux (Kubuntu Trusty). I have a CSV file that's nearly 400MB and contains mostly numeric values:

$ ls -lah combined_df.csv 
-rw-rw-r-- 1 naught101 naught101 397M Jun 10 15:25 combined_df.csv

I start R and run `df <- read.csv('combined_df.csv')` (I get a 1246536x25 dataframe: 3 int columns, 3 logi, 1 factor, and 18 numeric), then use the script from here to check memory usage:

R> .ls.objects()
         Type  Size    Rows Columns
df data.frame 231.4 1246536      25

Bit odd that it's reporting less memory than the file takes up on disk, but I guess that's just because CSV isn't an efficient storage format for numeric data.
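
As a rough sanity check (assuming most of the 25 columns end up as 8-byte doubles), a back-of-the-envelope estimate is in the same ballpark as what `.ls.objects()` reports:

# ~1.25M rows x 25 columns of mostly 8-byte doubles; the int/logical/factor
# columns take 4 bytes each, so this is a slight overcount.
1246536 * 25 * 8 / 1024^2  # ~238 MB, vs the 231.4 reported above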

But when I check the system memory usage, top says that R is using 20% of my available 8GB of RAM, and ps reports much the same:

$ ps aux|grep R
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
naught1+ 32364  5.6 20.4 1738664 1656184 pts/1 S+   09:47   2:42 /usr/lib/R/bin/exec/R

1.7GB of RAM for a 397MB data set. That seems excessive. I know that ps isn't necessarily an accurate way of measuring memory usage, but surely it isn't out by a factor of 5?! Why does R use so much memory?

Also, R seems to report something similar in gc()'s output:

R> gc()
           used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells   497414  26.6    9091084 485.6  13354239 713.2
Vcells 36995093 282.3  103130536 786.9 128783476 982.6
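
Adding up the two "max used" columns gives roughly the same figure as the RSS that ps reports:

# Peak R heap is roughly max-used Ncells + max-used Vcells (both columns are already in MB);
# that's close to the ~1.6GB RSS shown above.
713.2 + 982.6
# [1] 1695.8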
naught101
  • You have no other objects loaded? Have you tried restarting your R session? It may be there is a memory leak somewhere. – stanekam Jul 07 '14 at 01:23
  • @iShouldUseAName: This is with a fresh R session, with no other objects loaded. – naught101 Jul 07 '14 at 01:30
  • When R takes memory from the system, it rarely gives it back, even if it is no longer using all that memory. – JeremyS Jul 07 '14 at 01:39
  • @JeremyS: but this is a clean session. Are you saying that that extra ~1.3GB of memory is due to memory used in the `read.csv()` function? e.g. `read.csv()` doesn't do garbage collection well? – naught101 Jul 07 '14 at 01:56
  • Have you employed both of the pieces of advice in `?read.csv` regarding reducing memory usage? There's a whole section on memory usage there that explicitly states: "These functions can use a surprising amount of memory when reading large files." – joran Jul 07 '14 at 02:17
  • @JeremyS: that's entirely untrue, and is the entire point of garbage collection. This question is answered in [FAQ 7.42](http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-is-R-apparently-not-releasing-memory_003f). – Joshua Ulrich Jul 07 '14 at 02:20
  • @joran: YES! `a <- read.csv('combined_df.csv', nrows=1500000, colClasses=unlist(lapply(read.csv('combined_df.csv', nrows=1), class)))` *significantly* reduces overall memory usage (by a factor of 3-4). If you want to post an answer along those lines, I will accept it. I will also change the question title to reflect the problem. – naught101 Jul 07 '14 at 03:03
  • @JeremyS: so I take it `ps` *is* effectively out by a factor of 5 then... But does that mean that memory that's allocated by the system but no longer used by R will be re-used by R first, before more is allocated? – naught101 Jul 07 '14 at 03:06
  • `gc()` doesn't always work like you expect. http://adv-r.had.co.nz/memory.html#gc – JeremyS Jul 07 '14 at 03:12

2 Answers


As noted in my comment above, there is a section in the documentation for `?read.csv` entitled "Memory Usage" that warns that anything based on `read.table` may use a "surprising" amount of memory, and recommends two things:

  1. Specify the type of each column using the `colClasses` argument, and
  2. Specify `nrows`, even as a "mild overestimate".
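
For example, a sketch of both suggestions combined, along the lines of what the OP reports trying in the comments (the `nrows` value is a mild overestimate of the ~1.25 million rows in the file):

# Read one row to learn the column classes, then re-read the whole file with
# colClasses fixed and nrows given as a mild overestimate.
# (Classing from a single row can guess wrong; sampling a few hundred rows is safer.)
first_row <- read.csv('combined_df.csv', nrows = 1)
df <- read.csv('combined_df.csv',
               colClasses = sapply(first_row, class),
               nrows = 1500000)

The OP reports that this cut overall memory usage by a factor of 3-4.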
joran

Not sure if you just want to know how R works or if you want an alternative to `read.csv`, but try `fread` from `data.table`; it is much faster, and I assume it uses much less memory:

library(data.table)
dfr <- as.data.frame(fread("somecsvfile.csv"))
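
As a small follow-up (assuming a reasonably recent version of data.table): `setDF()` converts the result in place, avoiding the copy that `as.data.frame()` can make, which may matter given the memory concern here:

library(data.table)
dt <- fread("somecsvfile.csv")  # fread detects column classes by sampling rows
setDF(dt)                       # convert to a plain data.frame by reference, without copying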
Remko Duursma