
I am a newbie to R, but I am aware that it chokes on "big" files. I am trying to read a 200 MB data file. I have tried it in CSV format and also converted it to tab-delimited txt, but in both cases I use up my 4 GB of RAM before the file loads.

Is it normal that R would use 4 GB of memory to load a 200 MB file, or could there be something wrong with the file that is causing R to keep reading a bunch of nothingness in addition to the data?

Oliver
  • 1) 200MB is nowhere near "big". 2) What function are you using to read the file (and did you read the help page for that function)? 3) What sort of data are in the file? 4) Did you search SO (I found several relevant questions/answers by searching for "[r] large file")? 5) Did you read [R Data Import/Export](http://cran.r-project.org/doc/manuals/R-data.html)? – Joshua Ulrich Jan 12 '13 at 14:25
  • And provide us with a reproducible piece of code that shows your problem. A 200 MB CSV file should not normally take 4 GB. – Paul Hiemstra Jan 12 '13 at 14:29
  • I tried using the basic ways to read the file, as it is all I know: `read.table("myfile.csv", header=TRUE)`, or for the txt version I used `read.table("myfile.txt", sep = "\t", header=TRUE)`. There are 200+ columns which mostly have single letters or small numbers. Tragically, there is a big group of variables in the middle which are sparsely populated. – Oliver Jan 12 '13 at 14:50
  • Could the length of the variable names be gumming things up? Some of them are comically long; I guess they thought they were being very descriptive. – Oliver Jan 12 '13 at 14:53
  • @Oliver If you are not experienced with `read.table`/`scan`/... and other R functions to read/write data, I think you should first take the time to test them on small files, and once you understand the basics, try to read your "huge" file. – agstudy Jan 12 '13 at 14:58

1 Answer


From `?read.table`:

> Less memory will be used if colClasses is specified as one of the six atomic vector classes.
>
> ...
>
> Using nrows, even as a mild over-estimate, will help memory usage.

Use both of these arguments.
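For concreteness, here is a minimal sketch of what that call could look like. The file name is taken from the comments above, but the separator, the column classes, and the row count are placeholders you would replace with the real layout of your data:

```r
# Hypothetical layout: an integer id, two numeric measurements, and a
# single-letter code column. Adjust colClasses to match your actual file.
df <- read.table(
  "myfile.csv",
  header       = TRUE,
  sep          = ",",
  colClasses   = c("integer", "numeric", "numeric", "character"),
  nrows        = 1100000,   # mild over-estimate of the true row count
  comment.char = ""         # no comment scanning also speeds up parsing
)
```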

Ensure that you properly specify numeric for your numeric data. See here: Specifying colClasses in the read.csv

And do not under-estimate nrows.
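If you're not sure of the row count, one option (my suggestion, not something from the help page) is to count the lines in a cheap first pass, for example with `count.fields`:

```r
# count.fields() reads the file once and reports the number of fields per
# line; its length is the number of lines, including the header.
n <- length(count.fields("myfile.csv", sep = ",")) - 1  # subtract the header

df <- read.table("myfile.csv", header = TRUE, sep = ",", nrows = n)
```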

If you're running 64-bit R, you might try the 32-bit version. It will use less memory to hold the same data.

See here also: Extend memory size limit in R
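If you're on Windows, `memory.limit()` is the usual lever for that; a rough sketch (these calls are Windows-specific, and on other platforms they simply return Inf with a warning):

```r
# Windows-only: report the current per-session memory cap (in MB) and,
# if the default is lower than your installed RAM, request a larger one.
memory.limit()
memory.limit(size = 4000)  # ask for ~4 GB
```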

Matthew Lundberg
  • Why do you mention `nrows`, but not `colClasses`? Storing numbers as strings is very inefficient and calling `type.convert` causes unnecessary duplication. And how do you know that `read.table` is the best function to use? The OP didn't even tell you what their data looks like (e.g. `scan` could be a better solution if it's only a matrix). – Joshua Ulrich Jan 12 '13 at 14:28
  • @Oliver Determine your column classes (see the linked question) and include that argument. If that doesn't do the job, also use `nrows`. The difference between tab- and comma-separated values is not going to matter. – Matthew Lundberg Jan 12 '13 at 14:48
  • Thanks, I will look into nrows and colClasses. – Oliver Jan 12 '13 at 15:08