
I am trying to read a large (~700 MB) .csv file into R.

The file contains an array of integers less than 256, with a header row and 2 header columns.

I use:

trainSet <- read.csv(trainFileName)

This eventually barfs with:

Loading Data...
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 145 Kb
Execution halted

Looking at memory usage, it conks out at about 3 GB on a 6 GB machine, with zero page-file usage at the time of the crash, so there may be another way to fix it.

If I use:

trainSet <- read.csv(trainFileName, header=TRUE, nrows=100)
classes <- sapply(trainSet, class)

I can see that all the columns are being loaded as "integer", which I think is 32 bits.

Clearly, using 3 GB to load part of a 700 MB .csv file is far from efficient. I wonder if there's a way to tell R to use 8-bit numbers for the columns? This is what I've done in the past in Matlab, and it worked a treat; however, I can't find any mention of an 8-bit type in R.

Does it exist? And how would I tell read.csv to use it?

Thanks in advance for any help.

Stephan
  • Possible solutions: a) get more RAM; b) put the data into a database and read chunks as needed using `RMySQL` (see the sketch after these comments); c) quickly spin up an EC2 instance with large memory and run it there. – Maiasaura Sep 04 '12 at 21:11
  • @Maiasaura: or just heed the warnings in the docs. ;-) – Joshua Ulrich Sep 04 '12 at 23:37
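
A rough sketch of the chunked-database idea from the first comment; the database name, table name, and chunk size are placeholders, and it assumes the CSV has already been loaded into a MySQL table:

library(DBI)
library(RMySQL)

## stream the table in 50,000-row chunks instead of reading it all at once
con <- dbConnect(MySQL(), dbname = "mydb")
res <- dbSendQuery(con, "SELECT * FROM train")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 50000)
  ## ... process chunk ...
}
dbClearResult(res)
dbDisconnect(con)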

3 Answers


The narrow answer is that the add-on package ff allows you to use a more compact representation.

The downside is that the different representation prevents you from passing the data to standard functions.

So you may need to rethink your approach: maybe sub-sampling the data, or getting more RAM.
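
A minimal sketch of that route, assuming the trainFileName and all-integer layout from the question (the chunk sizes are arbitrary):

library(ff)

## read the csv in chunks into a disk-backed ffdf instead of an
## in-memory data.frame
trainSet <- read.csv.ffdf(file = trainFileName, header = TRUE,
                          first.rows = 10000, next.rows = 50000)

## illustrative only: ff's "ubyte" vmode stores unsigned 8-bit values
## (0..255), the compact representation asked about in the question
small <- ff(sample(0:255, 1e6, replace = TRUE), vmode = "ubyte")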

Dirk Eddelbuettel
  • Will check out ff, thanks. Would getting more RAM really help, given it currently conks out before the pagefile starts getting used? – Stephan Sep 04 '12 at 21:14
  • Yes, more RAM is typically the simplest and most efficient hardware fix. – Dirk Eddelbuettel Sep 04 '12 at 21:15
  • Use read.csv.ffdf from package ff. For additional basic data manipulations on package ff, you can use package ffbase. See how far you can get from there before you start to do any real statistics. –  Sep 04 '12 at 21:21
  • If anyone else considers FF for such a task, it's worth noting that FF is rather restricted as to the number of columns it will support - on my machine this seems to be 250. More info here: [link](http://stackoverflow.com/questions/12283203/any-ideas-on-how-to-debug-this-ff-error) – Stephan Sep 06 '12 at 08:28

Q: Can you tell R to use 8 bit numbers

A: No. (Edit: See Dirk's comments below. He's smarter than I am.)

Q: Will more RAM help?

A: Maybe. Assuming a 64 bit OS and a 64 bit instance of R are the starting point, then "Yes", otherwise "No".

Implicit question A: Will a .csv dataset that is 700 MB be 700 MB when read in by read.csv?

A: Maybe. If it really is all integers, it may be smaller or larger. It's going to take 4 bytes for each integer, and if most of your integers were in the range of -9 to 10 they might actually "expand" in size when stored as 4 bytes each. At the moment you are only using 1-3 bytes per value, so you would expect roughly a 50% increase in size. You would want to use colClasses="integer" in the read function. Otherwise they may get stored as factor or as 8-byte "numeric" if there are any data-input glitches.
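
A quick illustrative check of the 4-byte integer versus 8-byte numeric storage difference:

x <- sample(0:255, 1e6, replace = TRUE)          # one million small integers
print(object.size(x), units = "Mb")              # ~3.8 Mb: 4 bytes per value
print(object.size(as.numeric(x)), units = "Mb")  # ~7.6 Mb: 8 bytes as "numeric"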

Implicit question B: If you get the data into the workspace, will you be able to work with it?

A: Only maybe. You need at a minimum three times as much memory as your largest objects, because of the way R copies on assignment, even if it is a copy to its own name.
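
tracemem illustrates this copy-on-modify behaviour directly:

x <- runif(1e7)   # ~76 Mb of doubles
tracemem(x)       # report whenever this vector is duplicated
y <- x            # no copy yet: x and y share the same memory
y[1] <- 0         # modifying y forces a full duplicate, briefly doubling usage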

IRTFM
  • Have a look at packages `bit` as well as `ff`; I think your point one here is not entirely correct unless you consider only base R. – Dirk Eddelbuettel Sep 04 '12 at 21:44
  • If I am reading their summary pages correctly, neither of those packages supports in-memory operations on 8-bit integers. "ff" provides disk-based operations and "bit" provides logical (1-bit) operations. – IRTFM Sep 04 '12 at 22:02
  • That's not what I recall from previous discussions with Jens, but I may well be wrong... – Dirk Eddelbuettel Sep 04 '12 at 22:03
  • Would it be easier to use your Cpp interface and do those 8-bit operations at the C level? – IRTFM Sep 04 '12 at 22:06
  • I had `bit` on my box; the very first example starts `x <- bit(12); x` to autoprint the 12 bits -- stating it is stored in 'only 1 integers' (sic). The whole point is storage efficiency. – Dirk Eddelbuettel Sep 04 '12 at 22:06
  • (It's called Rcpp.) I think `bit` does that as it comes with shared library. – Dirk Eddelbuettel Sep 04 '12 at 22:07

Not trying to be snarky, but the way to fix this is documented in ?read.csv:

 These functions can use a surprising amount of memory when reading
 large files.  There is extensive discussion in the ‘R Data
 Import/Export’ manual, supplementing the notes here.

 Less memory will be used if ‘colClasses’ is specified as one of
 the six atomic vector classes.  This can be particularly so when
 reading a column that takes many distinct numeric values, as
 storing each distinct value as a character string can take up to
 14 times as much memory as storing it as an integer.

 Using ‘nrows’, even as a mild over-estimate, will help memory
 usage.

This example takes a while to run because of I/O, even with my SSD, but there are no memory issues:

R> # In one R session
R> x <- matrix(sample(256,2e8,TRUE),ncol=2)  # 2e8 integers in 2 columns = 1e8 rows
R> write.csv(x,"700mb.csv",row.names=FALSE)

R> # In a new R session
R> x <- read.csv("700mb.csv", colClasses=c("integer","integer"),
+ header=TRUE, nrows=1e8)
R> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    173632   9.3     350000   18.7    350000   18.7
Vcells 100276451 765.1  221142070 1687.2 200277306 1528.0
R> # Max memory used ~1.5Gb
R> print(object.size(x), units="Mb")
762.9 Mb
Joshua Ulrich
  • Yeah - I've already tried specifying the colClasses as integer and it didn't seem to change anything. Still need to figure out why I'm out of memory when I have 3Gb of unused RAM, but for now I have plenty of avenues for exploration. – Stephan Sep 05 '12 at 06:00
  • @Stephan: note that I didn't *just* specify `colClasses`. Specifying `nrows` helps because of this portion of the Note section in `?scan` (`read.table` and `read.csv` use `scan`), "If number of items is not specified, the internal mechanism re-allocates memory in powers of two and so could use up to three times as much memory as needed." – Joshua Ulrich Sep 05 '12 at 09:59
  • aha. That makes a lot of sense now - certainly explains why it was running out of memory with so much free RAM. Will give it a go after I've finished convincing myself FF is more pain than I can cope with. Thanks again. – Stephan Sep 05 '12 at 13:45