
I'm working on a 64-bit Windows Server 2008 machine with an Intel Xeon processor and 24 GB of RAM. I'm having trouble reading a particular 11 GB TSV (tab-delimited) file (>24 million rows, 20 columns). My usual companion, read.table, has failed me. I'm currently trying the ff package, through this procedure:

> df <- read.delim.ffdf(file       = "data.tsv",
+                       header     = TRUE,
+                       VERBOSE    = TRUE,
+                       first.rows = 1e3,
+                       next.rows  = 1e6,
+                       na.strings = c("", NA),
+                       colClasses = c("NUMERO_PROCESSO" = "factor"))

This works fine for about 6 million records, but then I get an error, as you can see:

read.table.ffdf 1..1000 (1000) csv-read=0.14sec ffdf-write=0.2sec
read.table.ffdf 1001..1001000 (1000000) csv-read=240.92sec ffdf-write=67.32sec
read.table.ffdf 1001001..2001000 (1000000) csv-read=179.15sec ffdf-write=94.13sec
read.table.ffdf 2001001..3001000 (1000000) csv-read=792.36sec ffdf-write=68.89sec
read.table.ffdf 3001001..4001000 (1000000) csv-read=192.57sec ffdf-write=83.26sec
read.table.ffdf 4001001..5001000 (1000000) csv-read=187.23sec ffdf-write=78.45sec
read.table.ffdf 5001001..6001000 (1000000) csv-read=193.91sec ffdf-write=94.01sec
read.table.ffdf 6001001..
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

If I'm not mistaken, R is complaining about a lack of memory while reading the data, but wasn't the read...ffdf procedure supposed to circumvent heavy memory usage when reading data? What could I be doing wrong here?

Waldir Leoncio
  • See some answers in this thread that might be of help: http://stackoverflow.com/a/1820610/3897439 (especially `fread` from the `data.table` package and the `sqldf` package). I know your issue isn't time efficiency, but maybe one of these methods will also avoid the memory issue. – Cotton.Rockwood Aug 09 '14 at 03:28
  • Also, if you are reading this data in the middle of more code, this might help identify where your memory usage is: http://stackoverflow.com/q/1358003/3897439 – Cotton.Rockwood Aug 09 '14 at 03:30
  • @Cotton.Rockwood: Thanks for the help. Unfortunately, `fread` has never worked with the data frames that I work with, since they always include one character column that is full of embedded quotes (the function help page itself acknowledges the issue). As to the `sqldf` alternative your link points to directly, I can't seem to make it work because apparently it's looking for a comma or semi-colon as a separator, and my file uses tabs. – Waldir Leoncio Aug 11 '14 at 18:54

1 Answer


(I realize this is an old question, but I had the same problem and spent two days looking for the solution. This seems as good a place as any to document what I eventually figured out for posterity.)

The problem isn't that you are running out of available memory. The problem is that you've hit the memory limit for a single string. From help('Memory-limits'):

There are also limits on individual objects. The storage space cannot exceed the address limit, and if you try to exceed that limit, the error message begins cannot allocate vector of length. The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.
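
That limit matches the error above: the 2048 Mb that R_AllocStringBuffer could not allocate is 2^31 bytes, i.e. one byte more than the 2^31 - 1 cap on a single character string. A quick sanity check in R:

2048 * 1024^2          # bytes in the failed allocation: 2147483648 = 2^31
.Machine$integer.max   # 2147483647 = 2^31 - 1, the per-string byte limit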

In my case (and, it appears, yours as well) I didn't bother to set the quote character, since I was dealing with tab-separated data and assumed it didn't matter. However, somewhere in the middle of the data set there was a string with an unmatched quote, and read.table happily ran right past the end of the line and on to the next, and the next, and the next... until it hit the limit for the size of a string and blew up.
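
For illustration, here is a tiny, made-up tab-separated example (the column names and values are invented, not taken from the original data) that should reproduce the failure mode on a toy scale:

# Row 2 contains a stray, unmatched double quote.
tsv <- "id\ttext\n1\tfine\n2\t5\" nails\n3\talso fine\n"

# With read.delim's default quoting, the stray quote opens a quoted section
# that swallows the following separators and newlines (R typically warns
# about an EOF within a quoted string), so row 3 is absorbed into row 2:
nrow(read.delim(text = tsv))              # should give 2

# With quoting disabled, each line is parsed as its own record:
nrow(read.delim(text = tsv, quote = ""))  # should give 3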

The solution was to explicitly set quote = "" in the argument list.
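
Applied to the call from the question, that is a one-argument change. This is an untested sketch: everything except quote is copied from the question, and read.delim.ffdf should pass quote through to the underlying read.delim just like na.strings and colClasses:

library(ff)  # provides read.delim.ffdf

df <- read.delim.ffdf(file       = "data.tsv",
                      header     = TRUE,
                      VERBOSE    = TRUE,
                      first.rows = 1e3,
                      next.rows  = 1e6,
                      na.strings = c("", NA),
                      quote      = "",   # disable quote handling entirely
                      colClasses = c("NUMERO_PROCESSO" = "factor"))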

Drew