
The easy answer to this is "buy more RAM" but I am hoping to get a more constructive answer and learn something in the process.

I am running Windows 7 64-bit with 8GB of RAM.

I have several very large .csv.gz files (~450MB uncompressed) with the same exact header information that I read into R and perform some processing on. Then, I need to combine the processed R objects into a single master object and write back out to .csv on disk.

I do this same operation on multiple sets of files. As an example, I have 5 folders each with 6 csv.gz files in them. I need to end up with 5 master files, one for each folder.

My code looks something like the following:

library(data.table)

for ( loop through folders ) {
    master.file = data.table()

    for ( loop through files ) {
        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        # append file to the running master.file
        if ( nrow(master.file) == 0 ) {
            master.file = file
        } else {
            master.file = rbindlist( list( master.file, file) )
        }
        rm( file, filename )
        gc()
    }

    write.csv( master.file, unique master filename, row.names = FALSE )

    rm( master.file )
    gc()

}

This code does not work. I get a "cannot allocate memory" error before it writes out the final csv. I was watching Resource Monitor while running this code and don't understand why it would need 8GB of RAM to do this processing. The total of all the uncompressed file sizes is roughly 2.7GB, so I was expecting that the maximum memory R would use is about 2.7GB. But the write.csv operation seems to use about the same amount of memory as the object being written, so if you have a 2.7GB object in memory and try to write it out, you end up using roughly 5.4GB of memory.
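
One way to confirm this from inside R, rather than from Resource Monitor, is gc()'s "max used" accounting around the write. A rough sketch (the object name big and the file name out.csv are just placeholders, not from my actual code):

gc(reset = TRUE)                        # reset the "max used" columns
print(object.size(big), units = "Mb")   # size of the object itself
write.csv(big, "out.csv", row.names = FALSE)
gc()                                    # "max used" now shows the peak hit during the write

If the peak reported after write.csv is roughly twice the object size, that supports the idea that the write itself temporarily duplicates the data.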

This apparent doubling, combined with a for loop in which memory doesn't seem to be getting adequately freed up, appears to be the problem.

I suspect that I could use the sqldf package, as mentioned here and here, but when I assigned the result of the sqldf statement to an R variable I ended up with the same out-of-memory errors.
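
For reference, one way to read a csv via sqldf looks roughly like this (just a sketch: the file name is a placeholder, read.csv.sql works on an uncompressed .csv so the .gz would need decompressing first, and assigning the result to an R variable still materialises the whole table in RAM, which is where I hit the same error):

library(sqldf)

# Read the csv through SQLite rather than directly with read.csv;
# in the default query the table is named "file".
file = read.csv.sql("some_file.csv", sql = "select * from file")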

  • The general rule is that you should have 3x the size of your largest object. (So you've already violated that rule.) Furthermore, you may have 8GB of RAM but you need to subtract the RAM used by the OS and other applications and background utilities. – IRTFM Dec 20 '13 at 23:31
  • Why don't you use `write.table(yourFirstDataTable, sep = ",", file = YourFile.csv)` for the first file read in and processed, and then `write.table(yourOtherDataTables, sep = ",", file = YourFile.csv, append = TRUE, col.names = FALSE)`? – A5C1D2H2I1M1N2O1R2T1 Dec 21 '13 at 01:34
  • @AnandaMahto that is an excellent suggestion! I forgot that there was an append option to write.table. write.csv disables that option but with write.table, I don't have to append the data in R, I can just append each new object to the file on disk. – Brian D Dec 24 '13 at 01:18

1 Answer


Update 12/23/2013 - The following solution works entirely in R without running out of memory (thanks @AnandaMahto).
The major caveat with this method is that you must be absolutely sure that the files you are reading in and writing out each time have exactly the same header columns, in exactly the same order, or your R processing code must ensure this, since write.table does not check it for you.

for( loop through folders ){

    for ( loop through files ) {

        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        # append file to the running master.file
        if ( first time through inner loop) {
            write.table(file, 
                        "masterfile.csv", 
                        sep = ",", 
                        dec = ".", 
                        qmethod = "double", 
                        row.names = FALSE)
        } else {
            write.table(file,
                        "masterfile.csv",
                        sep = ",",
                        dec = ".",
                        qmethod = "double",
                        row.names = FALSE,
                        append = TRUE,
                        col.names = FALSE)
        }
        rm( file, filename )
        gc()
    }
    gc()
}
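
A small refinement: the two write.table() branches differ only in append and col.names, so they can be collapsed into a single call driven by a flag. A sketch of just the write step, assuming a variable first.file set to TRUE before the inner loop (the variable name and "masterfile.csv" are placeholders, as above):

# Inside the inner loop, replacing the if/else above.
write.table(file,
            "masterfile.csv",
            sep = ",",
            dec = ".",
            qmethod = "double",
            row.names = FALSE,
            append = !first.file,    # append on every pass after the first
            col.names = first.file)  # write the header only on the first pass
first.file = FALSE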

My Initial Solution:

for( loop through folders ){

    for ( loop through files ) {
        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        #write out the file
        write.csv( file, ... )
        rm( file, filename )
        gc()
    }        
    gc()
}

I then downloaded and installed GnuWin32's sed package and used Windows command line tools to append the files as follows:

copy /b *common_pattern*.csv master_file.csv

This appends together all of the individual .csv files whose names have the text pattern "common_pattern" in them, headers and all.

Then I use sed.exe to remove all but the first header line as follows:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -i 2,${/header_pattern/d;} master_file.csv

-i tells sed to just overwrite the specified file (in-place).

2,$ tells sed to look at the range from the 2nd row to the last row ($).

{/header_pattern/d;} tells sed to find all lines in that range containing the text "header_pattern" and delete them (the d command).

In order to make sure this was doing what I wanted it to do, I first printed the lines I was planning to delete.

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -n 2,${/header_pattern/p;} master_file.csv

Works like a charm, I just wish I could do it all in R.

  • I only skimmed the question and answer so far ... how about `fread`? – Matt Dowle Dec 20 '13 at 23:19
  • I looked up `fread` and it had a note that it "wasn't for production use yet" so I passed on it. – Brian D Dec 21 '13 at 01:12
  • `fread()` works fine in my experience. Make sure you have virtual memory turned on with something like `memory.limit(size = 6 * 8192)`. This is slow but works. (And add 'buy an SSD' to your 'buy RAM' list ;)). – Peter Dec 22 '13 at 12:28
  • FYI, calling `gc()` explicitly [achieves nothing](http://adv-r.had.co.nz/memory.html#garbarge-collection) – hadley Dec 22 '13 at 16:56
  • @Peter How much RAM do you have and what size file are you loading? I suspect it's quite different to the numbers in the question but your comment is quite general about fread? – Matt Dowle Dec 23 '13 at 08:57
  • @BrianD I'd try fread in this case. Nothing to lose. The note about non-production use is to do with building code dependencies in your code (arguments may still possibly change in future). It isn't unstable in the non-working sense. – Matt Dowle Dec 23 '13 at 09:02
  • @hadley 'Achieves nothing' seems a little strong. Never, ever? How have you measured what `gc()` has done? You refer in that article 'despite what you have read elsewhere' but don't go into detail - are they less intelligent than you or simply all mistaken? There are different levels of garbage collection I believe, a deeper one is triggered every 20 (iirc) - were you aware of that? Are you absolutely sure about what you write? – Matt Dowle Dec 23 '13 at 09:41
  • @MattDowle My comment about `fread()` was that it has always worked fine for me, and is not linked to the amount of RAM. My problem was similar to that in the question, just a bit larger (25 tab-separated files, 220-450MB uncompressed, ~ 9GB total, also on Win 7 64-bit, on an Intel Core 2 Quad Q6600 with 8GB RAM). In his case probably 6*RAM is overkill by far (I just copy&pasted, and I need more RAM because I was reshaping and doing other stuff after the importing). I did everything in R as in my case the files had different column order. – Peter Dec 23 '13 at 10:12
  • @hadley I do not know how `gc()` works exactly (will read what you linked to) but my informal impression was that it was never triggered automatically. Code where I used it manually after every RAM-intensive operation did seem to work faster. Maybe because of the virtual memory I had to use? Too bad I moved to a machine with 32GB of RAM and cannot provide exact tests to support this... – Peter Dec 23 '13 at 10:15
  • @Peter gc() is always triggered automatically when needed. Otherwise R would run out of memory very very quickly. If you're seeing any differences, it's probably the placebo effect. – hadley Dec 23 '13 at 22:01
  • @MattDowle in this case I think the burden of proof is on the accuser: you need to provide evidence that R's garbage collector is broken and needs to be run manually. – hadley Dec 23 '13 at 22:03
  • @hadley The manual `?gc` says that it triggers a collect and also that it can be useful to call manually. Are you saying the manual is wrong? If so, how do you suggest it be improved? – Matt Dowle Dec 24 '13 at 09:02
  • @Peter If you were loading 9GB into 8GB then of course you'll have been swapping and that's what lead you to recommend SSD. That's nothing to do with `fread`. R runs within RAM. If you go larger than RAM then you'll swap and that's slow. Your original comment seems to be about R (and any in-RAM software) nothing to do with `fread`. `fread` is intended to load data into RAM. It doesn't help for data larger than RAM. RAM is very cheap; e.g. some `data.table` users have 512GB of RAM (0.5TB). Glad you have now quadrupled to 32GB. – Matt Dowle Dec 24 '13 at 09:19
  • @MattDowle Yes, of course you are absolutely correct, and I am aware of what you write. The 'slow' part referred to using virtual memory, not to `fread`. Maybe I should have posted the virtual memory part under the question, not under the solution. Unfortunately I cannot edit, just delete it now (and will do so if you want me to). `data.table` has saved me a lot of time, a big "thank you" to you! The SSD recommendation was an intermediate solution, my old PC already had the max supported 8GB of RAM, and it was cheaper to get an SSD than a whole new PC (which I eventually could afford/justify). – Peter Dec 24 '13 at 11:07
  • @MattDowle It always says "This will also take place automatically without user intervention, and the primary purpose of calling gc is for the report on memory usage", and calling it only "may prompt R to return memory to the operating system" (which is a pretty weak claim). So I think the help is balanced. – hadley Dec 24 '13 at 14:13
  • @hadley [This](https://dl.dropboxusercontent.com/u/74885015/test_gc_hadley.R) seems to run noticeably faster with manual `gc()`. Just tested on a virtual machine (Win XP 32 bit, 512MB RAM + 3GB swap file). The number of lines `fread` reads is important to generate the different times. I do not know whether due to R- or to OS-specific stuff. Could be I am doing something wrong (the code is far from pretty and efficient) but does not look like placebo. Maybe suboptimal threshold to run `gc()` in a low-RAM, high-swap-file scenario? – Peter Dec 24 '13 at 14:25
  • @Hadley If you're claiming that the manual for `?gc` makes a weak claim then it's for you to show that. From what I can see, you've just agreed that calling `gc()` _may do something sometimes_. This is all that's needed to counter your (strong) statement that "calling gc() explicitly achieves nothing". Since you're a scientist I'm challenging your scientific reasoning in this case. You haven't answered my question as to whether you were aware of the different levels of garbage collection and that a deeper collect happens every 20 (iirc). Do you have any support from anyone else for your claim? – Matt Dowle Dec 24 '13 at 16:49
  • @Peter Interesting. In R internals the "large vector heap" is different to small vectors (iirc something like 4k bytes, but don't quote me). I know very little about this area but this may explain why you noticed it depends on the number of rows in the file; i.e., so that vectors (the columns) are big enough to be allocated on the large vector heap. Further settings to fine tune the garbage collection trigger point are in `?Memory`. – Matt Dowle Dec 24 '13 at 16:59
  • @BrianD To avoid copy&paste of code (the `write.table()` part is almost identical) you can define a variable `if(first time through inner loop) append.to.file<-TRUE else append.to.file<-FALSE` and call `write.table(append=append.to.file,...)`. Easier to read, and also if you decide to change the output format in the future. – Peter Dec 26 '13 at 09:05
  • @hadley for some of my R code, I use an explicit call to `gc` to make sure R doesn't hog up resources (for other processes to run smoothly in parallel with R). This is a different question from "does calling `gc` add value when you run R and only R", but is still an important use of `gc` for my needs. – eddi Jan 02 '14 at 17:22
  • @eddi that's a valid reason, but I believe will only work on some operating systems. – hadley Jan 02 '14 at 19:18
  • @hadley it works on both linux and windows for me (a quick test is to check R memory usage after creating and deleting a large object and running `gc` afterwards) – eddi Jan 02 '14 at 19:48
  • Consider using the new `fwrite`, see [31. in the news for 1.9.7](https://github.com/Rdatatable/data.table) and [installation instructions](https://github.com/Rdatatable/data.table/wiki/Installation) for the latest version – MichaelChirico Apr 08 '16 at 04:47
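
As the comments above suggest, newer versions of data.table make the append-to-disk pattern simpler still: `fread` for fast reading and, from 1.9.7 onwards, `fwrite` with an `append` argument. A rough sketch of the inner loop in that style (the folder path, file pattern, flag variable and output name are placeholders; depending on your data.table version, reading .csv.gz directly with `fread` may require the R.utils package, otherwise decompress first or keep `read.csv(gzfile(...))`):

library(data.table)

first.file = TRUE
for (filename in list.files("some_folder", pattern = "csv.gz$", full.names = TRUE)) {
    file = fread(filename)

    # ...do some processing to file...

    fwrite(file, "masterfile.csv",
           append = !first.file,     # append on every pass after the first
           col.names = first.file)   # write the header only once
    first.file = FALSE
}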