
I'm having trouble loading a large text file; I'll post the code below. The file is ~65 GB and uses "|" as the delimiter. I have 10 of them, and the process I describe below has worked for 9 of the files, but the last one is giving me trouble. Note that about half of the other 9 files are larger than this one - about 70 GB.

# Libraries I'm using
library(readr)
library(dplyr)

# Callback for read_delim_chunked: keep only the rows where column 41 is "CA"
f <- function(x, pos) filter(x, x[,41] == "CA")

# Reading in the file. 
# Note that this has worked for 9/10 files. 
tax_history_01 <- read_delim_chunked( "Tax_History_148_1708_07.txt", 
    col_types = cols(`UNFORMATTED APN` = col_character()), 
    DataFrameCallback$new(f), chunk_size = 1000000, delim = "|")

This is the error message I get:

Error: cannot allocate vector of size 81.3 Mb
Error during wrapup: could not allocate memory (47 Mb) in C function 'R_AllocStringBuffer'

If it helps, Windows says the file is 69,413,856,071 bytes and readr is indicating 100% at 66198 MB. I've done some searching and really haven't a clue as to what's going on. I have a small hunch that there could be something wrong with the file (e.g. a missing delimiter).
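
If the problem really is the file itself, this is roughly how I plan to check that hunch - an untested sketch that streams the same file with readr and counts the "|" characters per line, flagging any line whose count differs from the header's:

# Untested sketch: count "|" per line in chunks and flag any line whose count
# differs from the header line's, which would suggest a missing/extra delimiter.
library(readr)

count_pipes <- function(x) nchar(x) - nchar(gsub("|", "", x, fixed = TRUE))

expected <- count_pipes(read_lines("Tax_History_148_1708_07.txt", n_max = 1))

check <- function(x, pos) {
  bad <- which(count_pipes(x) != expected)
  if (length(bad) > 0) cat("Suspect line(s):", pos + bad - 1, "\n")
}

read_lines_chunked("Tax_History_148_1708_07.txt",
    SideEffectChunkCallback$new(check), chunk_size = 1000000)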

Edit: Below is just a small sample of the resources I consulted. More specifically, what's giving me trouble is the "Error during wrapup: ... in C function 'R_AllocStringBuffer'" part - I can't find much on this error.

Some of the language in this post led me to believe that the limit of a string vector has been reached and that there is possibly a parsing error: R could not allocate memory on ff procedure. How come?

Saw this post, and it seemed I was facing a different issue; for me it's not really a calculation issue: R memory management / cannot allocate vector of size n Mb

I referred to this post regarding cleaning up my workspace. Not really an issue within a single import, but good practice when I ran the script importing all 10: Cannot allocate vector in R of size 11.8 Gb

Just more topics related to this: R Memory "Cannot allocate vector of size N"

Found this too but it's no help because of machine restrictions due to data privacy: https://rpubs.com/msundar/large_data_analysis

Just reading up on general good practices: http://adv-r.had.co.nz/memory.html and http://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html

maxo
  • R needs to have everything in memory. Unless you are working on a workstation with 196 GB of RAM, you're out of luck. – IRTFM Mar 02 '18 at 00:16
  • I'm reading it in chunks, not the entire file. It's worked for 9 other files, and half of them are bigger than this one. – maxo Mar 02 '18 at 00:19
  • Then you need to search for memory management issues in R. There are lots of answered questions on SO regarding that topic. – IRTFM Mar 02 '18 at 00:24
  • Any suspicion as to why this has worked 9 times before, even with a few files that are bigger? – maxo Mar 02 '18 at 00:33
  • R requires contiguous memory. Both R actions and system actions will cause the largest contiguous block to shrink over time. This is all explained in the multiple postings on this topic. Hence my downvote in exasperation. – IRTFM Mar 02 '18 at 00:40
  • That's certainly your prerogative. The half dozen or so searches I had done prior to posting weren't much help. I didn't think it was a memory management issue, because each chunk being loaded is well under my 32 GB RAM limit. The best I could find was a post suggesting a memory limit for a single string vector, hence the suspicion that a delimiter is missing somewhere. And it has worked on 9 other files of identical structure, calling gc() after each one. – maxo Mar 02 '18 at 01:03
  • If you edit your question (perhaps to include a summary of SO searches you had done?) I'll be able to reverse my downvote. – IRTFM Mar 02 '18 at 04:21
  • For the record I do appreciate your reference to contiguous memory. That hadn't come up in any of the resources I consulted so that wasn't on the radar of issues to address. – maxo Mar 02 '18 at 04:44
  • The error message you got might not appear if you were running this script with a clean session (reboot and restart R) with no other applications in your OS workspace. – IRTFM Mar 02 '18 at 05:24
  • I did do that, but that was before I very recently realized that the column I specify in the code was not set correctly: it's a string made up of numbers, so the leading zeros were being dropped. I reran the script and again this file in particular failed - all the others were filtered and written as required. I am removing each file's object after reading and writing it, and clearing garbage for good measure. Based on the size of the file reported by Windows and the progress that readr reports, I think the whole file is processed, but right after that it gives me the error message above and the object isn't created. – maxo Mar 02 '18 at 05:32
  • For what it's worth, I did quickly try to query this file using DB Browser for SQLite, and it failed to load the file. – maxo Mar 02 '18 at 05:33

2 Answers


Look at how wide the files are. If this one is a very wide file (many columns), then your chunk_size = 1000000 could be making this the biggest single chunk that gets read in at one time, even if it's not the biggest file overall.

Also, ensure that you're freeing (rm()) the previously read blocks, so that the memory is returned and becomes available again. If you're relying on simply overwriting the previous chunk, you've effectively doubled the memory requirement.
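
Roughly something like this - a sketch only, assuming a loop over the 10 files similar to yours (the file-name pattern, the output names, and the smaller chunk_size are my assumptions):

# Sketch only: process one file at a time with a smaller chunk_size, write the
# filtered result out, then drop it and collect garbage before the next file.
library(readr)
library(dplyr)

f <- function(x, pos) filter(x, x[[41]] == "CA")   # [[ ]] gives a plain vector

files <- list.files(pattern = "^Tax_History_.*\\.txt$")   # assumed naming
for (fn in files) {
  tax_history <- read_delim_chunked(fn,
      col_types = cols(`UNFORMATTED APN` = col_character()),
      DataFrameCallback$new(f), chunk_size = 100000, delim = "|")
  write_csv(tax_history, sub("\\.txt$", "_CA.csv", fn))
  rm(tax_history)   # free the filtered result before the next read
  gc()              # and trigger a collection so less is held between files
}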

dsz

I just ran into this error - I went through maxo's links, read the comments, and still no solution.

It turns out that, in my case, the CSV I was reading had been corrupted during the copy (I checked this with an md5sum comparison, which - in hindsight - I should have done right away).

My guess is that, due to the nature of the corrupted data, there was an open quote without its corresponding closing quote, leading to the rest of the file being read in as one VERRRRYY LARRRGE string.
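
If you want to check for the same thing, something along these lines should do it - a rough sketch using R's tools::md5sum rather than the command-line md5sum, with placeholder file paths:

# Rough sketch: compare checksums of the source and the copy, then scan the copy
# for lines containing an odd number of quote characters (a possible open quote).
library(tools)

md5sum(c("source/data.csv", "copy/data.csv"))   # the two hashes should match

con <- file("copy/data.csv", open = "r")
line_no <- 0
repeat {
  lines <- readLines(con, n = 100000)
  if (length(lines) == 0) break
  quotes <- nchar(lines) - nchar(gsub("\"", "", lines, fixed = TRUE))
  odd <- which(quotes %% 2 == 1)
  if (length(odd) > 0) cat("Odd quote count on line(s):", line_no + odd, "\n")
  line_no <- line_no + length(lines)
}
close(con)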

Anyway, hope this helps someone in the future :-).

orrymr