
I'm trying to read a large file (~5GB) into R and process the data. I can successfully read in the entire 5GB file, but the trouble comes when I apply my processing. I don't have a great grasp of memory basics in R, and I'm hoping some of you can help me understand better.

Here is an example of what I'm running:

library(data.table) # for fread

file <- fread("file.txt") # file.txt is 5GB of unprocessed data
t.str <- strptime(file$time, "%m/%d/%Y %H:%M:%S") # convert column to date class
month <- as.numeric(format(t.str, "%m")) # create vector from file column
high <- ifelse(file$age > 70, 1, 0) # create vector from file column
# There are about ten more lines that operate on this file.

fread does a fine job of reading in the file, and the first three or four operations that I run on the `file` data frame work. However, after a certain number of them have run, I get an error that says:

C stack usage 19923892 is too close to the limit

I'm pretty sure the issue isn't any one command in particular, since the same code worked on smaller data sets. I've read a bit about what a stack is, but this error still isn't making sense to me. Does it mean that R is using a pointer to run through these big vectors and I've run out of pointer space? I read about a similar issue here:

Error: C stack usage is too close to the limit

One user suggested increasing the stack size in the shell. I tried looking into this further, but I'm not sure how to proceed. Here is what they suggested:

$ ulimit -s # print default
8192
$ R --slave -e 'Cstack_info()["size"]'
size 
8388608

Can anyone help me understand what this means, or just explain a bit about stack usage in R? Or does anyone know of a better way to process this data that doesn't exceed the stack usage? I'm not sure how to give you guys reproducible data.

Edit to add an example of the data:

PersonID     time              Energy   Age
1301839    07/24/2013 07:15:00  0.13    68
1301521    07/24/2013 07:30:00  0.19    68
1301890    07/24/2013 07:45:00  0.10    68
1301890    07/24/2013 08:00:00  0.06    68
1307112    07/24/2013 08:15:00  0.01    68
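
And in case it's easier to work with, here is the same sample rebuilt as a small R data frame (I've kept `time` as a character column, the way it comes out of fread before the strptime conversion):

# Small reproducible sample matching the rows shown above
file <- data.frame(
  PersonID = c(1301839, 1301521, 1301890, 1301890, 1307112),
  time = c("07/24/2013 07:15:00", "07/24/2013 07:30:00", "07/24/2013 07:45:00",
           "07/24/2013 08:00:00", "07/24/2013 08:15:00"),
  Energy = c(0.13, 0.19, 0.10, 0.06, 0.01),
  Age = c(68, 68, 68, 68, 68),
  stringsAsFactors = FALSE
)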
  • Can you post a few lines of the data file? For starters, Simon's [fasttime](http://rforge.net/fasttime) will beat `strptime` etc pp – Dirk Eddelbuettel Mar 16 '15 at 20:56
  • Data example added. This package looks like it's faster, thanks. Does it deal with stacks differently? – Ore M Mar 16 '15 at 21:07
  • Ok, non-standard date format so you will have to parse it via `strptime`. Also your format string is `%Y-%d-%d`, but needs to be `%m/%d/%Y` (followed by `%H:%M:%S` as you have). – Dirk Eddelbuettel Mar 16 '15 at 21:07
  • Ah right, it's correct in my original code; I'll edit the correct format in. – Ore M Mar 16 '15 at 21:15
  • The answer in the referenced question shows how you can change the limit to 16384 KB: `ulimit -s 16384 # enlarge stack limit to 16 megs`. It's in the next box after the one you quoted; that's how to change it. From the error you got, try something larger than 19,924 KB. Sorry, I can't really answer your request to understand what it means. – Andre Michaud Mar 16 '15 at 23:42
  • I know this is very basic, but in what shell do I enter that command? Where do I have to be located? I tried it in the bash shell, but that didn't work. I think I'm going to spend some time learning bash commands because it seems useful. Thanks for the answer and the clarification, I appreciate it. – Ore M Mar 17 '15 at 04:51

1 Answer


Sorry, this really isn't an answer, but I don't have enough points to comment. You could try reading and processing the data in chunks, or check out some of the large-memory packages in the CRAN Task View on High Performance Computing. You can also read about memory usage here.
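
For what it's worth, here is a rough sketch of what chunked processing might look like with fread's `skip`/`nrows` arguments. The chunk size is arbitrary, the `read_chunk` helper is just something I made up, and `col.names` needs a reasonably recent data.table, so treat this as a starting point rather than a drop-in solution:

library(data.table)

read_chunk <- function(path, skip, nrows, col_names) {
  # fread stops with an error once skip runs past the end of the file,
  # so treat that as "no more data"
  tryCatch(
    fread(path, skip = skip, nrows = nrows, header = FALSE, col.names = col_names),
    error = function(e) NULL
  )
}

chunk_size <- 1e6                              # rows per pass; tune so a chunk fits in memory
header     <- names(fread("file.txt", nrows = 0))  # read just the column names

offset  <- 1                                   # skip the header line
results <- list()
repeat {
  chunk <- read_chunk("file.txt", offset, chunk_size, header)
  if (is.null(chunk) || nrow(chunk) == 0) break

  # same per-row operations as in the question, but on a manageable slice
  # (Age is the column name from the posted sample)
  t.str <- strptime(chunk$time, "%m/%d/%Y %H:%M:%S")
  chunk[, month := as.numeric(format(t.str, "%m"))]
  chunk[, high  := ifelse(Age > 70, 1, 0)]

  results[[length(results) + 1]] <- chunk      # better still: keep only a per-chunk summary
  offset <- offset + chunk_size
}
processed <- rbindlist(results)

Each call re-reads the file from the top to find its starting line, so this isn't the fastest approach, but only one chunk is ever held in memory at a time.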

jentjr
    That's what I ended up doing. Once I have chunks that are small enough, it works, and it's not that slow. Thanks for the reference to High Performance Computing, I'll definitely check it out. – Ore M Mar 17 '15 at 04:49