I'm trying to read a large file (~5GB) into R and process the data. I can successfully read in the entire 5GB file, but the trouble comes when I apply my processing. I don't have a great grasp of memory basics in R, and I'm hoping some of you can help me understand better.
Here is an example of what I'm running:
library(data.table)                                # fread() comes from data.table
file <- fread("file.txt")                          # file.txt is 5GB of unprocessed data
t.str <- strptime(file$time, "%m/%d/%Y %H:%M:%S")  # convert the time column to a date-time class
month <- as.numeric(format(t.str, "%m"))           # create a month vector from the parsed times
high <- ifelse(file$Age > 70, 1, 0)                # create a 0/1 vector from the Age column
# There are about ten more lines that operate on this file.
fread does a fine job of reading in the file, and the first three or four operations I run on the 'file' data frame work. However, after a certain number of them have run, I get an error that says:
C stack usage 19923892 is too close to the limit
I'm pretty sure the issue isn't any one particular command, since the same commands worked on smaller data sets. I've read a bit about what stacks are, but this error isn't totally making sense to me. Does it mean that R is using a pointer to run through these big vectors, and I've run out of pointer space? I read about a similar issue here:
Error: C stack usage is too close to the limit
One user suggested increasing the stack size in the shell. I tried looking into this further, but I'm not sure how to proceed. Here is what they suggested:
$ ulimit -s # print default
8192
$ R --slave -e 'Cstack_info()["size"]'
size
8388608
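From what I can piece together, the idea would be to raise the limit in the shell and then launch R from that same shell so it inherits the larger stack. Something like this (assuming a bash-like shell; I haven't confirmed this fixes my case):

$ ulimit -s 65536   # raise the soft stack limit for this shell session
$ R                 # start R from the same shell so it picks up the new limit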
Can anyone help me understand what this means, or just explain a bit about stack usage in R? Or does anyone know of a better way to process this data that doesn't exceed the stack limit? I'm not sure how to give you guys reproducible data at this size.
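One thing I've been experimenting with (I'm not sure it's the right approach) is letting data.table add the columns in place with := rather than building separate big vectors, and using as.POSIXct instead of strptime, since strptime returns a POSIXlt object that stores each timestamp as a list of components:

library(data.table)
file <- fread("file.txt")
# add derived columns by reference; avoids building separate copies of big vectors
file[, t.str := as.POSIXct(time, format = "%m/%d/%Y %H:%M:%S")]
file[, month := month(t.str)]          # data.table's month() helper
file[, high := as.integer(Age > 70)]   # 0/1 indicator for Age > 70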
Edit: here is an example of the data:
PersonID   time                  Energy   Age
1301839    07/24/2013 07:15:00   0.13     68
1301521    07/24/2013 07:30:00   0.19     68
1301890    07/24/2013 07:45:00   0.10     68
1301890    07/24/2013 08:00:00   0.06     68
1307112    07/24/2013 08:15:00   0.01     68