I am trying to read in and manipulate data stored in several large files, each about 5 GB. I mostly need to grab chunks of specific data out of these files. I also have a similar 38 MB file that I use for testing. I initially used read.table to read the file in chunks using 'nrows' and 'skip', but this takes a huge amount of time because skipping an ever-increasing number of rows is expensive. Here is the code I had:
numskip = 0 #how many lines in the file to skip
cur_list = read.table("file.txt", header = TRUE, sep = ',', nrows = 200000, skip = numskip, col.names = col) #col is a vector of column names
I set this up in a while loop, increasing numskip to grab the next chunk of data, but as numskip grew the process slowed significantly.
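For reference, this is roughly what the loop looked like (simplified; the chunk size and file name are placeholders, and after the first chunk the header line has already been consumed, so here I use header = FALSE with skip = numskip + 1):

numskip = 0
chunksize = 200000
done = FALSE
while (!done) {
  cur_list = read.table("file.txt", header = FALSE, sep = ',',
                        nrows = chunksize, skip = numskip + 1,
                        col.names = col) # col is my vector of column names
  # ... work with cur_list here ...
  if (nrow(cur_list) < chunksize) done = TRUE # last (partial) chunk reached
  numskip = numskip + chunksize # note: errors if the row count is an exact multiple of chunksize
}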
I briefly tried using readLines to read the data in line by line, but a few threads pointed me towards the sqldf package. I wrote the following bit of code:
library(sqldf)
f = file("bigfile.txt")
dataset = sqldf("select * from f where CustomerID = 7127382") #example of what I would like to be able to grab
From what I understand, sqldf will let me run SQL queries against the file and return just the matching subset to R, without R having to read the whole file into memory, provided that the subset itself isn't too big for R to handle.
The problem is that my 4 GB machine runs out of memory when I run this on the large files (though not on the smaller test file). I find this odd, because I know SQLite can handle files much larger than 5 GB and R shouldn't be doing any of the heavy lifting. Would using PostgreSQL help? Do I just need a better machine with more RAM? Or should I give up on sqldf and find a different way to do this?
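For what it's worth, the next thing I was planning to try, based on my reading of the sqldf documentation, is read.csv.sql with dbname = tempfile(), which (as I understand it) makes SQLite build its table in a file on disk rather than in memory; the CustomerID value below is just an example:

library(sqldf)
# my understanding is that dbname = tempfile() makes sqldf build the
# SQLite table in a temporary file on disk instead of :memory:, and
# only the query result comes back into R
dataset = read.csv.sql("bigfile.txt",
                       sql = "select * from file where CustomerID = 7127382",
                       header = TRUE, sep = ",",
                       dbname = tempfile())

I don't know whether that is enough to avoid the out-of-memory problem, which is why I'm asking.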
To wrap this up, here's an example of the data I am working with:
"Project" "CustomerID" "Stamp" "UsagePoint" "UsagePointType" "Energy"
21 110981 YY 40 Red 0.17
21 110431 YY 40 Blue 0.19
22 120392 YY 40 Blue 0.20
22 210325 YY 40 Red 0.12
Thanks