
I am trying to read in and manipulate data that I have stored in large data sets. Each file is about 5 GB. I mostly need to be able to grab chunks of specific data out of these data sets. I also have a similar 38 MB file that I use for testing. I initially used read.table to read in chunks of the file using nrows and skip. However, this takes a huge amount of time, because skipping an ever-growing number of rows is itself time consuming. Here is the code I had:

numskip = 0  # how many lines in the file to skip
cur_list = read.table("file.txt", header = TRUE, sep = ",", nrows = 200000, skip = numskip, col.names = col)  # col is a vector of column names

I set this up in a while loop, increasing numskip to grab the next chunk of data, but as numskip grew, the process slowed down significantly.
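
Roughly, the loop looked like the sketch below (reconstructed from memory; the 200,000-row chunk size is just what I was using, and later chunks need header = FALSE plus one extra skipped line because only the first chunk has the header row in front of it):

numskip = 0         # how many data rows have been consumed so far
chunksize = 200000  # rows per chunk
repeat {
  cur_list = tryCatch(
    read.table("file.txt", header = FALSE, sep = ",", nrows = chunksize,
               skip = numskip + 1, col.names = col),  # +1 skips the header line
    error = function(e) NULL)                         # read.table errors once nothing is left to read
  if (is.null(cur_list) || nrow(cur_list) == 0) break
  # ... work on cur_list here ...
  numskip = numskip + chunksize
}

The slowdown comes from read.table re-scanning all of the skipped lines on every pass, so each chunk costs more than the one before it.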

I briefly tried using readLines to read the data in line by line, but a few threads pointed me towards the sqldf package instead. I wrote the following bit of code:

library(sqldf)
f = file("bigfile.txt")
dataset = sqldf("select * from f where CustomerID = 7127382")  # example of what I would like to be able to grab

From what I understand, sqldf will let me use SQL queries to pull subsets of the data out of the database without R having to process the whole file, provided that the resulting subset isn't itself too big for R to handle.

The problem is that my 4 GB machine runs out of memory when I run this on the large files (though not on the smaller test file). I find this odd, because I know SQLite can handle files much larger than 5 GB and R shouldn't be doing any of the processing. Would using PostgreSQL help? Do I just need a better machine with more RAM? Should I give up on sqldf and find a different way to do this?

To wrap this up, here's an example of the data I am working with:

"Project" "CustomerID" "Stamp" "UsagePoint" "UsagePointType" "Energy"  
21           110981       YY         40            Red          0.17
21           110431       YY         40            Blue         0.19
22           120392       YY         40            Blue         0.20
22           210325       YY         40            Red          0.12

Thanks

Dan M
  • Maybe http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r or http://stackoverflow.com/questions/26861951/using-read-csv-sql-to-select-multiple-values-from-a-single-column might help – user20650 Dec 30 '14 at 02:40
  • You need to create a database on the hard drive, either by using the argument dbname = tempfile() or by creating a permanent database. See JD Long's answer here: http://stackoverflow.com/a/1820610/1831980 – Rasmus Larsen Mar 07 '17 at 19:27

2 Answers


Have you tried

dat <- read.csv.sql(file = "file.txt", sql = "select * from file where CustomerID = 7127382")
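
For example, a slightly fuller version of that call (a sketch; the header and sep settings are assumptions based on the sample data in the question):

library(sqldf)

# read.csv.sql stages the file in an on-disk SQLite database (dbname, a
# temporary file by default), runs the query there, and only the matching
# rows are read back into R, so the whole 5 GB never has to fit in memory.
dat <- read.csv.sql(file   = "file.txt",
                    sql    = "select * from file where CustomerID = 7127382",
                    header = TRUE,
                    sep    = ",",
                    dbname = tempfile())

The same dbname = tempfile() idea works when calling sqldf() directly, as one of the comments under the question points out.
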
user51855

You're right about sqldf, and there are a ton of other great big-data tools in R, including bigmemory.

Converting to CSV or JSON can help (use RJSONIO for the latter), and you can also load your data into a relational, NoSQL, Hadoop, or Hive database first and read it in via RODBC, which is what I'd highly recommend in your case.
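
As one concrete variant of that idea (a sketch using DBI/RSQLite rather than RODBC; the table and column names come from the sample data, and it assumes a recent RSQLite, whose dbWriteTable can import a delimited file directly):

library(DBI)
library(RSQLite)

# One-time load: import the delimited file straight into an on-disk SQLite
# database, so the full 5 GB file never has to sit in R's memory.
con <- dbConnect(RSQLite::SQLite(), "usage.sqlite")
dbWriteTable(con, "usage", "file.txt", sep = ",", header = TRUE)

# Afterwards, only the rows you ask for come back into R.
chunk <- dbGetQuery(con, "select * from usage where CustomerID = 7127382")
dbDisconnect(con)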

Also see data.table::fread and the CRAN High-Performance Computing Task View.
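
For instance, fread is far faster than read.table, and its select argument keeps unneeded columns out of memory entirely (a sketch; the column names are taken from the sample data, and whether the whole file then fits still depends on your RAM):

library(data.table)

# fread parses the file much faster than read.table; `select` loads only the
# columns you actually need, which cuts the memory footprint of the table.
usage <- fread("file.txt", sep = ",", header = TRUE,
               select = c("CustomerID", "Stamp", "Energy"))

# Keyed subsetting then makes per-customer lookups fast.
setkey(usage, CustomerID)
usage[.(7127382)]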

Hack-R