
I am trying to read in and manipulate data that I have stored in large data sets. Each file is about 5 GB. I mostly need to be able to grab chunks of specific data out of these data sets. I also have a similar 38 MB file that I use for testing. I initially used read.table to read in chunks of the file using nrows and skip. However, this takes a huge amount of time, because skipping an ever-growing number of rows is itself time consuming. Here is the code I had:

numskip = 0  # how many lines in the file to skip
cur_list = read.table("file.txt", header = TRUE, sep = ",", nrows = 200000, skip = numskip, col.names = col)  # col is a vector of column names

I set this up in a while loop, increasing numskip to grab the next chunk of data, but as numskip grew, the process slowed down significantly.
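
Roughly, the loop looked like the sketch below (reconstructed from memory; the 200,000-row chunk size is just what I was using, and later chunks need header = FALSE plus one extra skipped line because only the first chunk has the header row in front of it):

numskip = 0         # how many data rows have been consumed so far
chunksize = 200000  # rows per chunk
repeat {
  cur_list = tryCatch(
    read.table("file.txt", header = FALSE, sep = ",", nrows = chunksize,
               skip = numskip + 1, col.names = col),  # +1 skips the header line
    error = function(e) NULL)                         # read.table errors once nothing is left to read
  if (is.null(cur_list) || nrow(cur_list) == 0) break
  # ... work on cur_list here ...
  numskip = numskip + chunksize
}

The slowdown comes from read.table re-scanning all of the skipped lines on every pass, so each chunk costs more than the one before it.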

I briefly tried using readLines to read the data in line by line, but a few threads pointed me towards the sqldf package instead. I wrote the following bit of code:

library(sqldf)
f = file("bigfile.txt")
dataset = sqldf("select * from f where CustomerID = 7127382")  # example of what I would like to be able to grab

From what I understand, sqldf will let me use SQL queries to pull subsets of the data out of the database without R having to process the whole file, provided that the resulting subset isn't itself too big for R to handle.

The problem is that my 4 GB machine runs out of memory when I run this on the large files (though not on the smaller test file). I find this odd, because I know SQLite can handle files much larger than 5 GB and R shouldn't be doing any of the processing. Would using PostgreSQL help? Do I just need a better machine with more RAM? Should I give up on sqldf and find a different way to do this?

To wrap this up, here's an example of the data I am working with:

"Project" "CustomerID" "Stamp" "UsagePoint" "UsagePointType" "Energy"  
21           110981       YY         40            Red          0.17
21           110431       YY         40            Blue         0.19
22           120392       YY         40            Blue         0.20
22           210325       YY         40            Red          0.12

Thanks

Dan M
  • Maybe http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r or http://stackoverflow.com/questions/26861951/using-read-csv-sql-to-select-multiple-values-from-a-single-column might help – user20650 Dec 30 '14 at 02:40
  • You need to create a database on the hard drive, either by using the argument dbname = tempfile() or by creating a permanent database. See JD Long's answer here: http://stackoverflow.com/a/1820610/1831980 – Rasmus Larsen Mar 07 '17 at 19:27

2 Answers


Have you tried

dat <- read.csv.sql(file = "file.txt", sql = "select * from file where CustomerID = 7127382")
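
For example, a slightly fuller version of that call (a sketch; the header and sep settings are assumptions based on the sample data in the question):

library(sqldf)

# read.csv.sql stages the file in an on-disk SQLite database (dbname, a
# temporary file by default), runs the query there, and only the matching
# rows are read back into R, so the whole 5 GB never has to fit in memory.
dat <- read.csv.sql(file   = "file.txt",
                    sql    = "select * from file where CustomerID = 7127382",
                    header = TRUE,
                    sep    = ",",
                    dbname = tempfile())

The same dbname = tempfile() idea works when calling sqldf() directly, as one of the comments under the question points out.
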
user51855

You're right about sqldf, and there are a ton of other great big-data tools in R, including bigmemory.

Converting to CSV or JSON can help (use RJSONIO for the latter), and you can also load your data into a relational, NoSQL, Hadoop, or Hive database first and read it in via RODBC, which is what I'd highly recommend in your case.
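
As one concrete variant of that idea (a sketch using DBI/RSQLite rather than RODBC; the table and column names come from the sample data, and it assumes a recent RSQLite, whose dbWriteTable can import a delimited file directly):

library(DBI)
library(RSQLite)

# One-time load: import the delimited file straight into an on-disk SQLite
# database, so the full 5 GB file never has to sit in R's memory.
con <- dbConnect(RSQLite::SQLite(), "usage.sqlite")
dbWriteTable(con, "usage", "file.txt", sep = ",", header = TRUE)

# Afterwards, only the rows you ask for come back into R.
chunk <- dbGetQuery(con, "select * from usage where CustomerID = 7127382")
dbDisconnect(con)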

Also see data.table::fread and the CRAN High-Performance Computing Task View.
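
For instance, fread is far faster than read.table, and its select argument keeps unneeded columns out of memory entirely (a sketch; the column names are taken from the sample data, and whether the whole file then fits still depends on your RAM):

library(data.table)

# fread parses the file much faster than read.table; `select` loads only the
# columns you actually need, which cuts the memory footprint of the table.
usage <- fread("file.txt", sep = ",", header = TRUE,
               select = c("CustomerID", "Stamp", "Energy"))

# Keyed subsetting then makes per-customer lookups fast.
setkey(usage, CustomerID)
usage[.(7127382)]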

Hack-R