
I have a dataset with 5 billion lines, too big to import as-is into base R. My understanding is that this limit comes from the use of 32-bit indexing on vectors: a vector may have at most 2^31 - 1 elements, even in a 64-bit build of R.
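
For reference, that limit is the largest value an R integer can represent:

.Machine$integer.max   # 2147483647, i.e. 2^31 - 1
as.integer(2^31)       # NA: 2^31 does not fit in an R integer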

So I am exploring the ff package. This package is said to be able to handle 'large' datasets, here and here for example.

But it breaks once the row counter reaches 2^31. Am I missing something?

library(ff)

# result.csv.gz holds a single numeric column with no header
probas.ff <- read.csv.ffdf(file="result.csv.gz"
                           ,header=FALSE
                           ,colClasses=c('numeric')
                           ,col.names=c('proba')
                           ,first.rows=100000
                           ,VERBOSE=TRUE
                           )

This produces the following error:

read.table.ffdf 2143389345..2145486496 (2097152)  csv-read=0.498sec ffdf-write=0.02sec
read.table.ffdf 2145486497..NA (2097152)  csv-read=0.411sec
Error in if (v1 == d1) return(x) : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In N + n : NAs produced by integer overflow
2: In nff + n : NAs produced by integer overflow
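
The warnings point at the failure mode: the running row counters in read.table.ffdf (N and nff in the warnings) appear to be plain 32-bit R integers. The next chunk would have ended at 2145486497 + 2097152 - 1 = 2147583648, which is past 2^31 - 1 = 2147483647, so the addition overflows to NA and the if (v1 == d1) test receives a missing value. The overflow reproduces in plain R:

2145486497L + 2097152L   # NA, with "NAs produced by integer overflow"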

FYI, as a plan B I'm using a divide-and-conquer strategy with RMySQL, which works well even if the code is a bit ugly (comments welcome). As a first step, I simply want to plot a histogram.

library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "blah")

# Fetch the rows in three slices, each safely below the 2^31 - 1 limit
rs1 <- dbSendQuery(con, "SELECT proba FROM xxx LIMIT 0,1999999999")
data1 <- dbFetch(rs1, n=-1)
dbClearResult(rs1)

rs2 <- dbSendQuery(con, "SELECT proba FROM xxx LIMIT 2000000000,1999999999")
data2 <- dbFetch(rs2, n=-1)
dbClearResult(rs2)

rs3 <- dbSendQuery(con, "SELECT proba FROM xxx LIMIT 4000000000,1999999999")
data3 <- dbFetch(rs3, n=-1)
dbClearResult(rs3)
dbDisconnect(con)

breaks <- seq(0,1,0.001)

# Bin each slice separately, then add the counts as doubles so the
# totals cannot overflow 32-bit integers
h1 <- hist(data1$proba, breaks=breaks, plot=FALSE)
h2 <- hist(data2$proba, breaks=breaks, plot=FALSE)
h3 <- hist(data3$proba, breaks=breaks, plot=FALSE)

all.counts <- as.double(h1$counts) + as.double(h2$counts) + as.double(h3$counts)

png("hist.png")
barplot(all.counts)
dev.off()
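
A cleaner variant of plan B would push the binning into MySQL itself, so that only the ~1000 bin counts ever cross into R. A minimal sketch, assuming the same xxx table and the 0.001-wide bins used above:

library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "blah")

# Let MySQL do the binning; COUNT(*) comes back to R as a double,
# which represents 5e9 exactly
counts <- dbGetQuery(con, "
    SELECT FLOOR(proba * 1000) AS bin, COUNT(*) AS n
    FROM xxx
    GROUP BY bin
    ORDER BY bin")
dbDisconnect(con)

png("hist.png")
barplot(counts$n, names.arg = counts$bin / 1000)
dev.off()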
Olivier Delrieu
  • *Don't* try to read everything in memory. It's as simple as that. You are using MySQL. Use SQL queries to calculate the data you want. Loading everything in memory isn't necessarily *faster*. By the time you finish loading the data, before you can start processing it, the database would have returned the final result – Panagiotis Kanavos Nov 10 '17 at 15:02
  • especially with features like this available: http://mysqlserverteam.com/histogram-statistics-in-mysql/ – hrbrmstr Nov 10 '17 at 15:03
  • If you want to process a *lot* of data in any program, not just R, create an algorithm that can work on a stream of incoming data. For example, to calculate an average you don't need to load everything in memory. You can read each record and update a sum and a count variable. No matter how many rows you process, by the time you finish reading the data, the average will be ready. That's how analytic functions in databases work by the way – Panagiotis Kanavos Nov 10 '17 at 15:06
  • Finally, to import data in the database from CSVs, don't write your own code. Use the database's bulk import tools. They are made to load data in the fastest way possible, with minimum overhead. – Panagiotis Kanavos Nov 10 '17 at 15:08
  • Thanks - I should not have included the MySQL code (which works, btw)... my question is about the ff package. – Olivier Delrieu Nov 10 '17 at 15:12
  • @OlivierDelrieu loading everything in RAM isn't going to make things go *faster*. Don't do it if you don't have to. If you absolutely insist, use a 64-bit version of R, preferably Revolution R's distribution (now Microsoft's). It includes extensions that allow you to load more data than the memory available, by swapping data out to the disk as necessary. Again, your code will run *faster* if you *don't* have to load everything in memory before processing. – Panagiotis Kanavos Nov 10 '17 at 15:22
  • Thanks Panagiotis - I'm not after speed. I have access to machines with up to 2TB of RAM, and I'm using a 64-bit version of R. But base R is nevertheless limited: data length cannot be longer than 2^31 - 1. What I would like to know is whether ff can handle data bigger than that. Also, could you please elaborate on how to _stream_ data in R? That's something I can do with C++ and GPUs, not R. Thanks. – Olivier Delrieu Nov 10 '17 at 16:43
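
On the streaming point, a minimal sketch in base R, assuming the single-column result.csv.gz from the question: read it through an open connection in fixed-size chunks and fold each chunk into the bin counts, so no vector longer than one chunk is ever held in memory.

breaks <- seq(0, 1, 0.001)
counts <- double(length(breaks) - 1)   # 1000 bins, accumulated as doubles

con <- gzfile("result.csv.gz", open = "r")
repeat {
  # scan() resumes where the previous call stopped on an open connection
  chunk <- scan(con, what = double(), nmax = 1e6, quiet = TRUE)
  if (length(chunk) == 0) break
  counts <- counts + hist(chunk, breaks = breaks, plot = FALSE)$counts
}
close(con)

png("hist.png")
barplot(counts)
dev.off()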
