I have a dataset with 5G lines, too big to import as-is in base R. My understanding is that this limit arises from the use of 32-bit indexes on vectors. As a result, vectors can hold at most 2^31 - 1 elements, even in a 64-bit version of R.
So I am exploring the ff package. This package is said to be able to handle 'large' datasets, here and here for example.
But it breaks when the 2^31-th row is reached (see the log below: the last chunk starts at row 2145486497, just short of 2^31 = 2147483648). Am I missing something?
library(ff)
probas.ff <- read.csv.ffdf(file="result.csv.gz"
,header=FALSE
,colClasses=c('numeric')
,col.names=c('proba')
,first.rows=100000
,VERBOSE=TRUE
)
This produces the following error:
read.table.ffdf 2143389345..2145486496 (2097152) csv-read=0.498sec ffdf-write=0.02sec
read.table.ffdf 2145486497..NA (2097152) csv-read=0.411sec
Error in if (v1 == d1) return(x) : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In N + n : NAs produced by integer overflow
2: In nff + n : NAs produced by integer overflow
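The overflow reported in the warnings can be reproduced in isolation. This is only a minimal sketch: it assumes the row counter is held as a plain R integer, which the warnings suggest but which I have not checked in the ff source. The variable names are mine, and the two numbers are taken from the last log line.

```r
# Largest value an R integer can hold: 2^31 - 1
.Machine$integer.max                      # 2147483647

# Start row of the last chunk and the chunk size, both from the log above
chunk.start <- 2145486497L
chunk.size  <- 2097152L

# Integer addition past 2^31 - 1 yields NA (normally with a warning)
next.start.int <- suppressWarnings(chunk.start + chunk.size)
is.na(next.start.int)                     # TRUE

# The same sum carried out in double precision is exact
next.start.dbl <- as.numeric(chunk.start) + as.numeric(chunk.size)
next.start.dbl > .Machine$integer.max     # TRUE
```

The NA in "2145486497..NA" and the later "missing value where TRUE/FALSE needed" are consistent with a counter of this kind going NA and then being used in a comparison.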
FYI, as a plan B, I'm using a divide-and-conquer strategy with RMySQL. It works well, but the code is a bit ugly (comments welcome). As a first step, I simply want to plot a histogram.
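library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "blah")
# Fetch three slices of at most 2e9 rows each, below the 2^31 - 1 vector limit.
# LIMIT takes offset,row_count: a row count of 2000000000 keeps the slices
# contiguous (1999999999 would skip one row between consecutive slices).
# Without an ORDER BY, the row order across the three queries is not guaranteed.
rs1 <- dbSendQuery(con, "SELECT proba FROM xxx LIMIT 0,2000000000")
data1 <- dbFetch(rs1, n=-1)
dbClearResult(rs1)
rs2 <- dbSendQuery(con, "SELECT proba FROM xxx LIMIT 2000000000,2000000000")
data2 <- dbFetch(rs2, n=-1)
dbClearResult(rs2)
rs3 <- dbSendQuery(con, "SELECT proba FROM xxx LIMIT 4000000000,2000000000")
data3 <- dbFetch(rs3, n=-1)
dbClearResult(rs3)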
breaks <- seq(0,1,0.001)
h1 <- hist(data1$proba,breaks=breaks,plot=FALSE)
h2 <- hist(data2$proba,breaks=breaks,plot=FALSE)
h3 <- hist(data3$proba,breaks=breaks,plot=FALSE)
all.counts <- as.double(h1$counts) + as.double(h2$counts) + as.double(h3$counts)
png("hist.png")
barplot(all.counts)
dev.off()
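The same divide-and-conquer idea can probably be applied directly to the gzipped file, without the database. The following is a rough sketch in base R, not something I have run on the full dataset. It assumes one numeric value per line, no header, and all values inside the range of the breaks (hist() stops with an error otherwise). The function name chunked.hist and the parameter chunk.lines are made up for illustration, and the small temporary file only stands in for result.csv.gz.

```r
# Accumulate histogram counts chunk by chunk, so the full column
# is never held in memory at once.
chunked.hist <- function(path, breaks, chunk.lines = 1e6) {
  con <- gzfile(path, open = "r")
  on.exit(close(con))
  # Doubles rather than integers, so totals may exceed 2^31 - 1
  counts <- numeric(length(breaks) - 1)
  repeat {
    # An open connection is read from its current position
    x <- scan(con, what = numeric(), nlines = chunk.lines, quiet = TRUE)
    if (length(x) == 0) break             # end of file reached
    counts <- counts + hist(x, breaks = breaks, plot = FALSE)$counts
  }
  counts
}

# Tiny demonstration on a temporary gzipped file
tmp <- tempfile(fileext = ".csv.gz")
out <- gzfile(tmp, open = "w")
writeLines(as.character(c(0.1, 0.1, 0.4, 0.9, 0.95)), out)
close(out)

demo.counts <- chunked.hist(tmp, breaks = seq(0, 1, 0.25), chunk.lines = 2)
print(demo.counts)
```

On the real data the call would look like chunked.hist("result.csv.gz", breaks = seq(0, 1, 0.001)), followed by barplot() on the result as above. The per-chunk counts are summed as doubles for the same reason as.double() is used in the RMySQL version.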