First, the slow part is probably cast(...)
, so rather than doing that twice for each record, in each query, why don't you leave start
and realized
as timestamps, and change the query to accommodate that.
Second, the data.table
option is still about 100 times faster (but see the bit at the end about indexing with sqldf
).
library(sqldf)
library(data.table)
N <- 1e6
# sqldf option
set.seed(1)
df <- data.frame(start=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
realized=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
V1=rnorm(N), V2=rpois(N,4))
sometime <- "2000-01-05 00:00:00"
query <- "SELECT V1, V2 FROM df WHERE start <= datetime('%s') and realized > datetime('%s')"
query <- sprintf(query,sometime,sometime)
system.time(result.sqldf <- sqldf(query))
# user system elapsed
# 12.17 0.03 12.23
# data.table option
set.seed(1)
DT <- data.table(start=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
realized=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
V1=rnorm(N), V2=rpois(N,4))
setkey(DT,start,realized)
system.time(result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)])
# user system elapsed
# 0.15 0.00 0.15
Note that the two result-sets will be sorted differently.
EDIT Based on comments below from @G.Grothendieck (author of the sqldf
package).
This is turning into a really good comparison of the packages...
# code from G. Grothendieck comment
sqldf() # opens connection
sqldf("create index ix on df(start, realized)")
query <- fn$identity("SELECT V1, V2 FROM main.df WHERE start <= '$sometime' and realized > '$sometime'")
system.time(result.sqldf <- sqldf(query))
sqldf() # closes connection
# user system elapsed
# 1.28 0.00 1.28
So creating an index speeds sqldf
by about a factor of 10 in this case. Index creation is slow but you only have to do it once. "key" creation in data.table (which physically sorts the table) is extremely fast, but does not improve performace all that much in this case (only about a factor of 2).
Benchmarking using system.time()
is a bit risky (1 data point), so it's better to use microbenchmark(...)
. Note that for this to work, we have to run the code above and leave the connection open (e.g., remove the last call the sqldf()
.)
f.dt <- function() result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)]
f.sqldf <- function() result.sqldf <- sqldf(query)
library(microbenchmark)
microbenchmark(f.dt(),f.sqldf())
# Unit: milliseconds
# expr min lq median uq max neval
# f.dt() 110.9715 184.0889 200.0634 265.648 833.4041 100
# f.sqldf() 916.8246 1232.6155 1271.6862 1318.049 1951.5074 100
So we can see that, in this case, data.table
using keys is about 6 times faster than sqldf
using indexes. The actual times will depend on the size of the result-set, so you might want to compare the two options.