
I have a program in R that I have been running for about a day now, and it has only reached about 10 percent completion. The main source of slowness is the thousands of `sqldf(SELECT ...)` calls it makes against a data set of length ~1 million, using the R package sqldf. My select statements currently take the following form:

sqldf(SELECT V1, V2 FROM mytable WHERE cast(start as real) <= sometime and cast(realized as real) > sometime)

sometime is just some integer representing a Unix timestamp, and start and realized are columns of mytable that are also filled with Unix timestamp entries. What I additionally know, however, is that |realized - start| < 172800 always, which is quite a small window given that the data set spans over a year. My thought is that I should be able to exploit this fact to tell R to check only the part of the data frame within time ± 172800 in each of these calls.

Is the package sqldf inappropriate to use here? Should I be using a traditional [,] traversal of the data.frame? Is there an easy way to incorporate this fact to speed up the program? My gut feeling is to break up the data frame, sort the vectors, and then build custom functions that traverse and select the appropriate entries themselves, but I'm looking for some affirmation that this is the best way.
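For concreteness, the kind of windowed traversal I have in mind is roughly the following (an untested sketch, with a toy stand-in for mytable; the idea is just to sort once on start and binary-search with `findInterval()` so each call scans only the ± 172800 window):

```r
# Toy stand-in for mytable; the real one has ~1 million rows of Unix timestamps.
mytable <- data.frame(start    = c(100, 200, 300, 400),
                      realized = c(150, 250, 350, 450),
                      V1 = 1:4, V2 = 5:8)

# Sort once on start, then use findInterval() (a binary search) per call.
# Since realized - start < 172800, any qualifying row must have
# start in (sometime - 172800, sometime].
tab <- mytable[order(mytable$start), ]
window_select <- function(sometime) {
  lo <- findInterval(sometime - 172800, tab$start) + 1  # first candidate row
  hi <- findInterval(sometime, tab$start)               # last row with start <= sometime
  if (hi < lo) return(tab[0, c("V1", "V2")])
  cand <- tab[lo:hi, , drop = FALSE]
  cand[cand$realized > sometime, c("V1", "V2")]
}

window_select(220)  # rows with start <= 220 and realized > 220
```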

    You may want to consider using the `data.table` package which allows you to build true indices for your tables so searching is much quicker. – MrFlick Jul 08 '14 at 22:07
  • Beyond very general advice like the above, there really isn't much anyone can do to help since we aren't sitting at your computer looking at _all_ your code. – joran Jul 08 '14 at 22:16
  • I see what you mean. I'm more interested in any information on how sqldf search works and if there is a way to incorporate a condition like the one I have into speeding up searches. It seems like there should be a way to tell R to only search from specific indices; I'm looking into the data.table package now. – mt88 Jul 08 '14 at 22:30
  • 1
    Try to get rid of the cast so that indexes can work and add indexes. There are examples on the sqldf home page. Otherwise, try data.table. – G. Grothendieck Jul 08 '14 at 22:35
  • `data.table` does NOT "build true indices". See [this question/answer](http://stackoverflow.com/questions/20076511/data-table-does-setkey-create-an-index-or-physically-reorder-the-rows-in). It is faster though. – jlhoward Jul 08 '14 at 23:54

1 Answer


First, the slow part is probably cast(...), so rather than doing that twice for each record, in each query, why not leave start and realized as timestamps and change the query to accommodate that?
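(As an aside, if you keep start and realized as plain numeric Unix timestamps, the query needs no cast at all. A minimal sketch, with a toy stand-in for mytable and sometime:)

```r
library(sqldf)

# Toy stand-in: numeric Unix timestamps, so no cast(... as real) is needed.
mytable <- data.frame(start = c(100, 200, 300), realized = c(150, 250, 350),
                      V1 = 1:3, V2 = 4:6)
sometime <- 220

result <- sqldf(sprintf(
  "SELECT V1, V2 FROM mytable WHERE start <= %d AND realized > %d",
  sometime, sometime))
result  # the single row with start <= 220 and realized > 220
```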

Second, the data.table option is still about 100 times faster (but see the bit at the end about indexing with sqldf).

library(sqldf)
library(data.table)

N <- 1e6
# sqldf option
set.seed(1)
df <- data.frame(start=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
                 realized=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
                 V1=rnorm(N), V2=rpois(N,4))

sometime <- "2000-01-05 00:00:00"
query <- "SELECT V1, V2 FROM df WHERE start <= datetime('%s') and realized > datetime('%s')"
query <- sprintf(query,sometime,sometime)          
system.time(result.sqldf <- sqldf(query))
#    user  system elapsed 
#   12.17    0.03   12.23 

# data.table option
set.seed(1)
DT <- data.table(start=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
                 realized=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
                 V1=rnorm(N), V2=rpois(N,4))
setkey(DT,start,realized)
system.time(result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)])
#    user  system elapsed 
#    0.15    0.00    0.15 

Note that the two result-sets will be sorted differently.
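A quick way to confirm the two contain the same rows is to sort both before comparing; a minimal sketch (shown with small stand-ins for the two result objects):

```r
# Small stand-ins for result.sqldf and result.dt: same rows, different order.
result.sqldf <- data.frame(V1 = c(2.5, 1.1), V2 = c(20L, 10L))
result.dt    <- data.frame(V1 = c(1.1, 2.5), V2 = c(10L, 20L))

# Sort both result-sets the same way, then compare row by row.
same_rows <- function(a, b) {
  a <- a[do.call(order, a), , drop = FALSE]; rownames(a) <- NULL
  b <- b[do.call(order, b), , drop = FALSE]; rownames(b) <- NULL
  isTRUE(all.equal(a, b, check.attributes = FALSE))
}
same_rows(result.sqldf, result.dt)  # TRUE
```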

EDIT Based on comments below from @G.Grothendieck (author of the sqldf package).

This is turning into a really good comparison of the packages...

# code from G. Grothendieck comment
sqldf()      # opens connection
sqldf("create index ix on df(start, realized)")
query <- fn$identity("SELECT V1, V2 FROM main.df WHERE start <= '$sometime' and realized > '$sometime'")
system.time(result.sqldf <- sqldf(query))
sqldf()      # closes connection
#    user  system elapsed 
#    1.28    0.00    1.28 

So creating an index speeds up sqldf by about a factor of 10 in this case. Index creation is slow, but you only have to do it once. "Key" creation in data.table (which physically sorts the table) is extremely fast, but does not improve performance all that much in this case (only about a factor of 2).

Benchmarking using system.time() is a bit risky (1 data point), so it's better to use microbenchmark(...). Note that for this to work, we have to run the code above and leave the connection open (i.e., remove the last call to sqldf()).

f.dt    <- function() result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)]
f.sqldf <- function() result.sqldf <- sqldf(query)
library(microbenchmark)
microbenchmark(f.dt(),f.sqldf())
# Unit: milliseconds
#       expr      min        lq    median       uq       max neval
#     f.dt() 110.9715  184.0889  200.0634  265.648  833.4041   100
#  f.sqldf() 916.8246 1232.6155 1271.6862 1318.049 1951.5074   100

So we can see that, in this case, data.table using keys is about 6 times faster than sqldf using indexes. The actual times will depend on the size of the result-set, so you might want to compare the two options.

jlhoward
  • data.table may be faster, but I doubt it's that much faster. The benchmark is badly biased in favor of data.table because (1) it uses indexes for data.table and not sqldf, (2) it excludes the time to create the index, and possibly (3) due to the order the two are run. If I create an index on sqldf and, to be comparable, exclude the time for that from the timing, then data.table takes 0.05 sec and sqldf takes 0.11 seconds. – G. Grothendieck Jul 09 '14 at 03:19
  • 3
    Here is the code with some improvements: `sqldf(); sqldf("create index ix on df(start, realized)"); query <- fn$identity("SELECT V1, V2 FROM main.df WHERE start <= '$sometime' and realized > '$sometime'"); system.time(result.sqldf <- sqldf(query)); sqldf()` – G. Grothendieck Jul 09 '14 at 03:21
  • @G.Grothendieck Your point (1) is well taken. See the edits above. – jlhoward Jul 09 '14 at 18:23
  • Oh man, data.table worked so well. My program just completed in 20 minutes. Thank you very much. I'll remember to take care when casting and calling sqldf in the future. I wasn't aware of how expensive these two calls could be. – mt88 Jul 09 '14 at 20:27