1

How to fast search inside a data.table using sqldf?

I need a function that returns a value of a column data.table based on two other column values:

require(data.table)

dt <- data.table(
    "base" = c("of", "of", "of", "lead and background vocals", "save thou me from", "silent in the face"),
    "prediction" = c("the", "set", "course", "from", "the", "of"),
    "count" = c(258586, 246646, 137533, 4, 4, 4)
)

> dt
#                         base prediction  count
#1:                         of        the 258586
#2:                         of        set 246646
#3:                         of     course 137533
#4: lead and background vocals       from      4
#5:          save thou me from        the      4
#6:         silent in the face         of      4

# the function needs to return the "prediction" value based on the max "count" value for the input "base" value.
# giving the input "of" to function:
> prediction("of")
# the desired output is:
> "the"
# or:
> prediction("save thou me from")
> "the"

The solution provided here How extract values of a data.table based on multiple conditions? works on small datasets, but not on very large data.table (57M obs):

f1 <- function(val) dt[base == val, prediction[which.max(count)]]

I read that indexing the data.table and searching with sqldf functions could speed it up, but don't know how to do this yet.

Thanks for advance.

Tung
  • 26,371
  • 7
  • 91
  • 115
  • can't use just use the sqldf package if you want to use sql – MatthewR Oct 14 '19 at 22:50
  • 2
    Try running `setkey(dt, base)`, then instead of `dt[base == val, ...]` use `dt[val, ...]` and see if it's faster. – IceCreamToucan Oct 14 '19 at 23:41
  • if u are running this repeatedly, you might want to run and store `keydt <- dt[, prediction[which.max(count)], keyby="base"]` for faster indexing – chinsoon12 Oct 15 '19 at 01:25
  • B. Christian Kamgang answer is pretty good one. If possible I suggest first to gather all values that you want to query later on using single subset on many values into smaller set and then continue to subset single elements from that one. – jangorecki Oct 15 '19 at 12:53

2 Answers2

4

You may consider using only data.table as follows. I think it could significantly improve speed.

dt <- data.table(
"base" = c("of", "of", "of", "lead and background vocals", "save thou me from", 
"silent in the face"),
"prediction" = c("the", "set", "course", "from", "the", "of"),
"count" = c(258586, 246646, 137533, 4, 4, 4)
)

# set the key on both base and count.
# This rearranges the data such that the max value of count for each group in base 
# corresponds to the last row.
setDT(dt, key = c("base", "count"))

# for a given group in base, we consider only the last value of prediction as it is 
# on the same row with the max value of count. 
prediction <- function(x) {
  dt[.(x), prediction[.N] ]
}

prediction("of")
#"the"
prediction("save thou me from")
#"the"
jangorecki
  • 16,384
  • 4
  • 79
  • 160
2

Using sqldf it would be like this. Add dbname = tempfile() argument if you can't fit it in memory.

library(sqldf)

val <- "of"
fn$sqldf("select max(count) count, prediction from dt where base = '$val'")
##   count prediction
##1 258586        the

Alternately, to set up a database directly using RSQLite and create an index:

library(gsubfn)
library(RSQLite)

con <- dbConnect(SQLite(), "dt.db")
dbWriteTable(con, "dt", dt)
dbExecute(con, "create index idx on dt(base)")

val <- "of"
fn$dbGetQuery(con, "select max(count) count, prediction from dt where base = '$val'")
##    count prediction
## 1 258586        the

dbDisconnect(con)

Note

Run this first:

library(data.table)

dt <- data.table(
    "base" = c("of", "of", "of", "lead and background vocals", 
     "save thou me from", "silent in the face"),
    "prediction" = c("the", "set", "course", "from", "the", "of"),
    "count" = c(258586, 246646, 137533, 4, 4, 4)
)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Hi Grothendieck, I've got an error: > con <- dbConnect(SQLite(), "db.Rdata") Warning message: Couldn't set synchronous mode: file is not a database Use `synchronous` = NULL to turn off this warning. > dbWriteTable(con, "db.Rdata", dt) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"SQLiteConnection", "character", "function"’ – Danilo Correa Oct 15 '19 at 00:01
  • STart a fresh session of R and make sure the current directory is writeable and that dt.db does not exist. Then copy and paste the code in the Note into R and then copy and paste the code in the body after Alternately into R. – G. Grothendieck Oct 15 '19 at 00:09