
I am currently writing a function with dplyr to do calculations. The function turns out to be really slow when I input a tbl_sql object, but reasonably fast when I input a data.frame.

An example:

# one million rows; c and d are repeated so all columns have the same length
df = data.frame(
  a = rnorm(1000000),
  b = rnorm(1000000),
  c = rep(1:5, 200000),
  d = rep(1:10, 100000)
)
library(dplyr)
library(MonetDBLite)
library(DBI)
mydb = dbConnect(MonetDBLite())
dbWriteTable(mydb, "df", df, overwrite = TRUE)
dfdb = tbl(mydb, "df")

f = function(d){
  d = d %>% mutate(i = a*b, ii = a/b)
  # group_by(d) and group_by(c) refer to the columns d and c, not the argument
  d2 = d %>% group_by(d) %>% summarise(si = sum(i)) %>% rename(k = d)
  d3 = d %>% group_by(c) %>% summarise(sii = sum(ii)) %>% rename(k = c)
  d4 = inner_join(d2, d3, by = "k")
  print(d4)  # printing forces the query to run for the tbl_sql input
}

system.time(f(df)) takes less than 1 second (data.frame input).

system.time(f(dfdb)) takes around 2 seconds (tbl_sql input).

My actual function is longer and the input data is larger; there the data.frame input turns out to be more than 10 times faster than the tbl_sql input.

Why is that? Can anyone explain?
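
For reference, one way to narrow down where the time goes with the tbl_sql input is to inspect the SQL that dplyr generates and to time the execution separately. This is only a sketch: it assumes the dfdb object defined above, and the names q and si are purely illustrative.

library(dplyr)
# build one of the lazy pipelines from f() without printing it
q = dfdb %>%
  mutate(i = a*b, ii = a/b) %>%
  group_by(d) %>%
  summarise(si = sum(i)) %>%
  rename(k = d)
show_query(q)            # show the translated SQL without executing it
system.time(collect(q))  # time execution in the db plus the transfer back to R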

  • It's likely that it's mainly the import of data from the SQL db that is taking time, or that the SQL translation is inefficient (but I don't think that's the case; see `?show_query`). How slow is it if you print `d3` (a small object) instead? – moodymudskipper Apr 20 '18 at 19:11
  • Probably because df is in memory, whereas dfdb is on disk. – LetEpsilonBeLessThanZero Apr 20 '18 at 20:16
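
One way to test the in-memory point from the comment above (a sketch, reusing dfdb and f from the question) is to pull the table into R once with collect(), then run f on the local copy:

local_df = collect(dfdb)   # one-time transfer from the db into R
system.time(f(local_df))   # now f runs on an ordinary in-memory data frame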

0 Answers