1

I am sending a simple query to a MySQL database using dbplyr:

library(dbplyr)

my_conn<-dbConnect(...)
tab<-tbl(my_conn, "myTable")

tab %>% select(id, date, type) %>%
    filter(type = 'foobar')

However, if I inspect the generated SQL with show_query(), I get this:

SELECT *
FROM (SELECT `id`, `date`, `type`
FROM `myTable`) `q01`
WHERE (`type` == 'foobar')

This query is very slow to execute.

If I instead execute the following SQL command on the server:

SELECT id, date, type FROM db.myTable WHERE type = 'foobar'

then the return is nearly instantaneous.

My question is: why is dbplyr doing SELECT * (i.e. select all) and then doing a nested select in line 2? Also, why has "q01" appeared in this query? Could this be why my query is very slow to run compared to executing the minimal command directly on the server? I would perhaps expect dbplyr to create inefficient SQL queries as the complexity of the operations increases but--in this case at least--I can't really write a more succinct set of operations. It's just a select and filter (SELECT and WHERE).

Simon
  • 991
  • 8
  • 30

2 Answers2

4

dbplyr is generating the SQL query as I would expect. What it has done is one query inside another:

SELECT id, date, type FROM myTable

Is a subquery in the super query

SELECT *
FROM (
   subquery
) q01
WHERE type = foobar

The q01 is the name given to the subquery. In the same way as the AS keyword. For example: FROM very_long_table_name AS VLTN.

Yes, this nesting is ugly. But many SQL engines have a query optimizer that calculates the best way to execute a query. On SQL Server, I have noticed little difference in performance because the query optimizer finds a faster way to execute than as written.

However, it appears that for MySQL, nested queries are known to result in slower performance. See here, here, and here.

One thing that might solve this is changing the order of the select and filter commands in R:

tab %>%
  filter(type = 'foobar') %>%
  select(id, date, type)

Will probably produce the translated query:

SELECT `id`, `date`, `type`
FROM `myTable`
WHERE (`type` == 'foobar')

Which will perform better.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
-1

I think you want to string it all together. Once you do tab <- tbl(my_conn, "myTable"), you've downloaded the whole table.

tbl(my_conn, "myTable") %>% 
  select(id, date, type) %>%
  filter(type = 'foobar')
Brian Montgomery
  • 2,244
  • 5
  • 15
  • `tab <- tbl(my_conn, "myTable")` does not download the whole table. It provides a connection to a remote database table. Hence, why dplyr commands are being translated to SQL: so they can run on the database. Calling `collect()` copies the data from the remote database into local R memory. But be caseful doing this as the database table may not fit in RAM. – Simon.S.A. Nov 11 '21 at 20:00
  • 1
    Thanks @Simon.S.A, I was definitely wrong about this. I'll leave this here so others don't fall into the same trap. It looks like the table `tab` is there. But it's a "lazy" table that doesn't bring down all the data until you request it. – Brian Montgomery Nov 11 '21 at 22:52
  • You're very welcome. Glad to hear you have new understanding from this. It is a common point of confusion. I support leaving the answer here hopefully it reduces confusion for those who come after us. – Simon.S.A. Nov 12 '21 at 02:08