
Suppose I have a connection to an external database called con.

I would like to use dplyr to reproduce this query

SELECT var1, var2, var3 from myTable LIMIT 10

I have tried

qry <- tbl(con, "myTable") %>%
    select(var1) %>%
    filter(between(row_number(), 1, 10))

but it does not give the desired result and the query it produces is much slower than the one I am after.

The query it produces is

SELECT "var1",
FROM SELECT "var1", row_number() OVER () AS "zzz25"
FROM SELECT "var1" AS "var1"
FROM myTable "yhydrqlhho") "zsdfunxidf"
WHERE zzz25 BETWEEN 1.0 AND 10.0)

When I send this query to the database it runs for a very long time. When I send

SELECT var1 from myTable limit 10 

the result comes back instantaneously.

Adam Black
  • why would you want to use dplyr instead of using something like RODBC before editing the data using dplyr? – addicted Oct 27 '17 at 02:31
  • 3
    Your SQL query has a problem, because there is no `ORDER BY` clause. Saying `LIMIT 10` without an ordering is meaningless. – Tim Biegeleisen Oct 27 '17 at 02:31
  • 4
    `dplyr` (well, really `dbplyr`) will often generate sub-optimal SQL code b/c it has no full query analysis engine or other real SQL & back-end DB heuristics to do that. Its main goal is to make working with databases just like using normal `dplyr` for internal R data frames. You can actually do `tbl(con, "SQL STATEMENT")` to use an optimized query that returns a table `dplyr` can use (see the sketch after these comments). – hrbrmstr Oct 27 '17 at 02:35
  • Out of curiosity, does swapping the `filter` and `select` statements in your dplyr code lead to a speedup? – Scott Ritchie Oct 27 '17 at 02:47
  • @hrbrmstr Interesting. Thanks for the tip. 98% of the time the SQL that dplyr creates works great but I'm having trouble getting the top 10 rows of a huge table efficiently with dplyr. It seems like this would be easy. – Adam Black Oct 27 '17 at 02:50
  • @ScottRitchie Swapping filter and select produces basically the same query. – Adam Black Oct 27 '17 at 02:54
  • @TimBiegeleisen I guess I never realized I needed ordering since the query always gave me a result I was happy with, namely 10 rows of my table. – Adam Black Oct 27 '17 at 02:56
  • @addicted I much prefer querying large tables in a database using dplyr rather than writing SQL. I just don't have a lot of experience writing SQL and find I can write my queries much more succinctly using dplyr. – Adam Black Oct 27 '17 at 03:00
  • @AdamBlack yes - I'm wondering whether filter %>% select will get first 10 rows, then select appropriate columns, while select %>% filter presumably gets all rows in the DB for selected columns (i.e. very slow) then filters to first 10 rows. – Scott Ritchie Oct 27 '17 at 03:00
  • @MichaelChirico I edited the question to show what dplyr produces. I'm not sure why this is so much slower than using the LIMIT keyword but the execution time difference is substantial. – Adam Black Oct 27 '17 at 03:03
  • @ScottRitchie This is the SQL generated when I do filter then select: SELECT "var1" FROM (SELECT * FROM (SELECT *, row_number() OVER () AS "zzz25" FROM myTable "yhydrqlhho") "zsdfunxidf" WHERE zzz25 BETWEEN 1.0 AND 10.0) – Adam Black Oct 27 '17 at 03:15
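
For reference, here is a minimal sketch of hrbrmstr's suggestion of passing an already-written SQL statement straight to tbl(); it assumes the sql() helper from dplyr/dbplyr and reuses con and myTable from the question:

library(dplyr)

# hand the optimized query to the database as-is; the result is still a lazy tbl
qry <- tbl(con, sql("SELECT var1, var2, var3 FROM myTable LIMIT 10"))
qry %>% collect()   # pull the ten rows into R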

2 Answers


You can try `head(10)`; it generates the correct SQL query on Postgres:

tbl(con, 'my_table') %>% select(var1, var2) %>% head(6) %>% explain()
# here con is a PostgreSQL connection

#<SQL>
#SELECT "var1" AS "var1", "var2" AS "var2"
#FROM "my_table"
#LIMIT 6
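
Applied to the query in the question, the same pattern would presumably be (the column names are taken from the original SELECT):

tbl(con, "myTable") %>%
    select(var1, var2, var3) %>%
    head(10)
# head(10) should translate to LIMIT 10 in the generated SQL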
Psidom

If you're after the actual data from your query, rather than just recreating the SQL query, then specifying collect(n=10) will give the same output as @Psidom's answer.

tbl(con, 'my_table') %>% select(var1, var2) %>% collect(n=10)

# A tibble: 10 x 2
   var1    var2
   <chr>  <dbl>
 1 text1   87.8
 2 text2   99.6
 3 text3  100
 4 text4   91.9
 5 text5   76.8
 6 text6   77.8
 7 text7   77.2
 8 text8   97.2
 9 text9   97.5
10 text10  80.4

Note that the default in collect() is n = 1e+05, so if your data (after filtering) contains more rows, you'll need to specify collect(n = Inf) to retrieve all of it.
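
For example, assuming you do want every row after the database-side work rather than just the first 10, the call would look something like:

tbl(con, 'my_table') %>% select(var1, var2) %>% collect(n = Inf)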

IsoBar