
Suppose I have a connection to an external database called con.

I would like to use dplyr to reproduce this query

SELECT var1, var2, var3 from myTable LIMIT 10

I have tried

qry <- tbl(con, "myTable") %>%
    select(var1) %>%
    filter(between(row_number(), 1, 10))

but it does not give the desired result and the query it produces is much slower than the one I am after.

The query it produces is

SELECT "var1",
FROM SELECT "var1", row_number() OVER () AS "zzz25"
FROM SELECT "var1" AS "var1"
FROM myTable "yhydrqlhho") "zsdfunxidf"
WHERE zzz25 BETWEEN 1.0 AND 10.0)

When I send this query to the database it runs for a very long time. When I send

SELECT var1 from myTable limit 10 

the result comes back instantaneously.

Adam Black
  • why would you want to use dplyr instead of using something like RODBC before editing the data using dplyr? – addicted Oct 27 '17 at 02:31
  • 3
    Your SQL query has a problem, because there is no `ORDER BY` clause. Saying `LIMIT 10` without an ordering is meaningless. – Tim Biegeleisen Oct 27 '17 at 02:31
  • 4
    `dplyr` (well, really `dbplyr`) will often generate sub-optimal SQL code b/c it has no full query analysis engine or other real SQL & back-end DB heuristics to do that. Its main goal is to make working with databases just like using normal `dplyr` for internal R data frames. You can actually do `tbl(con, "SQL STATEMENT")` to use an optimized query that returns a table `dplyr` can use (see the sketch after these comments). – hrbrmstr Oct 27 '17 at 02:35
  • Out of curiosity, does swapping the `filter` and `select` statements in your dplyr code lead to a speedup? – Scott Ritchie Oct 27 '17 at 02:47
  • @hrbrmstr Interesting. Thanks for the tip. 98% of the time the SQL that dplyr creates works great but I'm having trouble getting the top 10 rows of a huge table efficiently with dplyr. It seems like this would be easy. – Adam Black Oct 27 '17 at 02:50
  • @ScottRitchie Swapping filter and select produces basically the same query. – Adam Black Oct 27 '17 at 02:54
  • @TimBiegeleisen I guess I never realized I needed ordering since the query always gave me a result I was happy with, namely 10 rows of my table. – Adam Black Oct 27 '17 at 02:56
  • @addicted I much prefer querying large tables in a database using dplyr rather than writing SQL. I just don't have a lot of experience writing SQL and find I can write my queries much more succinctly using dplyr. – Adam Black Oct 27 '17 at 03:00
  • @AdamBlack yes - I'm wondering whether filter %>% select will get first 10 rows, then select appropriate columns, while select %>% filter presumably gets all rows in the DB for selected columns (i.e. very slow) then filters to first 10 rows. – Scott Ritchie Oct 27 '17 at 03:00
  • @MichaelChirico I edited the question to show what dplyr produces. I'm not sure why this is so much slower than using the LIMIT keyword but the execution time difference is substantial. – Adam Black Oct 27 '17 at 03:03
  • @ScottRitchie This is the SQL generated when I do filter then select: SELECT "var1" FROM (SELECT * FROM (SELECT *, row_number() OVER () AS "zzz25" FROM myTable "yhydrqlhho") "zsdfunxidf" WHERE zzz25 BETWEEN 1.0 AND 10.0) – Adam Black Oct 27 '17 at 03:15
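
For reference, here is a minimal sketch of hrbrmstr's suggestion of passing an already-written SQL statement straight to tbl(); it assumes the sql() helper from dplyr/dbplyr and reuses con and myTable from the question:

library(dplyr)

# hand the optimized query to the database as-is; the result is still a lazy tbl
qry <- tbl(con, sql("SELECT var1, var2, var3 FROM myTable LIMIT 10"))
qry %>% collect()   # pull the ten rows into R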

2 Answers


You can try `head(10)`; it generates the correct SQL query on Postgres:

tbl(con, 'my_table') %>% select(var1, var2) %>% head(6) %>% explain()
# here con is a PostgreSQL connection

#<SQL>
#SELECT "var1" AS "var1", "var2" AS "var2"
#FROM "my_table"
#LIMIT 6
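
Applied to the query in the question, the same pattern would presumably be (the column names are taken from the original SELECT):

tbl(con, "myTable") %>%
    select(var1, var2, var3) %>%
    head(10)
# head(10) should translate to LIMIT 10 in the generated SQL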
Psidom

If you're after the actual data from your query, rather than just recreating the SQL query, then specifying collect(n=10) will give the same output as @Psidom's answer.

tbl(con, 'my_table') %>% select(var1, var2) %>% collect(n=10)

# A tibble: 10 x 2
   var1    var2
   <chr>  <dbl>
 1 text1   87.8
 2 text2   99.6
 3 text3  100
 4 text4   91.9
 5 text5   76.8
 6 text6   77.8
 7 text7   77.2
 8 text8   97.2
 9 text9   97.5
10 text10  80.4

Note that the default in collect() is n = 1e+05, so if your data (after filtering) contains more rows, you'll need to specify collect(n = Inf) to retrieve all of it.
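
For example, assuming you do want every row after the database-side work rather than just the first 10, the call would look something like:

tbl(con, 'my_table') %>% select(var1, var2) %>% collect(n = Inf)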

IsoBar