library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

I can create this mock SQL database above. And it's very cool that I can perform standard dplyr functions on this "database":

mtcars2 %>% 
  group_by(cyl) %>% 
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>% 
  arrange(desc(mpg))
#> # Source:     lazy query [?? x 2]
#> # Database:   sqlite 3.29.0 [:memory:]
#> # Ordered by: desc(mpg)
#>     cyl   mpg
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1

It appears I'm unable to use dplyr functions that have no direct SQL equivalents (e.g. dplyr::slice()). In the case of slice() I can use the alternative combination of filter() and row_number() to get the same results. But what happens when there's no such easy workaround?

mtcars2 %>% slice(1:5)
#>Error in UseMethod("slice_") : 
#>  no applicable method for 'slice_' applied to an object of class 
#>  "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"
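For reference, the filter()/row_number() combination mentioned above looks roughly like this (a sketch; dbplyr translates row_number() to a SQL window function, so the backend needs window-function support, e.g. SQLite 3.25+):

```r
# slice(1:5) rewritten with verbs dbplyr can translate;
# row_number() becomes ROW_NUMBER() OVER () in the generated SQL
mtcars2 %>%
  mutate(row_id = row_number()) %>%
  filter(row_id <= 5) %>%
  select(-row_id)
```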

When dplyr functions have no direct SQL equivalents, can I force their use with dbplyr? Or is the only option to get creative with dplyr verbs that do have SQL equivalents, or to write the SQL directly (which is not my preferred solution)?

Display name
  • I don't know if this is preferred either, but could you just pull a larger data frame into memory and then use `dplyr` verbs with no equivalents on the R object instead of the SQL table? – Mike Dec 06 '19 at 17:51
  • @Mike I'm open to that approach. How would that work in practice? Do you want to draft up an answer? – Display name Dec 06 '19 at 18:25
  • You just need to add a `collect() %>%` line before `slice` (or another non-translatable function) to bring it into memory first, if that's how you want to do it. – IceCreamToucan Dec 06 '19 at 18:28
  • @IceCreamToucan Ah, the approach makes sense now. I've used `collect()` before but am concerned it could be an issue when data frames are too large. A good suggestion nonetheless. – Display name Dec 06 '19 at 18:48
  • I think my suggestion was going to be similar: `mt4 <- RSQLite::dbReadTable(conn = con, "mtcars") %>% slice(1:5)`. If memory will be an issue, I would consider a hybrid approach where you select and filter rows in SQL, then pull in the smaller subset using `collect`. – Mike Dec 06 '19 at 19:13
  • What function besides slice is causing you issues? – Mike Dec 06 '19 at 19:14
  • What I ultimately want to do is `View()` SQL database tables at a quick glance in RStudio. After I retrieve a table with `tbl(con, "TABLE_NAME")` I can see the standard `tibble()` output: the first ten rows and the first several columns, with the trailing columns listed in sequence. It often helps _for me_ to see this in spreadsheet view with `... %>% View()`, but that doesn't work with dbplyr because the row count is listed as `??`. I wanted to use `head()` or `slice()` to quickly look at, say, the first 100 rows of a SQL table, but things don't seem to work this way in the dbplyr world. – Display name Dec 06 '19 at 19:46
  • Got ya. This works for me if you are trying to get a `View` of your data: `View(mtcars2 %>% do(head(., 5)))` – Mike Dec 06 '19 at 20:40
  • @Mike I just tried that on my SQL table that's 250K rows by 200 columns, and my computer's been chugging along for 15 minutes with no response. That seems like a good answer but just isn't working. – Display name Dec 06 '19 at 21:48
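Putting the comment suggestions together, a minimal sketch of the `View()` use case that keeps the heavy lifting on the database side (the row limit of 100 here is just illustrative):

```r
# head() on a lazy table translates to LIMIT in the generated SQL,
# so only 100 rows ever leave the database before View() runs locally
mtcars2 %>%
  head(100) %>%
  collect() %>%
  View()
```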

1 Answer


I understood this question as: how can I make slice() work for SQL databases? That's different from "forcing their use", but it still might work in your case.

The example below shows how to implement a "poor man's" variant of slice() that works on the database. We still need to do the legwork and implement it with verbs that work on the database, but then we can use it similarly to data frames.

Read more about S3 classes in http://adv-r.had.co.nz/OO-essentials.html#s3.

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

# mtcars2 has a class attribute
class(mtcars2)
#> [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
#> [4] "tbl_lazy"             "tbl"

# slice() is an S3 method
slice
#> function(.data, ..., .preserve = FALSE) {
#>   UseMethod("slice")
#> }
#> <bytecode: 0x560a03460548>
#> <environment: namespace:dplyr>

# we can implement a "poor man's" variant of slice()
# for the particular class. (It doesn't work quite the same
# in all cases.)
#' @export
slice.tbl_sql <- function(.data, ...) {
  rows <- c(...)

  .data %>%
    mutate(...row_id = row_number()) %>%
    filter(...row_id %in% !!rows) %>%
    select(-...row_id)
}

mtcars2 %>%
  slice(1:5)
#> # Source:   lazy query [?? x 11]
#> # Database: sqlite 3.29.0 [:memory:]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2

Created on 2019-12-07 by the reprex package (v0.3.0)

krlmlr
  • Cool solution. The problem is that it creates a new row_id column under the hood, which can add overhead. Is there a workaround for that? – GitHunter0 Apr 22 '21 at 03:17
  • Can you quantify the overhead? The alternative would be to collect `max(rows)` rows and do the slice locally, which could be faster in some cases. – krlmlr Apr 22 '21 at 11:47
  • I did not quantify the overhead, but it actually doesn't seem like a great deal; still better than doing it locally in most cases. Thanks for the feedback. – GitHunter0 Apr 23 '21 at 02:10
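The collect-`max(rows)` alternative mentioned in the comments could be sketched like this (`slice_via_collect` is a hypothetical helper name, not part of dbplyr, and it only handles positive row indices):

```r
# Hypothetical alternative to slice.tbl_sql(): fetch only the first
# max(rows) rows via LIMIT, then slice locally in R. This avoids
# computing a row id on the database side.
slice_via_collect <- function(.data, rows) {
  .data %>%
    head(max(rows)) %>%   # becomes LIMIT max(rows) in the SQL
    collect() %>%
    dplyr::slice(rows)    # regular data-frame slice() after collecting
}

mtcars2 %>% slice_via_collect(1:5)
```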