2

Calling the dplyr::arrange() on a table in a remote source adds an 'Ordered by: ...' flag. Is there a subsequent function that removes this 'Ordered by:' flag on the remote table?

Consider example data:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

For which:

glimpse(tmp_cars_sdf)
# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...

Consider:

tmp_cars <-
    cars
tmp_cars <-
    tmp_cars %>%
    arrange(speed, dist)
glimpse(tmp_cars)

# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

However:

tmp_cars <-
    tmp_cars_sdf %>%
    arrange(speed, dist)
glimpse(tmp_cars)

# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3 
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
Alex
  • 15,186
  • 15
  • 73
  • 127
  • 2
    I think the statement is there because it hasn't actually ordered all the data yet. (Notice how `observations: ??`). It's trying to be as lazy as possible before the final query to collect the data. Not sure if that was exactly what you were asking though.. – heds1 Aug 07 '19 at 03:27
  • 1
    There's no inherent order to a SQL database; that's just telling you about the cached query that's been built so far. You could `arrange` by something else, but not having an `ORDER BY` clause just means your results may come back in any particular order. You'll get the same rows regardless of order whenever you call `collect()`. – alistaire Aug 07 '19 at 03:31
  • @heds1 Yes, I understand that the computation is delayed as much as possible. However, I want to use the data from the ordered table later without any ordering on it. The backstory is this: https://github.com/tidyverse/dplyr/issues/4522. – Alex Aug 07 '19 at 03:38
  • Essentially I am looking for some sort of `unarrange()` function, similar to how one would stop using `rowwise()`: https://stackoverflow.com/questions/29762393/how-does-one-stop-using-rowwise-in-dplyr – Alex Aug 07 '19 at 03:45

2 Answers2

2

dbplyr tends to nest subqueries with the addition of commands. Hence an earlier arrange is likely to turn up in subqueries as you add further commands. This appears to be the underlying problem.

One option for removing these is to render and edit the underlying SQL query directly. Perhaps something like the following:

unarrange = function(table, cols_prev_ordered_by){

  db_connection = table$src$con

  order_text = paste0("ORDER BY \"",
                      paste0(cols_prev_ordered_by, collapse = \", \""),
                      "\"")

  query_text = table %>% sql_render() %>% as.character()
  new_query_text = gsub(order_text, "", query_text)

  sql_query = build_sql(con = db_connection, new_query_text)
  return(tbl(db_connection, sql(sql_query)))
}

# example:
tmp_cars <-
    tmp_cars_sdf %>%
    arrange(speed, dist)
    unarrange(c("speed", "dist"))

There are certainly more robust approaches for identifying and removing the order-by part of the query than gsub. If this is important you might want to look at ?select_query as it has an explicit order_by argument.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • Thanks. Can I infer from your answer that a remote table is only represented by its underlying SQL? – Alex Aug 08 '19 at 01:23
  • 1
    Yes. I think of, and treat, remote tables as if they are just the SQL query that makes them. There may be more to it, but I have never encountered any situation where this was the case. – Simon.S.A. Aug 08 '19 at 09:17
1

Inspired by Simon's answer and comments on the OP, the following function is a work-around that removes all ordering (but preserves any new column computed as a result of the ordering). It may not be the most efficient or low-level/direct way of doing this, which I will come back to at the end of this answer, but I will let the dbplyr team resolve my issue if they see fit to do so.

unarrange <-
    function(remote_df) {

     existing_groups <- groups(remote_df)


        remote_df <-
            remote_df %>%
            compute()

         remote_df <-
            tbl(remote_df$src$con, 
                sql_render(remote_df))


         remote_df <-
             group_by(remote_df, !!!existing_groups)



         return(remote_df)

    }

Why it works

With the input data:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

Consider

str(tmp_cars_sdf)
# ..$ con <truncated>
# ..$ disco <truncated>
# $ ops:List of 2
# ..$ x   : 'ident' chr "tmp_cars_sdf"
# ..$ vars: chr [1:2] "speed" "dist"
# ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

vs

tmp_cars_sdf <-
    tmp_cars_sdf %>%
    arrange(speed, dist)

str(tmp_cars_sdf)
# $ ops:List of 4
# ..$ name: chr "arrange"
# ..$ x   :List of 2
# .. ..$ x   : 'ident' chr "tmp_cars_sdf"
# .. ..$ vars: chr [1:2] "speed" "dist"
# .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# ..$ dots:List of 2
# .. ..$ : language ~speed
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260> 
#     .. ..$ : language ~dist
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260> 
#     ..$ args:List of 1
# .. ..$ .by_group: logi FALSE
# ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

It is clear that adding ordering via arrange actually modifies the structure of the R object, due to the fact that remote tables have no way of being intrinsically ordered (or grouped), order and grouping info must be stored locally and is only transmitted when the final query is built.

Thus, the work-around uses three tricks: first, produce a temporary table using compute(). Note that doing this does not reset groups and ordering locally. Second, use Simon's trick to pull the simple select query corresponding to this new table, and overwrite the existing table structure so that all grouping and ordering info is lost. In order to preserve groups, the function re-adds the original groups to this table.

Why is this useful?

While the example provided in the OP serves to show the problem, the reason why it arose is because of mutates that depend on some (grouped) ordering on the table. Once the new columns are built, the old ordering is no longer necessary and is in fact sometimes a hindrance due to the linked issue on github. Such an example is given below:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)


cars_df <-
    cars %>%
    arrange(speed, dist) %>%
    group_by(speed) %>%
    mutate(diff_dist_up = dist - lag(dist)) %>%
    arrange(speed, desc(dist)) %>%
    mutate(diff_dist_down = dist - lag(dist)) %>%
    ungroup() %>%
    arrange(speed, dist) %>%
    data.frame()

So that:

head(cars_df)
# speed dist diff_dist_up diff_dist_down
# 1     4    2           NA             -8
# 2     4   10            8             NA
# 3     7    4           NA            -18
# 4     7   22           18             NA
# 5     8   16           NA             NA
# 6     9   10           NA             NA

With the new function we can replicate this remotely:

cars_df_2 <-
    tmp_cars_sdf %>%
    arrange(speed, dist) %>%
    group_by(speed) %>%
    mutate(diff_dist_up = dist - lag(dist)) %>%
    # unfortunately the next line is needed
    # because of https://github.com/tidyverse/dbplyr/issues/345
    unarrange() %>%
    arrange(speed, desc(dist)) %>%
    mutate(diff_dist_down = dist - lag(dist)) %>%
    ungroup() %>%
    unarrange() %>%
    collect() %>%
    arrange(speed, dist) %>%
    data.frame()

And checking, we see:

identical(cars_df, cars_df_2)
# [1] TRUE

Possible issues with this fix

The first issue is the necessity to call compute() which uses resources. The second issue is that it must be possible to modify the structure of the R object that encodes the remote table, but I don't know how queries are built from this structure so am unable to do it.

Alex
  • 15,186
  • 15
  • 73
  • 127