5

How can I use duckdb's setseed() function (see reference doc) with dplyr syntax to make sure the analysis below is reproducible?

# dplyr version 1.1.1
# arrow version 11.0.0.3
# duckdb 0.7.1.1
out_dir <- tempfile()
arrow::write_dataset(mtcars, out_dir, partitioning = "cyl")

mtcars_ds <- arrow::open_dataset(out_dir)

mtcars_smry <- mtcars_ds |>
  arrow::to_duckdb() |>
  dplyr::mutate(
    fold = ceiling(3 * random())
  ) |>
  dplyr::summarize(
    avg_hp = mean(hp),
    .by = c(cyl, fold)
  )

mtcars_smry |>
  dplyr::collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 9 × 3
#>     cyl  fold avg_hp
#>   <int> <dbl>  <dbl>
#> 1     4     1   92  
#> 2     4     3   82.3
#> 3     4     2   74.5
#> 4     8     2  183. 
#> 5     8     3  210  
#> 6     8     1  300. 
#> 7     6     3  110  
#> 8     6     1  117  
#> 9     6     2  175

Created on 2023-08-27 with reprex v2.0.2

Ashirwad
  • 1,890
  • 1
  • 12
  • 14
  • 1
    Since I have been able to get this to work reliably with `DBI`-based connections but not (yet) after `arrow::to_duckdb()`, I suggest we need the [tag:arrow] and [tag:duckdb] tags, would you agree? – r2evans Aug 28 '23 at 12:12
  • I wonder if my inability to do with with `to_duckdb()` suggests that there is some reconnection and/or re-seeding going on in the background. No combination of `to_duckdb(con=, auto_disconnect=)` remedies the problem, but my more direct SQL-based answer suggests that the duckdb `setseed` is working conceptually, so something in the `to_duckdb` chain must be upsetting the flow. Thoughts? – r2evans Aug 28 '23 at 12:13
  • Found [this article](https://www.christophenicault.com/post/large_dataframe_arrow_duckdb/) when I was searching for an answer. You can couple your approach with the ideas presented in the linked article and hopefully it should work! – Ashirwad Aug 28 '23 at 12:37
  • I don't see anything in that article that suggests internals of the duckdb/arrow interaction that answers the problem: why is the seed honored when using `DBI` but not after `to_duckdb`? – r2evans Aug 28 '23 at 14:48
  • I think it's a matter of disposed connections and such. For instance, if I do `mutate(fold = ceiling(3 * coalesce(setseed(0.5), random())))` with my `mtcars_tbl` below, the first return value is different, but the second _and all subsequent_ (assuming the same query repeated over and over with no other queries in between) are all the same. I think the `setseed(.)` is not used in the current/immediate query but in the _next_ query of the current connection. I can't get any combination of `con=` or `auto_disconnect=` to not look like it's re-initiating the connection (and ergo the seed) – r2evans Aug 28 '23 at 14:51
  • 1
    @r2evans, do you think it would be worthwhile to open an issue on arrow's GitHub repo? – Ashirwad Aug 28 '23 at 15:31
  • 1
    Yes I do, where they will have much more awareness of the internals to know if this is a bug or how to explain what is flawed in our assumption of this. If it is resolved there, _please_ come back and either comment or provide another answer. – r2evans Aug 28 '23 at 15:57

1 Answers1

4

setseed() needs to be within a query; while it doesn't necessarily make full sense in one way (since it returns null/NA), it's at least clear. We can include it in its own "query".

A quick helper function, for convenience:

use_setseed <- function(tab, seed = 0.5) {
  ign <- tab |>
    summarize(a = setseed(seed)) |>
    head(n = 1) |>
    collect()
  invisible(NULL)
}

An important note about this is that it must "realize" the query (typically collect() it) in order for the setseed() call to be actually executed. Since we need to realize it, but we don't need any of its data, I reduce the data passed back by "summarizing" (one row, one column), then collecting it, then discarding it with invisible().

Also, this is working on a more-direct connection,

duck <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(duck, "mtcars", mtcars)
mtcars_tbl <- tbl(duck, "mtcars")

From here, we just need to call use_setseed immediately before our randomizing query.

use_setseed(mtcars_tbl)
mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |>
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold) )
# # Source:   SQL [9 x 3]
# # Database: DuckDB 0.8.1 [r2@Linux 6.2.0-27-generic:R 4.2.3/:memory:]
#     cyl  fold avg_hp
#   <dbl> <dbl>  <dbl>
# 1     6     1  110  
# 2     4     1   83.5
# 3     8     3  210  
# 4     6     3  114  
# 5     4     3   79.7
# 6     6     2  149  
# 7     8     1  174  
# 8     8     2  252. 
# 9     4     2   97  

# validation
use_setseed(mtcars_tbl)
res1 <- mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |> 
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold) ) |>
  dplyr::collect()
resn <- replicate(10, {
  use_setseed(mtcars_tbl)
  mtcars_tbl |>
    dplyr::mutate(fold = ceiling(3 * random())) |> 
    dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold) ) |> 
    dplyr::collect()
}, simplify=FALSE)
sapply(resn, identical, res1)
#  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Equivalently, if you have a duckdb connection object, we can reduce the bandwidth a little more by using this version of the function:

use_setseed2 <- function(con, seed=0.5) {
  DBI::dbExecute(con, "select setseed(?) as ign", params = list(seed))
  invisible(NULL)
}

And calling it with the duckdb-connection object, as in

use_setseed2(duck) # note 'duck' and not 'mtcars_tbl'
mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |> 
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold) )
# same as above

# validation
use_setseed2(duck)
res1 <- mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |> 
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold) ) |>
  dplyr::collect()
resn <- replicate(10, {
  use_setseed2(duck)
  mtcars_tbl |>
    dplyr::mutate(fold = ceiling(3 * random())) |> 
    dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold) ) |> 
    dplyr::collect()
}, simplify=FALSE)
sapply(resn, identical, res1)
#  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
r2evans
  • 141,215
  • 6
  • 77
  • 149