setseed() needs to be within a query; while it doesn't necessarily make full sense in one way (since it returns null/NA), it's at least clear. We can include it in its own "query".
A quick helper function, for convenience:
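use_setseed <- function(tab, seed = 0.5) {
  # Build a tiny query that calls setseed(), then collect() it so it actually runs
  ign <- tab |>
    dplyr::summarize(a = setseed(seed)) |>
    head(n = 1) |>
    dplyr::collect()
  # The collected value is not needed, so return nothing visibly
  invisible(NULL)
}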
An important note: the function must "realize" the query (typically with collect()) in order for the setseed() call to actually be executed. Since we need to realize it, but we don't need any of its data, I reduce the data passed back by "summarizing" and keeping the first row (one row, one column), then collect it, then discard it by returning invisible(NULL).
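To make the "realize" point concrete, here is a minimal sketch. The names demo_con, demo, and lazy_seed are hypothetical and only for illustration; the sketch assumes the dplyr, dbplyr, DBI, and duckdb packages are installed. Building the pipeline only gives a lazy table, and the seeding query is not run until collect() is called.

```r
library(dplyr)

# Hypothetical throwaway connection and table, separate from the ones used in this answer
demo_con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(demo_con, "demo", data.frame(x = 1:3))

# Building the pipeline does not run the setseed() query; it only describes it
lazy_seed <- tbl(demo_con, "demo") |>
  summarize(a = setseed(0.5)) |>
  head(n = 1)
is_lazy <- inherits(lazy_seed, "tbl_lazy")

# show_query() prints the SQL that would be sent, still without running it
show_query(lazy_seed)

# collect() is the step that sends the query to DuckDB, and with it setseed()
seeded <- collect(lazy_seed)

DBI::dbDisconnect(demo_con)
```

This is why the helper cannot stop at summarize(): without the collect(), the seed would never reach the database.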
Also, this works on a more direct connection:
duck <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(duck, "mtcars", mtcars)
mtcars_tbl <- dplyr::tbl(duck, "mtcars")
From here, we just need to call use_setseed immediately before our randomizing query.
use_setseed(mtcars_tbl)
mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |>
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold))
# # Source:   SQL [9 x 3]
# # Database: DuckDB 0.8.1 [r2@Linux 6.2.0-27-generic:R 4.2.3/:memory:]
#     cyl  fold avg_hp
#   <dbl> <dbl>  <dbl>
# 1     6     1  110
# 2     4     1   83.5
# 3     8     3  210
# 4     6     3  114
# 5     4     3   79.7
# 6     6     2  149
# 7     8     1  174
# 8     8     2  252.
# 9     4     2   97
# validation
use_setseed(mtcars_tbl)
res1 <- mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |>
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold)) |>
  dplyr::collect()
resn <- replicate(10, {
  use_setseed(mtcars_tbl)
  mtcars_tbl |>
    dplyr::mutate(fold = ceiling(3 * random())) |>
    dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold)) |>
    dplyr::collect()
}, simplify = FALSE)
sapply(resn, identical, res1)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Equivalently, if you have the duckdb connection object, you can reduce the bandwidth a little more by using this version of the function:
use_setseed2 <- function(con, seed = 0.5) {
  DBI::dbExecute(con, "select setseed(?) as ign", params = list(seed))
  invisible(NULL)
}
Call it with the duckdb connection object, as in:
use_setseed2(duck) # note 'duck' and not 'mtcars_tbl'
mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |>
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold))
# same as above
# validation
use_setseed2(duck)
res1 <- mtcars_tbl |>
  dplyr::mutate(fold = ceiling(3 * random())) |>
  dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold)) |>
  dplyr::collect()
resn <- replicate(10, {
  use_setseed2(duck)
  mtcars_tbl |>
    dplyr::mutate(fold = ceiling(3 * random())) |>
    dplyr::summarize(avg_hp = mean(hp), .by = c(cyl, fold)) |>
    dplyr::collect()
}, simplify = FALSE)
sapply(resn, identical, res1)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
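Since the seed call has to come immediately before the randomizing query, one option is to bundle the two steps so they cannot drift apart. The following is only a sketch under assumptions: collect_seeded is a hypothetical helper (not part of dplyr, DBI, or duckdb), and demo_con is a separate throwaway connection created just for the example.

```r
library(dplyr)

# Hypothetical helper: seed DuckDB's generator, then immediately collect `query`
collect_seeded <- function(con, query, seed = 0.5) {
  DBI::dbExecute(con, "select setseed(?) as ign", params = list(seed))
  dplyr::collect(query)
}

# Throwaway connection for the example (assumption: in-memory DuckDB)
demo_con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(demo_con, "mtcars", mtcars)

# The lazy query is built once; it only runs inside collect_seeded()
folds <- tbl(demo_con, "mtcars") |>
  mutate(fold = ceiling(3 * random())) |>
  summarize(avg_hp = mean(hp), .by = c(cyl, fold))

run1 <- collect_seeded(demo_con, folds)
run2 <- collect_seeded(demo_con, folds)
same_result <- identical(run1, run2)

DBI::dbDisconnect(demo_con)
```

The trade-off is that the helper always collects, so it fits cases where the randomized result is wanted locally rather than kept as a lazy table for further database-side work.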