Not a complete answer, but here are the details of my research so far, to save others from retracing my steps.
Timeout does not appear controllable via dbplyr
dbConnect accepts both the driver and arguments to pass to the driver (documentation): dbConnect(drv, ...).
- For some connection types the additional arguments can include timeout. This issue has an example using Cassandra: con <- dbConnect(odbc::odbc(), "Cassandra (DSN)", timeout = 10).
- However, timeout does not appear to be supported as an argument for bigquery. The documentation lists the following arguments (project, dataset, billing, page_size, quiet, use_legacy_sql, bigint) and notes that other arguments are currently ignored.
So given the above, it seems unlikely that the timeout can be controlled from R via dbplyr, DBI, or the connection.
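To illustrate, here is a minimal sketch (the project, dataset, and billing names are placeholders) of a bigrquery connection: only the documented arguments listed above are honoured, and anything else, such as timeout, is currently ignored rather than raising an error.

library(DBI)
library(bigrquery)

connection <- dbConnect(
  bigrquery::bigquery(),
  project = "your-gcp-project-id",       # placeholder
  dataset = "dataset-name",              # placeholder
  billing = "your-billing-project-id",   # placeholder
  page_size = 10000,                     # documented argument
  use_legacy_sql = FALSE,                # documented argument
  timeout = 10                           # NOT documented: currently ignored by the driver
)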
Splitting the query into multiple shorter queries
While not the OP's preference (the comments make that clear), this is still a potential solution. I use an approach of filtering based on a unique ID column, with a wrapper function to reduce some of the additional clutter:
library(DBI)        # dbConnect(), dbDisconnect()
library(dplyr)      # tbl(), filter(), %>%
library(bigrquery)  # bigquery() driver

reconnect <- function(jj){
  if(exists("connection"))
    dbDisconnect(connection)   # avoids multiple concurrent connections
  print(paste0(as.character(Sys.time()), " -- modulo ", jj, " begun"))  # track progress
  connection <- dbConnect(
    bigrquery::bigquery(),
    project = "your-gcp-project-id",
    dataset = "dataset-name",
    billing = billing
  )
  mytable <- tbl(connection, "mytable") %>%
    filter(unique_id %% NUM_SUBSETS == jj)   # filter to this subset; requires a unique_id column
  # assignment into the parent environment
  assign("connection", connection, envir = parent.frame())
  assign("mytable", mytable, envir = parent.frame())
}
We then iterate as follows:
## parameters
DEVELOPMENT_MODE = FALSE
NUM_SUBSETS = 50

## subset
modulo = if(DEVELOPMENT_MODE){
  0                       # only one subset in development mode
} else {
  0:(NUM_SUBSETS - 1)     # otherwise all of them
}

results = data.frame()
for(jj in modulo){
  reconnect(jj)
  these_results = mytable %>%
    # -- some heavy dplyr wrangling here --
    collect()
  results = rbind(results, these_results)
}
I set DEVELOPMENT_MODE to TRUE when testing/developing, and to FALSE when I want the entire thing to run.
Other avenues to consider
- Check whether the timeout can be set/controlled within the BigQuery account settings (if it cannot be controlled via R).
- Investigate how complex the -- heavy dplyr wrangling here -- step is (see the sketch below). Because dbplyr does not always translate into very efficient SQL code, saving intermediate tables has cut hours off my runtimes in my work on SQL Server. Given that downloading 10 GB should be much faster than several hours, the bottleneck could be BigQuery conducting all the wrangling on the fly (and the initial 20-second execution only reflecting lazy evaluation). This link suggests there is a six-hour limit on the duration of a single execution.
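If materialising intermediate results is worth trying, here is a minimal sketch using dplyr's compute(), which asks the database to store the intermediate result in a table so later steps query that table instead of re-running the whole lazy pipeline. The table and column names are placeholders, and whether BigQuery accepts the intermediate table depends on your project and dataset settings.

library(dplyr)
library(dbplyr)

# Materialise the partially-wrangled data on the database side.
intermediate <- mytable %>%
  # -- first part of the heavy dplyr wrangling here --
  compute()

# Finish the wrangling against the stored intermediate table, then download.
results <- intermediate %>%
  # -- remaining wrangling here --
  collect()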