
When using bigrquery from R, the tidyverse API asks for authentication which, once provided, allows BigQuery to be accessed from the R session. Downloading results can be time-consuming, and if the OAuth token expires mid-transfer the download fails:

Error: Invalid Credentials [authError]

Example code

# install.packages("tidyverse"); install.packages('bigrquery')
Sys.setenv(BIGQUERY_TEST_PROJECT="your-gcp-project-id") 
library(tidyverse)
library(dbplyr)
library(bigrquery)

billing <- bq_test_project()

connection <- dbConnect(
  bigrquery::bigquery(),
  project = "your-gcp-project-id",
  dataset = "dataset-name",
  billing = billing
)

mytable <- tbl(connection, "mytable")

mytable %>%
  -- some heavy dplyr wrangling --
  %>% collect()

Which returns

Running job [/] 20s
Complete
Billed 400GB
Downloading 230,000,000 rows
Downloading data [=====--------] 28% ETA: 24h

but after some time

Error: Invalid Credentials [authError]

Question

How can the length of time before the OAuth token expires be increased?

stevec
  • Hours to download 10GB seems very slow compared to consumer internet speeds. Can you confirm that your R memory is large enough to fit the entire 10GB table into memory? Otherwise R will either have hung due to lack of memory, or is writing to and from disk, which is causing additional delay. – Simon.S.A. May 11 '20 at 02:54
  • @Simon.S.A. I agree about the speed, it's slow, but I've tried it many times with the same outcome: bigquery runs very fast, but downloading the resulting dataset is slow. I can confirm no problem with memory (30gb instance with nothing else running, data extract is 15gb max). I wondered whether it's writing to disk. I know a) the data starts as parquet in bigquery, it's queried by bigquery, then the results magically appear in the R session. Short answer: I don't know how they get there (if it's directly to ram or if they're written to disk first) – stevec May 11 '20 at 02:57
  • The DBI and bigrquery documentation suggests setting a timeout parameter when you connect to bigquery via dbplyr is not supported. There might be a setting on the bigquery end (i.e. not via R) but that is beyond my experience. – Simon.S.A. May 11 '20 at 03:12
  • If you have unique ID numbers in your query/table, then I can offer a workaround I use. – Simon.S.A. May 11 '20 at 03:13
  • @Simon.S.A. I think I know what you’re going to suggest, to split the query up into multiple shorter queries and authenticate in between each to avoid the token expiry issue? I had to do that as a hacky workaround but it would be awful to have to do that every time – stevec May 11 '20 at 03:15
  • That is the essence of my work-around - so I won't post it unless you think a different implementation might be insightful and request it. – Simon.S.A. May 11 '20 at 03:49
  • @Simon.S.A. I very much appreciate it, you could still post it in case others have a similar problem and prefer the work around? As the R/bigquery pattern is one I’ll use a lot I will aim for a simpler solution. I hope there’s a token timeout parameter I can set once somewhere – stevec May 11 '20 at 03:54

2 Answers


I also hit the same error, and it appears to be a timeout issue.

Another potential workaround is to export the result to Google Data Studio and then download it from there as a CSV.

Alternatively, for large datasets, the best approach is to extract the BigQuery result to Google Cloud Storage and then download the data from there, as described in this article: https://mran.microsoft.com/web/packages/bigQueryR/vignettes/bigQueryR.html

    ## Create the data extract from BigQuery to Cloud Storage
    job_extract <- bqr_extract_data("your_project",
                                    "your_dataset",
                                    "bigResultTable",
                                    "your_cloud_storage_bucket_name")

    ## poll the extract job to check its status
    ## its done when job$status$state == "DONE"
    bqr_get_job("your_project", job_extract$jobReference$jobId)

    ## to download via a URL and not logging in via Google Cloud Storage interface:
    ## Use an email that is Google account enabled
    ## Requires scopes:
    ##  https://www.googleapis.com/auth/devstorage.full_control
    ##  https://www.googleapis.com/auth/cloud-platform
    ## set via options("bigQueryR.scopes") and reauthenticate if needed

    download_url <- bqr_grant_extract_access(job_extract, "your@email.com")

    ## download_url may be multiple if the data is > 1GB
    > [1] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000000.csv"
    > [2] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000001.csv"
    > [3] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000002.csv"
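
Once the extract job is done and access has been granted, the CSV pieces can be pulled down and combined in R. A minimal sketch of that last step (not part of the linked vignette; it assumes the URLs returned by bqr_grant_extract_access() are downloadable from your authenticated session):

    ## Sketch only: download each extract CSV and bind the pieces together
    library(readr)
    library(dplyr)

    ## one local file per extract URL
    local_files <- file.path(tempdir(), basename(download_url))

    for (i in seq_along(download_url)) {
      download.file(download_url[i], local_files[i], mode = "wb")
    }

    ## combine the chunks into a single data frame
    results <- bind_rows(lapply(local_files, read_csv))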
Hao Zhang

Not a complete answer, but here are the details of my research so far, to save others from retracing my steps.

Timeout does not appear controllable via dbplyr

  • dbConnect accepts both the driver and arguments to pass to the driver (documentation): dbConnect(drv, ...).
  • For some connection types the additional arguments can include timeout. This issue has an example using Cassandra: con <- dbConnect(odbc::odbc(), "Cassandra (DSN)", timeout = 10).
  • However timeout does not appear to be supported as an argument for bigquery. The documentation lists the following arguments (project, dataset, billing, page_size, quiet, use_legacy_sql, bigint) and notes that other arguments are currently ignored.

So given the above, it seems unlikely that the timeout can be controlled by R via dbplyr, DBI or the connection.
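
For reference, a minimal sketch of a bigrquery connection spelling out only those documented arguments (the values are placeholders); note that nothing resembling a timeout appears among them:

# Sketch only: the documented dbConnect() arguments for bigrquery,
# with placeholder values. There is no timeout-style argument.
library(DBI)
library(bigrquery)

connection <- dbConnect(
  bigrquery::bigquery(),
  project        = "your-gcp-project-id",
  dataset        = "dataset-name",
  billing        = "your-gcp-project-id",
  page_size      = 10000,        # rows fetched per page when collecting
  quiet          = NA,           # NA shows progress only for slower jobs
  use_legacy_sql = FALSE,        # use standard SQL
  bigint         = "integer64"   # how 64-bit integers are returned
)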

Splitting the query into multiple shorter queries

While not OP's preference (the comments make this clear), this is still a potential solution. I use an approach of filtering on a unique ID column, with a wrapper function to reduce some of the additional clutter:

reconnect <- function(jj){
  if(exists("connection"))
    dbDisconnect(connection) # avoids multiple concurrent connections

  print(paste0(as.character(Sys.time()),"-- modulo ",jj," begun")) # track progress

  connection <- dbConnect(
    bigrquery::bigquery(),
    project = "your-gcp-project-id",
    dataset = "dataset-name",
    billing = billing
  )

  mytable <- tbl(connection, "mytable") %>%
    filter(unique_id %% NUM_SUBSETS == jj) # filter to subset, requires unique_id

  # assignment into the parent environment
  assign("connection", connection, envir = parent.frame())
  assign("mytable ", mytable , envir = parent.frame())
}

We then iterate as follows:

## parameters
DEVELOPMENT_MODE = FALSE
NUM_SUBSETS = 50

## subset
modulo = if(DEVELOPMENT_MODE){
  0 # only one subset in development mode
} else {
  0:(NUM_SUBSETS-1) # otherwise all of them
}

results = data.frame()

for(jj in modulo){
  reconnect(jj)

  these_results = mytable %>%
    -- some heavy dplyr wrangling --
    %>% collect()

  results = rbind(results, these_results)
}

I set DEVELOPMENT_MODE to TRUE when testing/developing, and to FALSE when I want the entire thing to run.

Other avenues to consider

  • Check whether the timeout can be set/controlled within the BigQuery account (if it cannot be controlled via R).
  • Investigate how complex -- heavy dplyr wrangling here -- actually is. Because dbplyr does not always translate into very efficient SQL code, in my work on SQL Server saving intermediate tables has cut hours off my runtimes. Given that downloading 10GB should be much faster than several hours, the bottleneck could be BigQuery conducting all the wrangling on the fly (with the initial 20-second execution only reflecting lazy evaluation). This link suggests there is a six-hour limit on the duration of a single execution. A sketch of saving an intermediate table follows below.
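
A rough sketch of that last idea (my own illustration, using a hypothetical table name; depending on your bigrquery/dbplyr versions you may need the explicit name and temporary = FALSE shown here): dplyr::compute() asks BigQuery to materialise the wrangled result as a table before anything is downloaded, so collect() only has to stream the finished output.

# Sketch only: materialise the heavy wrangling on the BigQuery side first.
# "my_intermediate_table" is a hypothetical name in your dataset.
wrangled <- mytable %>%
  # -- some heavy dplyr wrangling --
  compute(name = "my_intermediate_table", temporary = FALSE)

results <- wrangled %>% collect()  # download only the finished result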
Simon.S.A.