
I have tried two approaches with the bigrquery package:

library(bigrquery)
library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "YOUR PROJECT ID HERE",
  dataset = "YOUR DATASET"
)
test <- dbGetQuery(con, sql, n = 10000, max_pages = Inf)

and

sql <- "YOUR LARGE QUERY HERE"  # the long query is saved as a view; this selects from it
tb <- bigrquery::bq_project_query(project, sql)
bq_table_download(tb, max_results = 1000)

but both fail with the error "Error: Requested Resource Too Large to Return [responseTooLarge]". A potentially related issue is here, but I am interested in any tool that gets the job done: I already tried the solutions outlined here, but they failed.

How can I load large datasets from BigQuery into R?

hhh
  • Why the downvote? I can very easily do this operation in other languages such as Python, but with R there seems to be no easy way, unless I find some sharding option or similar. – hhh Sep 02 '18 at 14:56
  • That error specifically comes from Google/bigquery. Are you _sure_ you [read everything you should have in the docs](https://github.com/r-dbi/bigrquery/blob/a3ef603f36495bc13d9a5f5081010084bf3c146c/R/bq-download.R#L35-L37)? – hrbrmstr Sep 03 '18 at 00:57
  • @hrbrmstr I see, I did not expect that the tool would not use batches or sharding for downloading, like Pandas' read_gbq in Python. I currently use the method mentioned for large data sets, but it would be more convenient to do it directly from R. Is it really the case that there is no batch download option for R like read_gbq? – hhh Sep 03 '18 at 06:41

4 Answers


As @hrbrmstr already suggested, the documentation specifically mentions:

> #' @param page_size The number of rows returned per page. Make this smaller
> #'   if you have many fields or large records and you are seeing a
> #'   'responseTooLarge' error.
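
A minimal sketch of applying that advice, assuming `tb` is a query result from `bq_project_query()` as in the question; the page size value of 2000 is only illustrative:

# A smaller page size keeps each API response below the size limit that triggers responseTooLarge.
bq_table_download(tb, page_size = 2000)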

In this documentation from r-project.org you will find further advice in the explanation of this function (page 13):

> This retrieves rows in chunks of page_size. It is most suitable for results of smaller queries (<100 MB, say). For larger queries, it is better to export the results to a CSV file stored on google cloud and use the bq command line tool to download locally.
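
A hedged sketch of that export route, assuming `sql` holds your query and you can write to a Google Cloud Storage bucket; the bucket name, local paths, and the `destination_format` argument forwarded by `bq_table_save()` are placeholders and assumptions, not verified settings:

library(bigrquery)

# Run the query, then export the result table to (possibly sharded) CSV files on GCS.
tb <- bq_project_query("your-project-id", sql)
bq_table_save(tb, "gs://your-bucket/results/export-*.csv", destination_format = "CSV")

# Copy the exported files locally (e.g. with the gsutil CLI) and bind them together.
# system("gsutil -m cp 'gs://your-bucket/results/export-*.csv' data/")
files <- list.files("data", pattern = "^export-.*\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(files, read.csv))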

Temu

I see someone has created a way of making this easier. There's some setup involved, but then you can download using the BigQuery Storage API like so:

## Auth is done automagically using Application Default Credentials.
## Use the following command once to set it up :
## gcloud auth application-default login --billing-project={project}
library(bigrquerystorage)

# TODO(developer): Set the project_id variable.
# project_id <- 'your-project-id'
#
# The read session is created in this project. This project can be
# different from that which contains the table.

rows <- bqs_table_download(
  x = "bigquery-public-data:usa_names.usa_1910_current"
  , parent = project_id
  # , snapshot_time = Sys.time() # a POSIX time
  , selected_fields = c("name", "number", "state")
  , row_restriction = 'state = "WA"'
  # , as_tibble = TRUE # FALSE : arrow, TRUE : arrow->as.data.frame
)

sprintf("Got %d unique names in states: %s",
        length(unique(rows$name)),
        paste(unique(rows$state), collapse = " "))

# Replace bigrquery::bq_table_download
library(bigrquery)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Downloading 6,122,890 rows in 613 pages.
overload_bq_table_download(project_id)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Streamed 6122890 rows in 5980 messages.
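
If the package is not installed yet, a hedged install sketch (the GitHub repository name meztez/bigrquerystorage is an assumption, and a CRAN release may also be available):

# install.packages("bigrquerystorage")  # if a CRAN release exists for your R version
install.packages("remotes")
remotes::install_github("meztez/bigrquerystorage")  # repository name assumed
# The BigQuery Storage Read API must also be enabled for the billing project.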
stevec

I just started using BigQuery too. I think it should be something like this.

The current bigrquery release can be installed from CRAN:

install.packages("bigrquery")

The newest development release can be installed from GitHub:

install.packages('devtools')
devtools::install_github("r-dbi/bigrquery")

Usage

Low-level API

library(bigrquery)
billing <- bq_test_project() # replace this with your project ID 
sql <- "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`"

tb <- bq_project_query(billing, sql)
#> Auto-refreshing stale OAuth token.
bq_table_download(tb, max_results = 10)

DBI

library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)
con 
#> <BigQueryConnection>
#>   Dataset: publicdata.samples
#>   Billing: bigrquery-examples

dbListTables(con)
#> [1] "github_nested"   "github_timeline" "gsod"            "natality"       
#> [5] "shakespeare"     "trigrams"        "wikipedia"

dbGetQuery(con, sql, n = 10)

dplyr

library(dplyr)

natality <- tbl(con, "natality")

natality %>%
  select(year, month, day, weight_pounds) %>% 
  head(10) %>%
  collect()
ASH
  • I demonstrated this approach in the question, but it fails with `"Error: Requested Resource Too Large to Return [responseTooLarge]"`. Did you find any way to get around the error with larger datasets? – hhh Sep 12 '18 at 10:42

This did the trick for me.

# Make page_size some value greater than the default (10000)
x <- 50000

bq_table_download(tb, page_size=x)

Beware, if you set page_size to some arbitrarily high value (100000 in my case), you'll start seeing a lot of empty rows.

Still haven't found a good "rule of thumb" for what the correct page_size value should be for a given table size.
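
A rough heuristic sketch (not from this answer; the ~10 MB per-page target and the numBytes/numRows metadata fields are assumptions): derive page_size from the table's average bytes per row.

library(bigrquery)

# tb is a bq_table, e.g. the result of bq_project_query()
meta <- bq_table_meta(tb)
bytes_per_row <- as.numeric(meta$numBytes) / as.numeric(meta$numRows)

# Aim for roughly 10 MB per page, comfortably under the response size limit.
page_size <- max(1, floor(10 * 1024^2 / bytes_per_row))
bq_table_download(tb, page_size = page_size)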

djfinnoy