from google.cloud import bigquery as bq
import google_auth_oauthlib.flow
import pandas as pd

query = '''select ... from ...'''

bigquery_client = bq.Client()
table = bq.query.QueryResults(query=query, client=bigquery_client)
table.use_legacy_sql = False
table.run()

# transfer the BigQuery results into a pandas DataFrame
columns = [field.name for field in table.schema]
rows = table.fetch_data()
data = []
for row in rows:
    data.append(row)

df = pd.DataFrame(data=data[0], columns=columns)

I want to load more than 10 million rows into Python. This worked fine a few weeks ago, but now it only returns 100,000 rows. Does anyone know a reliable way to do this?

vortex
  • I also tried async_query.py, and played with rows = query_job.results().fetch_data(max_results=1000000), but it seems like a cap of 100,000 rows is applied somewhere. Is there a way to override the cap, or a more efficient way to get BigQuery data into Python for calculation? – vortex Aug 15 '17 at 17:21
  • Just wondering, have you run this query in the web UI or CLI to see if it returns the total rows you expect? – Willian Fuks Aug 15 '17 at 17:53
  • I have run it in my CLI, and it only returns 100,000 rows. So the cutoff could be either at table.run() or table.fetch_data() (see the diagnostic sketch after these comments). – vortex Aug 15 '17 at 17:55
  • If the CLI is also returning 100k, then it seems that's actually all you have in your table. It looks like the issue is in your table and not some threshold being hit when bringing the data. – Willian Fuks Aug 15 '17 at 18:11
  • I ran the same query in the UI, and it returns more than 39 million rows. But with the Python program it's harder to diagnose where the cutoff occurred. – vortex Aug 15 '17 at 18:14
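
One way to narrow down where the cutoff happens, sketched under the assumption that the installed client is an older pre-0.26 release whose fetch_data() returns a (rows, total_rows, page_token) tuple rather than an iterator:

# Diagnostic sketch: compare the server-side row count with what one fetch_data() call returns.
# (The tuple return is an assumption about the older client, not confirmed documentation.)
table.run()
rows, total_rows, page_token = table.fetch_data()

# If total_rows is in the millions while len(rows) is 100,000 and page_token is not None,
# the query ran fine and fetch_data() only returned the first page of results.
print(len(rows), total_rows, page_token)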

1 Answer


I just tested the code below and was able to bring back 3 million rows with no cap being applied:

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'

from google.cloud.bigquery import Client

bc = Client()
query = 'your query'

job = bc.run_sync_query(query)
job.use_legacy_sql = False
job.run()

data = list(job.fetch_data())

Does it work for you?
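
If the end goal is a pandas DataFrame, as in the question, a minimal follow-up sketch under the same assumptions (client version `0.26.0`, where the query results expose a schema attribute):

import pandas as pd

# Column names come from the result schema; data is the list of rows fetched above.
columns = [field.name for field in job.schema]
df = pd.DataFrame(data, columns=columns)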

Willian Fuks
  • I can run your code without a problem, but len(data[0]) is still 100K: data[0] is a list of tuples with 100K items, data[1] is the row count, which is 39 million, and data[2] is a string. Is this what your data structure looks like too? – vortex Aug 15 '17 at 19:02
  • Ah, I see. It looks like you are using an old version of the BQ client. I recommend using version `0.26.0`. You can see which version you are using by running: `from google.cloud.bigquery import __version__;print(__version__)` – Willian Fuks Aug 16 '17 at 08:00
  • Yes, you are right; the version was probably downgraded by other installations. Now it takes a long time to load the table in. I am in the process of finding an efficient workflow for dealing with a lot of data. Do you have any suggestions? – vortex Aug 16 '17 at 12:46
  • Yeah, bringing 40 million rows into a single instance is quite expensive. It really depends on what you want to do. What I recommend is trying Dataflow, implemented in [apache beam](https://beam.apache.org/), or having a cluster to run your analyses, such as [dataproc](https://cloud.google.com/dataproc/). For the latter I have a Jupyter notebook integrated with the cluster's master node and find it really useful for everyday analyses on data. – Willian Fuks Aug 16 '17 at 14:04
  • Could you please give me more implementation details for your Jupyter workflow? Do you use Datalab? For the big data interaction, do you set up a cluster and bring the BigQuery data directly into memory? Any documentation links would be appreciated! – vortex Aug 16 '17 at 14:19
  • I couldn't use Datalab yet because my user got denied access due to a mistake and now I need to wait to receive access again... For the data management, I actually rarely work with big data. First I work with small data, and only when the model is ready do I go for bigger volumes. Still, I do have a very small Dataproc cluster with Jupyter so I can run general analyses and build some algorithms. For the setup, I followed these steps: https://cloud.google.com/dataproc/docs/concepts/init-actions And usually I bring data from BQ to GCS. – Willian Fuks Aug 16 '17 at 14:29
  • But BQ -> GCS only allows exporting small files, and with 100GB of data it would break into multiple pieces. Do you just save the data into .csv format and read it from Python? Last time I tried pd.read_csv('gs://...'), it didn't seem to work. – vortex Aug 16 '17 at 14:35
  • I run an `extract` job with compression 'GZIP' to a bucket, which does indeed break the result into several files (see the sketch after these comments). In Dataproc I just load the files like `train_df = spark.read.csv('gs://BUCKET_NAME/FOLDER_NAME/train_data*.gz', header=True)` – Willian Fuks Aug 16 '17 at 14:38
  • Great! Thanks a lot! – vortex Aug 16 '17 at 14:42
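
A minimal sketch of the BQ -> GCS export described in the last comments, written against a recent google-cloud-bigquery client rather than the 0.26.0 API used in this thread; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers -- replace with your own project, dataset, table, and bucket.
source_table = 'my-project.my_dataset.my_table'
destination_uri = 'gs://BUCKET_NAME/FOLDER_NAME/train_data-*.csv.gz'  # wildcard => sharded output files

job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = bigquery.DestinationFormat.CSV

extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # block until the export finishes

The sharded .gz files can then be loaded in Dataproc with spark.read.csv as shown in the comment above.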