
I am trying to start a machine learning study using a table stored in Google BigQuery. The table is nearly 36 GB and has about 300 million rows with 6 columns. I tried the code below:

from google.colab import auth
from google.cloud import bigquery
import os

# Authenticate the Colab runtime against Google Cloud.
auth.authenticate_user()
print('Authenticated')

project_id = 'my_project_name'
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

def getClient():
    return bigquery.Client(project=project_id)

def executeQuery(query):
    client = getClient()
    # Run the query and load the entire result set into a pandas dataframe.
    return client.query(query).to_dataframe()

ml_query = '''SELECT * FROM `my_table_name`'''

results = executeQuery(ml_query)

When I limit the query, like ml_query='''SELECT * FROM my_table_name LIMIT 1000''', the code runs. That's why I think the problem is the size of the data, but I'm sure people work with datasets bigger than this. How can I handle this problem? I also tried to download the table as a CSV, but I learned that the row limit for the CSV export is 1 million. Any solution, including extracting the data and using PyCharm or a Jupyter notebook instead of Colab, is okay for me. I also have 100 GB of disk space in Colab. I'd appreciate any help.

bbgghh
  • I haven't used BigQuery so I could be completely wrong, but since you are using SQL, wouldn't it be possible to use pagination? [see example](https://www.sqlshack.com/pagination-in-sql-server/) – mahieyin-rahmun Apr 14 '21 at 20:44

1 Answer


You should let the Python client use the BigQuery Storage API if you want to pull a large amount of data out of BigQuery; there is an example of this in the documentation. (Storage API usage has an additional charge, about $1.10 per TB read.)
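A minimal sketch of that approach, assuming the Storage API client library is installed (e.g. pip install 'google-cloud-bigquery[bqstorage,pandas]'); the project and table names are the placeholders from the question:

from google.cloud import bigquery

client = bigquery.Client(project='my_project_name')

ml_query = '''SELECT * FROM `my_table_name`'''

# create_bqstorage_client=True makes to_dataframe() stream the result
# over the BigQuery Storage API instead of the slower REST API,
# which is what makes very large result sets practical.
results = client.query(ml_query).to_dataframe(create_bqstorage_client=True)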

Second option (prerequisite: access to a GCS bucket); a sketch follows the list:

  1. Submit an export job that writes the table to object storage (GCS).
  2. Read the exported files into a dataframe directly from GCS.
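A sketch of that path, assuming you can write to a bucket; the bucket name gs://my-bucket, the export prefix, and the dataset/table IDs my_dataset.my_table below are hypothetical placeholders:

from google.cloud import bigquery
import gcsfs          # lets pandas open gs:// paths
import pandas as pd

project_id = 'my_project_name'
client = bigquery.Client(project=project_id)

# 1. Export the table to GCS. Exports larger than 1 GB must use a
#    wildcard URI so BigQuery can shard the output across several files.
destination_uri = 'gs://my-bucket/export/my_table-*.csv'
extract_job = client.extract_table(
    'my_project_name.my_dataset.my_table',   # fully-qualified table ID
    destination_uri,
)
extract_job.result()   # wait for the export job to finish

# 2. Read the shards back into a single dataframe.
fs = gcsfs.GCSFileSystem(project=project_id)
shards = fs.glob('my-bucket/export/my_table-*.csv')
results = pd.concat((pd.read_csv(f'gs://{path}') for path in shards),
                    ignore_index=True)

Note that the concatenated dataframe still has to fit in RAM; if column types matter, exporting as Parquet (bigquery.ExtractJobConfig(destination_format='PARQUET')) preserves them better than CSV.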
ktopcuoglu