Background:
I have a very large amount of data (1,500 GB) in Google Cloud BigQuery.
I'm trying to build an ML model using this data as the training dataset, so I wrote the following code in a Jupyter notebook to fetch it.
import os

import pandas as pd
from google.cloud import bigquery

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './my_credential.json'
client = bigquery.Client()

sql = """
SELECT
    Feature1,
    Feature2,
    Feature3,
    target
FROM dataset
"""

# Run the query and load the full result into a pandas DataFrame
sql_result = client.query(sql)
df = sql_result.to_dataframe()
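
The plan is then to train on the resulting DataFrame in memory, roughly like the sketch below (scikit-learn and RandomForestClassifier are just placeholder choices to illustrate; the feature and target columns come from the query above):

from sklearn.ensemble import RandomForestClassifier

# Placeholder training step: assumes the whole DataFrame fits in memory
X = df[["Feature1", "Feature2", "Feature3"]]
y = df["target"]
model = RandomForestClassifier()
model.fit(X, y)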
The problem:
The code throws a MemoryError after about 30 minutes of execution. I understand this happens because the code tries to pull all 1,500 GB into the memory of the machine running my Jupyter notebook, but I don't know how to fix it.
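
For reference, a dry run along these lines (a sketch reusing the client and sql defined above) should report how many bytes the query would scan, without actually running it:

# Sketch: a dry run estimates bytes scanned without executing the query
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run_job = client.query(sql, job_config=dry_run_config)
print(f"Query would process {dry_run_job.total_bytes_processed / 1e9:.1f} GB")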
How can I train on this much data from a Jupyter notebook?
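
For example, would streaming the result in page-sized chunks be a reasonable direction? A sketch of what I mean (to_dataframe_iterable() is from the BigQuery client library; consume_chunk is a hypothetical incremental training step):

# Sketch: iterate over the result as a series of smaller DataFrames
rows = client.query(sql).result(page_size=100_000)
for chunk_df in rows.to_dataframe_iterable():
    consume_chunk(chunk_df)  # hypothetical incremental training step

Or is there a better pattern for training on data at this scale?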