Some variants of this question have been answered here and here, and I've used those answers successfully.
Nevertheless, I have a slightly different problem. I've exported 1GB of data from BigQuery into Google Cloud Storage. The export is split across 5 CSV files, each of which contains the column names as a header row (I think this is what's causing things to break).
The code that I have is:
# Run import
import pandas as pd
import numpy as np
from io import BytesIO

# Grab the file from the cloud storage
variable_list = ['part1', 'part2', 'part3', 'part4', 'part5']

for variable in variable_list:
    file_path = "gs://[Bucket-name]/" + variable + ".csv"
    %gcs read --object {file_path} --variable byte_data
    # Read the dataset
    data = pd.read_csv(BytesIO(byte_data), low_memory=False)
However, when I call len(data) I don't get the full number of rows back. The code above only seems to load one of the five files.
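Roughly how I checked this (the same loop as above, just with a print added to see what ends up in data on each pass):

for variable in variable_list:
    file_path = "gs://[Bucket-name]/" + variable + ".csv"
    %gcs read --object {file_path} --variable byte_data
    data = pd.read_csv(BytesIO(byte_data), low_memory=False)
    # Row count of the part that was just read into data
    print(variable, len(data))

# After the loop, len(data) only reflects whichever part was read last
print(len(data))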
I could load 5 separate data frames and combine them in pandas with something like data = pd.concat([df1, df2, df3, df4, df5]) (sketched below), but that seems very ugly. Is there a cleaner way to read all of the files into a single data frame?
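For reference, the manual workaround I mean looks roughly like this (df1 through df5 and the byte_data_part* variables are just placeholder names for the five separately downloaded objects):

# Each byte_data_partN is assumed to come from its own %gcs read call
df1 = pd.read_csv(BytesIO(byte_data_part1), low_memory=False)
df2 = pd.read_csv(BytesIO(byte_data_part2), low_memory=False)
df3 = pd.read_csv(BytesIO(byte_data_part3), low_memory=False)
df4 = pd.read_csv(BytesIO(byte_data_part4), low_memory=False)
df5 = pd.read_csv(BytesIO(byte_data_part5), low_memory=False)

# pd.concat stacks the frames vertically; ignore_index rebuilds the row index
data = pd.concat([df1, df2, df3, df4, df5], ignore_index=True)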