
Some variants of this question have been answered here and here, which I've used successfully.

Nevertheless, I have a slightly different problem. I've exported 1 GB of data from BigQuery into Google Cloud Storage. The export is split across 5 CSV files, each of which contains the column names as a header row (I think this is what's causing things to break).

The code that I have is:

# Run import
import pandas as pd
import numpy as np
from io import BytesIO

# Grab the file from the cloud storage
variable_list = ['part1', 'part2','part3','part4','part5']
for variable in variable_list:
  file_path = "gs://[Bucket-name]/" + variable + ".csv"
  %gcs read --object {file_path} --variable byte_data

# Read the dataset
data = pd.read_csv(BytesIO(byte_data), low_memory=False)

However, when I call len(data), I don't get the full number of rows back. The code above seems to load only one file.

I could load 5 separate data frames and combine them manually in pandas (data = [df1, df2, df3, df4, df5]), but that seems very ugly.
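
For reference, here is a minimal sketch of that "one data frame per file, then combine" approach, reusing the %gcs magic and the placeholder bucket name from the code above; pd.concat does the actual row-wise combination, and each part's header row is parsed separately:

import pandas as pd
from io import BytesIO

variable_list = ['part1', 'part2', 'part3', 'part4', 'part5']
frames = []
for variable in variable_list:
  file_path = "gs://[Bucket-name]/" + variable + ".csv"
  %gcs read --object {file_path} --variable byte_data
  # Parse this part on its own; the header row of each file is consumed here
  frames.append(pd.read_csv(BytesIO(byte_data), low_memory=False))

# Combine all parts into a single dataframe
data = pd.concat(frames, ignore_index=True)
print(len(data))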

GRS
  • My initial thought is that `byte_data` is being overwritten in each iteration. Could you create another python variable to store the full contents (where you can append `byte_data` after each iteration)? – Anthonios Partheniou Nov 29 '17 at 20:40
  • @AnthoniosPartheniou type(byte_data) returns that it's a bytes object. But if I create the empty bytes object full_data = bytes(), it doesn't have append. I tried changing full_data to a list but I get: 'NoneType' object has no attribute 'append' – GRS Nov 30 '17 at 10:24
  • Try using `bytearray` or alternatively search for 'byte concatenation'. – Anthonios Partheniou Nov 30 '17 at 10:48
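
For completeness, a minimal sketch of the byte-concatenation idea suggested in the comments above (hypothetical; it assumes every part file starts with the same header row, which has to be dropped from all but the first part before parsing):

import pandas as pd
from io import BytesIO

variable_list = ['part1', 'part2', 'part3', 'part4', 'part5']
full_data = bytearray()
for i, variable in enumerate(variable_list):
  file_path = "gs://[Bucket-name]/" + variable + ".csv"
  %gcs read --object {file_path} --variable byte_data
  if i == 0:
    full_data.extend(byte_data)                       # keep the header from the first part
  else:
    full_data.extend(byte_data.split(b'\n', 1)[1])    # drop the repeated header line

data = pd.read_csv(BytesIO(bytes(full_data)), low_memory=False)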

1 Answer


I found this and adapted it for my case. It loops over all files in the bucket (folder):

import google.datalab.storage as storage
import pandas as pd

try:
    from StringIO import StringIO            # Python 2
except ImportError:
    from io import BytesIO as StringIO       # Python 3: %gcs read returns bytes

bucket_folder = 'ls_w'                        # bucket name
bucket = storage.Bucket(bucket_folder)        # bucket object to iterate over

df = pd.DataFrame()                           # final dataframe
for obj in bucket.objects():                  # loop over all objects in the bucket
    if '/' not in obj.key:                    # only objects at bucket level, not in
                                              # subfolders; add other conditions here
                                              # to exclude other files
        fn = obj.key                          # file name (optional)
        print(fn)

        bites = 'gs://%s/%s' % (bucket_folder, fn)
        %gcs read --object $bites --variable data

        tdf = pd.read_csv(StringIO(data))     # read this file

        df = pd.concat([df, tdf])             # concatenate results

Lukasz
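
As a follow-up note on the answer above: calling pd.concat inside the loop re-copies the growing dataframe on every iteration, so with many files it can be cheaper to collect the parts in a list and concatenate once. A sketch, assuming the imports, bucket and bucket_folder defined in the answer:

frames = []
for obj in bucket.objects():
    if '/' not in obj.key:                    # still only top-level objects
        bites = 'gs://%s/%s' % (bucket_folder, obj.key)
        %gcs read --object $bites --variable data

        frames.append(pd.read_csv(StringIO(data)))

df = pd.concat(frames, ignore_index=True)     # single concatenation at the end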