I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.

I have been using blob.download_to_file() to read the byte stream into pandas, but I get the following error: UnpicklingError: invalid load key, m. I have tried seeking to the beginning of the stream, to no avail, and I'm pretty sure something fundamental is missing from my understanding.

When I instead pass an open file object and read from there, I get an UnsupportedOperation: write error.

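from io import BytesIO

import pandas as pd
from google.cloud import storage
from google.oauth2 import service_account

# Note: service_account_credentials_path and projectid are defined elsewhere (not shown).

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    # Download the blob into an in-memory buffer and rewind it for reading.
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    # Use explicit service-account credentials if a key file path is set,
    # otherwise fall back to the default credentials.
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)  # this is the call that fails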

Ideally pandas would just read from the pickle, since all of my GCS backups are in that format, but I'm open to suggestions.

1 Answer

According to its documentation, the pandas.read_pickle() function takes a file path string as its argument, not a file handle/object:

pandas.read_pickle(path, compression='infer') 
   Load pickled pandas object (or any object) from file.

path : str 
   File path where the pickled object will be loaded.

If you're on the second-generation standard environment or the flexible environment, you could try using a real file under /tmp instead of BytesIO.
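A minimal sketch of the /tmp approach, under some assumptions: `read_blob_via_tempfile` is a hypothetical helper name, `blob` is assumed to be a google.cloud.storage Blob (only its `download_to_filename()` method is used), and `reader` is any function that takes a file path, such as `pd.read_pickle`.

```python
import os
import tempfile


def read_blob_via_tempfile(blob, reader):
    """Download `blob` to a temporary file and pass that file's path to `reader`.

    Assumptions: `blob` behaves like a google.cloud.storage Blob (it has a
    download_to_filename(path) method) and `reader` accepts a file path,
    e.g. pandas.read_pickle. The helper name itself is hypothetical.
    """
    # mkstemp creates the file in the system temp directory (normally /tmp on Linux).
    fd, tmp_path = tempfile.mkstemp(suffix=".pickle")
    os.close(fd)  # only the path is needed; the blob writes the file itself
    try:
        blob.download_to_filename(tmp_path)
        return reader(tmp_path)
    finally:
        # Always remove the temporary file, even if the download or read fails.
        os.remove(tmp_path)
```

With the question's code this might be called as `read_blob_via_tempfile(_get_blob('backups', 'Matches/Matches.pickle', projectid), pd.read_pickle)`. Deleting the file afterwards is a deliberate choice here, since temporary storage on an instance is limited.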

Otherwise you'd have to find another way of loading the data into pandas, one that supports a file object/descriptor. The general approach is described in "How to restore Tensorflow model from Google bucket without writing to filesystem?" (the context is different, but the general idea is the same).
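One loader that does accept a file object is the standard library's `pickle.load()`. The following is a sketch rather than a drop-in replacement: `load_pickled_blob` is a hypothetical helper name, `blob` is assumed to be a google.cloud.storage Blob (only `download_to_file()` is used), and plain `pickle.load()` skips the extra compatibility handling that `pandas.read_pickle()` performs, so it may not work for pickles written by a very different pandas version.

```python
import pickle
from io import BytesIO


def load_pickled_blob(blob):
    """Unpickle a blob's contents entirely in memory, without touching disk.

    Assumption: `blob` behaves like a google.cloud.storage Blob (it has a
    download_to_file(file_obj) method). The helper name is hypothetical.
    Only unpickle data from a source you trust.
    """
    buffer = BytesIO()
    blob.download_to_file(buffer)  # write the object's bytes into the buffer
    buffer.seek(0)                 # rewind so reading starts at the first byte
    return pickle.load(buffer)     # pickle.load accepts any binary file object
```

If the stored object is a DataFrame, pandas still has to be importable in the process for the unpickling to succeed.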

Dan Cornilescu
  • Thank you for the concise explanation, as well as for your link to the similar problem you helped with. I had noticed others using the /tmp directory, but I was unaware of what exactly it was. Now that I'm aware of it, it opens many doors. – Arya Eshraghi Feb 05 '19 at 01:08