
I'm trying to load a .txt file from a GCS bucket into a pandas df via pd.read_csv. When I run this code on my local machine (sourcing the .txt file from a local directory), it works perfectly. However, when I try to run the code in a cloud function, accessing the same .txt file from a GCS bucket, I get 'TypeError: cannot use a string pattern on a bytes-like object'.

The only difference is that I'm accessing the .txt file via the GCS bucket, so it's a bucket object (Blob) instead of a normal file. Would I need to download the blob as a string or as a file-like object first before doing pd.read_csv? Code is below.

def stage1_cogs_vfc(data, context):  

    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np


    start_bucket = 'my_bucket'   
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(start_bucket)

    df = pd.DataFrame()

    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')

    Traceback (most recent call last):
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
        _function_handler.invoke_user_function(event_object)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
        return call_user_function(request_or_event)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
        event_context.Context(**request_or_event.context))
      File "/user_code/main.py", line 20, in stage1_cogs_vfc
        df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
        self._make_engine(self.engine)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1132, in _make_engine
        self._engine = klass(self.f, **self.options)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2238, in __init__
        self.unnamed_cols) = self._infer_columns()
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2614, in _infer_columns
        line = self._buffered_line()
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2689, in _buffered_line
        return self._next_line()
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2791, in _next_line
        next(self.data)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2379, in _read
        yield pat.split(line.strip())
    TypeError: cannot use a string pattern on a bytes-like object
jwlon81

2 Answers


I found a similar situation here.

I also noticed that on the line:

source_bucket = storage_client.bucket(source_bucket)

you are using "source_bucket" as both the variable name and the parameter. I would suggest changing one of those.

However, I think you may want to see this doc for any further questions related to the API itself: Storage Client - Google Cloud Storage API
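For reference, a minimal sketch of downloading the blob into a file-like object before handing it to pandas, assuming the bucket and file names from your question and using Blob.download_as_string() (which returns bytes, hence the decode):

    from google.cloud import storage
    import pandas as pd
    import io

    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')
    blob = bucket.blob('SCE_Var_Fact_Costs.txt')

    # download_as_string() returns bytes, so decode before wrapping in StringIO
    data = blob.download_as_string().decode('utf-8')
    df = pd.read_csv(io.StringIO(data), skiprows=12, header=None,
                     sep=r'\s+|\^+', engine='python')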

Kevin Quinzel
  • thanks for pointing that out - I have updated my code but the problem still persists – jwlon81 Jul 16 '19 at 06:16
  • Did you try the steps on the first link? I also found something similar over here: https://stackoverflow.com/a/21546823/7674214 – Kevin Quinzel Jul 16 '19 at 22:02
  • Interestingly, I managed to get it reading into a Dask df (via the link you provided), but when I go to do more pandas-based ETL operations (like iloc etc.), dask can't do these. Your suggestion was useful though - thank you. – jwlon81 Jul 16 '19 at 22:32
  • actually... I read in via dask, then converted the dask df to a pandas df (via `df = daskdf.compute()`) in order to do more ETL heavy lifting. Works perfectly. – jwlon81 Jul 17 '19 at 02:15

Building on points from @K_immer, here is my updated code, which includes reading into a Dask df...

def stage1_cogs_vfc(data, context):  

    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt


    start_bucket = 'my_bucket'
    destination_path = 'gs://my_bucket/ddf-*_cogs_vfc.csv'

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(start_bucket)

    blob = bucket.get_blob('SCE_Var_Fact_Costs.txt')

    df0 = pd.DataFrame()

    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df0 = dd.read_csv(file_path,skiprows=12, dtype=object ,encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')


    df7 = df0.compute() # converts the Dask df to a pandas df

    # then do your heavy ETL stuff here using pandas...
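The destination_path defined above is presumably where the result gets written back out once the ETL is done; a rough sketch of that step (assuming gcsfs is available in the function and using Dask's glob-style to_csv, one file per partition) could be:

    # hypothetical write-back: convert the transformed pandas df back to Dask
    # and write one CSV per partition using the glob pattern in destination_path
    df_out = dd.from_pandas(df7, npartitions=1)
    df_out.to_csv(destination_path, index=False)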

jwlon81