
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes, as explained in "Read csv from Google Cloud storage to pandas dataframe":

storage_client = storage.Client()

bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)

It shows the following error message while importing gcsfs. The packages dask and gcsfs have already been installed on my machine; however, I cannot get rid of the error.

File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line 
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
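
One quick way to see which versions are actually involved is to print them (a minimal sketch; it only inspects the local installation, and importing gcsfs may raise the same AttributeError, which the try/except prints instead of crashing):

import importlib

# print the installed version of each package involved in the traceback above
for name in ('dask', 'gcsfs', 'pandas'):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, '__version__', 'unknown'))
    except Exception as exc:
        print(name, 'failed to import:', exc)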
user3000538

3 Answers


It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install the following libraries (I am posting their latest versions):

google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1

Also, the filename already contains the .csv extension. So change the 9th line to this:

temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')

With these changes I ran your code and it works. I suggest you create a virtual environment, install the libraries there, and run the code in it.
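
Putting both points together, a sketch of the read loop with the corrected path (bucket_name and prefix are assumed to be defined as in the question):

from google.cloud import storage
import pandas as pd

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

list_temp_raw = []
for file in blobs:
    # file.name already ends in .csv, so nothing extra is appended to the path
    temp = pd.read_csv('gs://' + bucket_name + '/' + file.name, encoding='utf-8')
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)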

llompalles

This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import gcsfs and dask yourself, check whether _filesystems exists, and see its contents:

In [1]: import dask.bytes.core

In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}

In [3]: import gcsfs

In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
 'gcs': gcsfs.dask_link.DaskGCSFileSystem,
 'gs': gcsfs.dask_link.DaskGCSFileSystem}

As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better when it is unable to register itself with Dask, so updating may solve your problem.
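
After updating, a more defensive form of the same check avoids assuming the private _filesystems attribute exists (it can be absent depending on the dask version):

import gcsfs
import dask.bytes.core

# getattr with a default returns None instead of raising AttributeError
registry = getattr(dask.bytes.core, '_filesystems', None)
print(gcsfs.__version__)
print(None if registry is None else registry.get('gcs'))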

mdurant

A few things to point out in the code above: bucket_name and prefix need to be defined, and the iteration over the filenames should append each dataframe inside the loop; otherwise only the last one gets concatenated.

from google.cloud import storage
import pandas as pd

storage_client = storage.Client()

buckets_list = list(storage_client.list_buckets())
bucket_name='my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)
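
If the bucket also contains objects that are not CSV files (folder placeholder objects, for example), a small guard in the loop keeps them out of read_csv (a sketch based on the code above):

list_temp_raw = []
for file in blobs:
    filename = file.name
    if not filename.endswith('.csv'):
        # skip folder placeholders and any other non-CSV objects returned by list_blobs()
        continue
    temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
    list_temp_raw.append(temp)

# ignore_index=True gives the combined frame a fresh 0..n-1 row index
df = pd.concat(list_temp_raw, ignore_index=True)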
FerhatSF