
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes, as explained in "Read csv from Google Cloud storage to pandas dataframe":

storage_client = storage.Client()

bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)

It shows the following error message while importing gcsfs. The packages dask and gcsfs have already been installed on my machine; however, I cannot get rid of the error.

File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line 
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
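
One quick way to see which versions are actually involved is to print them (a minimal sketch; it only inspects the local installation, and importing gcsfs may raise the same AttributeError, which the try/except prints instead of crashing):

import importlib

# print the installed version of each package involved in the traceback above
for name in ('dask', 'gcsfs', 'pandas'):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, '__version__', 'unknown'))
    except Exception as exc:
        print(name, 'failed to import:', exc)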
user3000538

3 Answers


It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install the following libraries (I am posting their latest versions):

google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1

Also, the filename already contains the .csv extension. So change the 9th line to this:

temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')

With these changes I ran your code and it works. I suggest you create a virtual environment, install the libraries there, and run the code in it.
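
Putting both points together, a sketch of the read loop with the corrected path (bucket_name and prefix are assumed to be defined as in the question):

from google.cloud import storage
import pandas as pd

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

list_temp_raw = []
for file in blobs:
    # file.name already ends in .csv, so nothing extra is appended to the path
    temp = pd.read_csv('gs://' + bucket_name + '/' + file.name, encoding='utf-8')
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)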

llompalles

This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import gcsfs and dask yourself, check whether _filesystems exists, and see its contents:

In [1]: import dask.bytes.core

In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}

In [3]: import gcsfs

In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
 'gcs': gcsfs.dask_link.DaskGCSFileSystem,
 'gs': gcsfs.dask_link.DaskGCSFileSystem}

As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better when it is unable to register itself with Dask, so updating may solve your problem.
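
After updating, a more defensive form of the same check avoids assuming the private _filesystems attribute exists (it can be absent depending on the dask version):

import gcsfs
import dask.bytes.core

# getattr with a default returns None instead of raising AttributeError
registry = getattr(dask.bytes.core, '_filesystems', None)
print(gcsfs.__version__)
print(None if registry is None else registry.get('gcs'))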

mdurant

A few things to point out in the code above: bucket_name and prefix need to be defined, and the iteration over the filenames should append each dataframe inside the loop; otherwise only the last one gets concatenated.

from google.cloud import storage
import pandas as pd

storage_client = storage.Client()

buckets_list = list(storage_client.list_buckets())
bucket_name='my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)
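
If the bucket also contains objects that are not CSV files (folder placeholder objects, for example), a small guard in the loop keeps them out of read_csv (a sketch based on the code above):

list_temp_raw = []
for file in blobs:
    filename = file.name
    if not filename.endswith('.csv'):
        # skip folder placeholders and any other non-CSV objects returned by list_blobs()
        continue
    temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
    list_temp_raw.append(temp)

# ignore_index=True gives the combined frame a fresh 0..n-1 row index
df = pd.concat(list_temp_raw, ignore_index=True)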
FerhatSF