
I am trying to read multiple parquet files, stored as partitions in Google Cloud Storage, into a single pandas data frame. The files live under gs://path/to/storage/folder/, which contains one event_date=* directory per date, and each of those directories holds multiple parquet files.

The directory structure looks like this:

--gs://path/to/storage/folder/
   ---event_date=2023-01-01/
      ---abc.parquet
      ---def.parquet
   ---event_date=2023-01-02/
      ---ghi.parquet
      ---jkl.parquet

I want to load all of this into one pandas data frame, and I used the code below:

import pandas as pd
import gcsfs
from pyarrow import parquet

url = "gs://path/to/storage/folder/event_date=*/*" 
fs = gcsfs.GCSFileSystem()


files = ["gs://" + path for path in fs.glob(url)]
print(files)
data = parquet.ParquetDataset(files, filesystem=fs)
multiple_dates_df = data.read().to_pandas()
print(multiple_dates_df.shape)

But I get the following error:

OSError: Passed non-file path: gs://path/to/storage/folder/event_date=2023-01-01/abc.parquet

How do I fix this?

Regressor
  • Can you have a look at this [code snippet](https://gist.github.com/lpillmann/fa1874c7deb8434ca8cba8e5a045dde2)? – Sathi Aiswarya Feb 22 '23 at 09:33
  • Hi @SathiAiswarya - it works when gs_directory_path points to a single parquet file, but I want to load multiple parquet files. – Regressor Feb 22 '23 at 15:04
  • [this answer](https://stackoverflow.com/a/72059752/3242418) solved the problem for me! – lunguini Apr 27 '23 at 09:16

1 Answer


It seems that pandas cannot currently read multiple parquet files stored under a GCS path in a single call. A bug has been raised for this on GitHub and is still open; further progress can be tracked there.
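
Until that is resolved, one possible workaround is to skip the multi-file read entirely: open each file through the filesystem object and concatenate the results. The sketch below is only an illustration under a few assumptions. The helper names `partition_value` and `load_partitioned_parquet` are hypothetical, `fs` is assumed to be an fsspec-style filesystem such as `gcsfs.GCSFileSystem()`, and the `event_date` value is assumed to live only in the directory name rather than inside the files.

```python
import re

import pandas as pd


def partition_value(path, key):
    """Return the value of a hive-style `key=value` path segment, or None."""
    match = re.search(rf"(?:^|/){re.escape(key)}=([^/]+)", path)
    return match.group(1) if match else None


def load_partitioned_parquet(fs, pattern, partition_key="event_date"):
    """Read every file matching `pattern` and concatenate into one DataFrame.

    `fs` is any fsspec-style filesystem, e.g. gcsfs.GCSFileSystem().
    """
    frames = []
    for path in sorted(fs.glob(pattern)):
        # fs.glob returns paths without the "gs://" scheme, and fs.open
        # accepts them in that form, so no prefix is added here.
        with fs.open(path, "rb") as handle:
            frame = pd.read_parquet(handle)
        # Assumption: the partition column is not stored inside the file,
        # so it is rebuilt from the directory name (as a string).
        frame[partition_key] = partition_value(path, partition_key)
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no files matched {pattern!r}")
    return pd.concat(frames, ignore_index=True)
```

With the layout from the question, this would be called as `load_partitioned_parquet(gcsfs.GCSFileSystem(), "path/to/storage/folder/event_date=*/*.parquet")`. Reading file by file is slower than a true dataset read, but it avoids passing a list of `gs://` URLs to `ParquetDataset`, which is the call that fails in the question.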

Sathi Aiswarya