How to load multiple partition parquet files from GCS into pandas dataframe?

Question

I am trying to read multiple parquet files, stored as partitions in Google Cloud Storage, into a single pandas DataFrame. The files live under gs://path/to/storage/folder/, and each event_date=* folder inside it contains multiple parquet files.

The directory structure looks like this:

--gs://path/to/storage/folder/
   ---event_date=2023-01-01/
      ---abc.parquet
      ---def.parquet
   ---event_date=2023-01-02/
      ---ghi.parquet
      ---jkl.parquet

I want to load this into a pandas DataFrame, and I used the code below:

import pandas as pd
import gcsfs
from pyarrow import parquet

url = "gs://path/to/storage/folder/event_date=*/*" 
fs = gcsfs.GCSFileSystem()


files = ["gs://" + path for path in fs.glob(url)]
print(files)
data = parquet.ParquetDataset(files, filesystem=fs)
multiple_dates_df = data.read().to_pandas()
print(multiple_dates_df.shape)

But I get the following error:

OSError: Passed non-file path: gs://path/to/storage/folder/event_date=2023-01-01/abc.parquet

How do I fix this?

can you have a look at this [code snippet](https://gist.github.com/lpillmann/fa1874c7deb8434ca8cba8e5a045dde2) — Sathi Aiswarya, Feb 22 '23 at 09:33
Hi @SathiAiswarya - it works when you have the gs_directory_path as 1 single parquet files but I want to load multiple parquet files. — Regressor, Feb 22 '23 at 15:04
[this answer](https://stackoverflow.com/a/72059752/3242418) solved the problem for me! — lunguini, Apr 27 '23 at 09:16

score 3 · Accepted Answer · answered Feb 23 '23 at 11:49

It seems pandas cannot read multiple parquet files stored under a GCS path. There is a bug report for this on GitHub that is still open; further progress can be tracked there.

Sathi Aiswarya

