I am trying to read multiple Parquet files, stored as partitions in Google Cloud Storage, into a single pandas DataFrame. The data lives under gs://path/to/storage/folder/, which contains one event_date=* directory per date, and each of those directories contains multiple Parquet files.
So the directory structure is something like this -
--gs://path/to/storage/folder/
---event_date=2023-01-01/
----abc.parquet
----def.parquet
---event_date=2023-01-02/
----ghi.parquet
----jkl.parquet
I want to load this into a pandas DataFrame, and I used the code below -
import pandas as pd
import gcsfs
from pyarrow import parquet
url = "gs://path/to/storage/folder/event_date=*/*"
fs = gcsfs.GCSFileSystem()
files = ["gs://" + path for path in fs.glob(url)]
print(files)
data = parquet.ParquetDataset(files, filesystem=fs)
multiple_dates_df = data.read().to_pandas()
print(multiple_dates_df.shape)
But I get the error below -
OSError: Passed non-file path: gs://path/to/storage/folder/event_date=2023-01-01/abc.parquet
How do I fix this?
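One direction that may be worth trying, offered as a sketch and not a confirmed fix: fs.glob() returns paths without the "gs://" scheme, and when an explicit filesystem is passed, pyarrow appears to expect paths in that same scheme-less form. Re-adding "gs://" may therefore be what triggers the "non-file path" error. The helper names strip_gs_scheme and load_partitions below are hypothetical, and the behaviour is an assumption that may differ between pyarrow and gcsfs versions.

```python
def strip_gs_scheme(path: str) -> str:
    """Return `path` without a leading 'gs://' (hypothetical helper)."""
    prefix = "gs://"
    return path[len(prefix):] if path.startswith(prefix) else path


def load_partitions(url: str):
    """Read every Parquet file matching `url` into one pandas DataFrame.

    Assumption: passing scheme-less paths together with filesystem=fs
    avoids the "Passed non-file path" error.
    """
    # Imported lazily so strip_gs_scheme can be used without these packages.
    import gcsfs
    from pyarrow import parquet

    fs = gcsfs.GCSFileSystem()
    # fs.glob() yields "bucket/key" style paths; the strip is only a
    # safeguard in case a scheme-prefixed path slips through.
    files = [strip_gs_scheme(p) for p in fs.glob(url)]
    dataset = parquet.ParquetDataset(files, filesystem=fs)
    return dataset.read().to_pandas()
```

With this in place, the call would be load_partitions("gs://path/to/storage/folder/event_date=*/*"). Whether the event_date value from the directory names comes back as a column when individual files are listed is something to verify on your versions.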