I have this script and I would like to make it quicker, if possible. It reads a Parquet dataset from S3 into pandas.
import pandas as pd
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
s3_path = 's3:// ... '
paths = [path for path in s3.ls(s3_path) if path.endswith(".parquet")]
dataset = pq.ParquetDataset(paths, filesystem=s3)
Up to here it is very quick and works well, but since working with an Arrow Table directly is not very flexible, I searched on SO for how to convert it to pandas and found this:
table = dataset.read()
df = table.to_pandas()
Unfortunately, it takes hours to read 3 GB of Parquet. Is there any tip or trick to make it quicker? Could you help me, please?
Thank you very much in advance!