
I have this script for reading a Parquet dataset from S3 into pandas, and I would like to make it quicker if possible.

import pandas as pd
from pyarrow.parquet import ParquetDataset
import s3fs

s3 = s3fs.S3FileSystem()
s3_path = 's3:// ... '
# list all .parquet files under the prefix and build a dataset from them
paths = [path for path in s3.ls(s3_path) if path.endswith(".parquet")]
dataset = ParquetDataset(paths, filesystem=s3)

Up to this point it is very quick and works well,

but since working with Parquet directly is not very flexible, I searched on SO for how to do this in pandas and found this:

table = dataset.read()
df = table.to_pandas()

Unfortunately, it takes hours to read 3 GB of Parquet. Is there any tip or trick to make it quicker? Could you help me, please?

Thank you very much in advance!

Andrew Tulip

1 Answer


Is there any reason why you are using s3fs? If not, you could try skipping this intermediate step and working with pandas.read_parquet(), which can operate directly on s3:// URLs.
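For example, a minimal sketch (the bucket path is a placeholder, and pandas needs s3fs/fsspec installed to resolve the s3:// URL):

import pandas as pd

# read the whole Parquet dataset under the prefix in one call;
# pandas delegates to pyarrow and uses s3fs/fsspec for the S3 access
df = pd.read_parquet("s3://your-bucket/your-prefix/", engine="pyarrow")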
Also, where do you run your script? Depending on that, you might try downloading the files first before reading them in, or increase your compute and memory.
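If you keep s3fs, the download-first approach could look roughly like this (paths are placeholders; assumes enough local disk space):

import pandas as pd
import s3fs

s3 = s3fs.S3FileSystem()

# copy the remote Parquet files to local disk first ...
s3.get("s3://your-bucket/your-prefix/", "local_parquet/", recursive=True)

# ... then read them locally, avoiding many small S3 round trips
df = pd.read_parquet("local_parquet/")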

Edit: Actually, you can try one of the many approaches that have been offered as answers to this question.

mc51