
I'm trying to get a pandas DataFrame into AWS S3 in a specific format, and the main pain point for me is getting the DataFrame into a parquet file. I've been trying to do this with pyarrow, and I know there are specific functions for this kind of thing. The closest I seem to have gotten is code along the lines of:

     import boto3
     import pyarrow

     # Write the DataFrame to an in-memory parquet file without compression.
     buf = pyarrow.BufferOutputStream()
     dataframe.to_parquet(buf, compression=None)

     # Snappy-compress the finished parquet bytes and upload them to S3.
     final = pyarrow.compress(buf.getvalue(), codec='snappy')
     s3 = boto3.resource('s3')
     s3.Object('bucket', 'key/file.parquet.snappy').put(Body=final.to_pybytes())

This gives me a file in S3, but when I try to preview it in Athena or DataBrew, it can't be read. I've tried removing the snappy compression step, but I got the same issue with it removed, so I'm assuming there is a problem with the parquet conversion itself, but I can't seem to find alternatives in Python. Also, using the snappy compression parameter on the to_parquet function didn't compress the file nearly as much as I expected.
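
For comparison, here's a minimal sketch of the more direct route I'd expect to work, letting to_parquet apply snappy internally instead of compressing the finished file afterwards (the bucket and key names are placeholders):

     import io

     import boto3
     import pandas as pd

     df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

     # Let the parquet writer apply snappy per column chunk,
     # rather than compressing the finished file as a whole.
     buffer = io.BytesIO()
     df.to_parquet(buffer, engine='pyarrow', compression='snappy')

     # Upload the raw parquet bytes; bucket and key are placeholders.
     s3 = boto3.client('s3')
     s3.put_object(Bucket='my-bucket', Key='key/file.parquet', Body=buffer.getvalue())

The resulting object is an ordinary snappy-compressed parquet file, so Athena should be able to read it without any separate decompression step.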

I've also tried converting the DataFrame to a pyarrow Table and using the write_table function, followed by practically the same steps, but got the same result (roughly the route sketched below).
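
For completeness, here's a sketch of how I'd expect that write_table route to look if the compression is handed to write_table rather than applied to the finished bytes (bucket and key are again placeholders):

     import boto3
     import pandas as pd
     import pyarrow as pa
     import pyarrow.parquet as pq

     df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

     # Convert the DataFrame to an Arrow Table and write it to an in-memory
     # parquet file, letting write_table handle the snappy compression.
     table = pa.Table.from_pandas(df)
     buf = pa.BufferOutputStream()
     pq.write_table(table, buf, compression='snappy')

     # Upload the parquet bytes; bucket and key are placeholders.
     s3 = boto3.resource('s3')
     s3.Object('my-bucket', 'key/file.parquet').put(Body=buf.getvalue().to_pybytes())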

My questions are: am I misunderstanding the process of doing this with pyarrow? Is there a better alternative I'm not seeing? Is the conversion from parquet to bytes corrupting the parquet format once it's placed in S3?

  • I just wrote an answer for another question, but it does the same thing you are looking for. Can you confirm whether this helps: https://stackoverflow.com/a/73752220/4326922 ? – Prabhakar Reddy Sep 18 '22 at 03:18
  • That way works well to get it into S3 and Glue, but it doesn't compress, even with the compression parameter set to snappy. It also looks like wrangler uses pyarrow anyway, so I'd assume it would have the same issues. It feels closer to a solution than where I was, so thank you! – Ayden Shepherd Sep 19 '22 at 15:46
  • By default it uses snappy compression and you don't need to specify it explicitly. Is it not generating the snappy compressed files like e1e4a7570d714ca9ac722ff2dce13e72.snappy.parquet with the script I provided? – Prabhakar Reddy Sep 20 '22 at 00:47
  • It does generate the file, but for my data it generates a roughly 4.4 KB file, while AWS, pulling in the exact same data, generates a roughly 2.1 to 3.1 KB file, depending on where I'm generating from and the file type. That's insignificant for the data I'm working with now, but there is other data that is gigabytes in size where it's important to be as compressed as possible. I plan on manually comparing the two parquet files ASAP to fully verify they contain the same data (see the metadata-inspection sketch after these comments); I'm still just confused how all of these different ways to convert to parquet give the same data different file sizes. – Ayden Shepherd Sep 21 '22 at 02:34
  • Parquet is columnar and has a lot of metadata written to it, so when compression is applied it's expected that the same number of rows will sometimes be written at different sizes depending on the actual values present. You can use S3 Select to query these parquet files without any tool and compare the two files. Please do upvote if my answer here helped: https://stackoverflow.com/questions/73750110/how-can-we-write-a-dataframe-to-a-table-in-aws-athena/73752220#73752220 – Prabhakar Reddy Sep 21 '22 at 02:37
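
For the file comparison discussed in the comments, here's a rough sketch of inspecting two locally downloaded copies with pyarrow's parquet metadata API (the file names are placeholders):

     import pyarrow.parquet as pq

     # Compare row counts, compression codecs, and sizes of the two files;
     # the file names are placeholders for local copies.
     for path in ('wrangler_output.snappy.parquet', 'aws_generated.snappy.parquet'):
         meta = pq.ParquetFile(path).metadata
         col = meta.row_group(0).column(0)
         print(path)
         print('  rows:', meta.num_rows, 'row groups:', meta.num_row_groups)
         print('  first column codec:', col.compression)
         print('  compressed/uncompressed bytes:',
               col.total_compressed_size, '/', col.total_uncompressed_size)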

0 Answers