I'm trying to get a pandas DataFrame into AWS S3 in a specific format, and the main pain point for me is converting the DataFrame into a Parquet file. I've been trying to do this with pyarrow, and I know there are functions meant for exactly this. The closest I've gotten is code along these lines:
import pyarrow

# dataframe and s3_client are created earlier in the script
buf = pyarrow.BufferOutputStream()
# write the DataFrame into the in-memory buffer as uncompressed parquet
dataframe.to_parquet(buf, compression=None)
# snappy-compress the whole buffer, then upload the compressed bytes
final = pyarrow.compress(buf.getvalue(), codec='snappy')
s3_client.Object('bucket', 'key/file.parquet.snappy').put(Body=final.to_pybytes())
This gives me a file in S3, but when I try to preview it in Athena or DataBrew, it can't be read. I've tried getting rid of the snappy step entirely, but I got the same issue with the compression removed, so I'm assuming the problem is in the parquet conversion itself, but I can't seem to find alternatives in Python. I also tried passing the snappy compression parameter to to_parquet directly (roughly like the snippet below), but that didn't shrink the file nearly as much as I expected.
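For reference, that to_parquet variant looked roughly like this (the bucket and key names are placeholders, and dataframe / s3_client are the same objects as above):

import io

buf = io.BytesIO()
# let to_parquet apply snappy internally instead of compressing the whole file afterwards
dataframe.to_parquet(buf, engine='pyarrow', compression='snappy')
s3_client.Object('bucket', 'key/file.parquet').put(Body=buf.getvalue())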
I've also tried converting the DataFrame to a pyarrow Table and using the write_table function, followed by practically the same steps (see the sketch below), but ended up with the same result.
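That attempt was roughly the following (again with placeholder bucket and key names):

import pyarrow as pa
import pyarrow.parquet as pq

buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(dataframe)
# write the Table as uncompressed parquet, then compress and upload as before
pq.write_table(table, buf, compression=None)
final = pa.compress(buf.getvalue(), codec='snappy')
s3_client.Object('bucket', 'key/file.parquet.snappy').put(Body=final.to_pybytes())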
My questions are: am I misunderstanding the process for doing this with pyarrow? Is there a better alternative I'm not seeing? Or is the conversion from parquet to bytes messing with the parquet format by the time the object lands in S3?