I have a pandas dataframe. I want to write this dataframe to a parquet file in S3. I need a sample code for the same. I tried to Google it, but I could not get a working sample.
5 Answers
For your reference, the following code works for me.
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the .aws folder.
Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909
- I have multiple profiles in my .aws/config and credentials files... is there a way to set which profile to use? (I suppose setting my ENV var AWS_PROFILE= would work, but it would be nice to do it in code) – Brian Wylie Oct 24 '19 at 17:07
- Yes, you first import `boto3`, then set your profile using `session = boto3.Session(profile_name="your_profile")` – Wai Kiat Oct 25 '19 at 02:24
- For completeness, if you want a `.parquet` output file, drop the compression arg and change the file name to `.parquet`: `s3_url = 's3://bucket/folder/bucket.parquet'; df.to_parquet(s3_url)` – Rajat Sep 24 '21 at 04:33
- Fully agree with ending the filename in .parquet, because .gzip implies you need to unzip it. My comment is to warn of a caveat when using to_parquet(...): if you use engine='fastparquet' and provide partition_cols, to_parquet leaves a trail of directories starting with "s3:" in your working dir. Be warned. – michaelgbj Nov 11 '21 at 16:58
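Regarding the profile question in the comments: with pandas 1.2+ you can also pass storage_options, which (for s3:// URLs) are handed to s3fs, and s3fs accepts a named profile, so the profile can be selected in code rather than via an environment variable. A minimal sketch, assuming a recent s3fs; the bucket path and profile name are placeholders:
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# storage_options is forwarded to s3fs.S3FileSystem; "profile" selects
# a named profile from ~/.aws/credentials (placeholder name here)
df.to_parquet(
    "s3://bucket/folder/bucket.parquet",
    storage_options={"profile": "my-profile"},
)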
The function below gets the parquet output into a buffer and then writes buffer.getvalue() to S3, without any need to save the parquet file locally.
Also, since you're creating an S3 client, you can use credentials from AWS keys that are stored locally, in an Airflow connection, or in AWS Secrets Manager.
from io import BytesIO, StringIO

def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)
    # upload the in-memory buffer straight to S3
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
s3_client is just a boto3 client object. Hope this helps!
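A call could look like this (the bucket and key names below are placeholders):
import boto3
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
s3_client = boto3.client("s3")  # picks up credentials from ~/.aws or the environment
dataframe_to_s3(s3_client, df, "my-bucket", "folder/data.parquet", "parquet")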

- For anyone wondering what input_dataframe.to_parquet is: https://stackoverflow.com/questions/41066582/python-save-pandas-data-frame-to-parquet-file – JOHN Feb 18 '20 at 06:43
- For data containing timestamps: https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error – JOHN Feb 18 '20 at 07:08
- I followed this and got garbage values written in the file. :( What could be going wrong? – ShwetaJ Nov 16 '20 at 16:39
- @gurjarprateek, it seems some of the data is being lost even though I'm not seeing any errors. At first I believed it to be a lack of memory (the DFs are somewhat large), but I'd expect an error message – Lucas Abreu Feb 02 '22 at 15:43
- @LucasAbreu this could happen if the size of the data is greater than system memory – gurjarprateek Jul 25 '22 at 23:07
- @khan you can modify the function to add compression. By default it uses snappy compression: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html#pandas-dataframe-to-parquet – gurjarprateek Apr 24 '23 at 19:45
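For the compression point in the last comment, a minimal sketch of that modification, exposing compression as a parameter on the parquet path (gzip here is only an example, snappy is the pandas default, and the helper name is only illustrative):
from io import BytesIO

def dataframe_to_parquet_bytes(input_dataframe, compression='gzip'):
    # same idea as the parquet branch of the function above,
    # with compression forwarded to DataFrame.to_parquet
    out_buffer = BytesIO()
    input_dataframe.to_parquet(out_buffer, index=False, compression=compression)
    return out_buffer.getvalue()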
First ensure that you have pyarrow or fastparquet installed with pandas.
Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files, located in the .aws folder.
Here is a simple script using pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.
Sample code including imports:
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")

def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # write the dataframe to a local temporary parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to s3 (parquet is binary, so read it back in "rb" mode)
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
    s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)

- Your example would have looked cleaner with the imports. I also think you will get more points if you add a second example using BytesIO as a buffer. – pitchblack408 Apr 22 '21 at 00:45
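Following up on the comment above, a sketch of the same upload without the temporary file, writing the Arrow table into an in-memory buffer instead (same assumptions as the answer: pyarrow and boto3 installed, credentials configured; the function name is only illustrative):
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO

def write_pandas_parquet_to_s3_buffered(df, bucket_name, key_name):
    # serialize the DataFrame to parquet bytes in memory
    table = pa.Table.from_pandas(df)
    buffer = BytesIO()
    pq.write_table(table, buffer)
    # upload the raw bytes to S3
    s3 = boto3.client("s3")
    s3.put_object(Body=buffer.getvalue(), Bucket=bucket_name, Key=key_name)

write_pandas_parquet_to_s3_buffered(
    pd.DataFrame({"data1": ["value1"]}), "bucket", "folder/test/file.parquet")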
For Python 3.6+, AWS has a library called aws-data-wrangler (package name awswrangler) that helps with the integration between pandas, S3, and Parquet.
To install it, run:
pip install awswrangler
If you want to write your pandas dataframe to S3 as a parquet file, do:
import awswrangler as wr
wr.s3.to_parquet(
    dataframe=df,
    path="s3://my-bucket/key/my-file.parquet"
)
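Note that the keyword has changed in awswrangler 1.x+ (the project is now also published as AWS SDK for pandas): the dataframe is passed as df rather than dataframe. A sketch against the newer API, with bucket and key as placeholders:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)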

- Caveat: unlike `pandas.DataFrame.to_parquet()`, wrangler has no option to pass kwargs to the underlying parquet library. This means that you can't set lower-level options if you need to. I ran into this issue when PyArrow failed to infer the table schema; in pandas, you can work around this by [explicitly defining](https://stackoverflow.com/a/66805787/4212158) a PyArrow schema – crypdick Jul 13 '21 at 15:10
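The pandas-side workaround referenced in that comment looks roughly like this: to_parquet forwards extra keyword arguments to the pyarrow engine, so an explicit schema can be supplied instead of relying on inference (the column names and types here are only illustrative):
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# an explicit schema avoids relying on pyarrow's type inference
schema = pa.schema([
    ("id", pa.int64()),
    ("name", pa.string()),
])
df.to_parquet("s3://my-bucket/key/my-file.parquet", engine="pyarrow", schema=schema)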
Just to provide a further example using kwargs to force an overwrite.
My use case is that the partition structure ensures that if I reprocess an input file the output parquet should overwrite whatever is in the partition. To do that I am using kwargs passed through to pyarrow:
s3_url = "s3://<your-bucketname>/<your-folderpath>/"
df.to_parquet(s3_url,
compression='snappy',
engine = 'pyarrow',
partition_cols = ["GSDate","LogSource", "SourceDate"],
existing_data_behavior = 'delete_matching')
That last argument (existing_data_behavior) is part of **kwargs passed through to the underlying pyarrow write_dataset (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset).
Without it, a rerun would create duplicate data. As noted above, this requires s3fs.
