I have a pandas dataframe. I want to write this dataframe to a parquet file in S3. I need a sample code for the same. I tried to Google it, but I could not get a working sample.
5 Answers
For your reference, the following code works for me.
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the .aws folder.
Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909
- I have multiple profiles in my .aws/config and credentials files... is there a way to set which profile to use? (I suppose setting my ENV var AWS_PROFILE= would work, but it would be nice to do it in code) – Brian Wylie Oct 24 '19 at 17:07
- Yes, you first import `boto3`, then set your profile using `session = boto3.Session(profile_name="your_profile")` – Wai Kiat Oct 25 '19 at 02:24
- For completeness, if you want a `.parquet` output file, drop the compression arg and change the file name to `.parquet`: `s3_url = 's3://bucket/folder/bucket.parquet'; df.to_parquet(s3_url)` – Rajat Sep 24 '21 at 04:33
- Fully agree with ending the filename in .parquet, because .gzip implies you need to unzip it. My comment is to warn of a caveat when using to_parquet(...): if you use engine='fastparquet' and provide partition_cols, to_parquet leaves a trail of directories starting with "s3:" in your working dir. Be warned. – michaelgbj Nov 11 '21 at 16:58
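Regarding the profile question in the comments: with pandas 1.2+ you can also pass storage_options, which (for s3:// URLs) are handed to s3fs, and s3fs accepts a named profile, so the profile can be selected in code rather than via an environment variable. A minimal sketch, assuming a recent s3fs; the bucket path and profile name are placeholders:
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# storage_options is forwarded to s3fs.S3FileSystem; "profile" selects
# a named profile from ~/.aws/credentials (placeholder name here)
df.to_parquet(
    "s3://bucket/folder/bucket.parquet",
    storage_options={"profile": "my-profile"},
)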
The function below gets the parquet output into a buffer and then writes buffer.getvalue() to S3, without any need to save the parquet file locally.
Also, since you're creating an S3 client, you can use credentials from AWS keys that are stored locally, in an Airflow connection, or in AWS Secrets Manager.
from io import BytesIO, StringIO

def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)
    # upload the in-memory buffer straight to S3
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
s3_client is just a boto3 client object. Hope this helps!
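A call could look like this (the bucket and key names below are placeholders):
import boto3
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
s3_client = boto3.client("s3")  # picks up credentials from ~/.aws or the environment
dataframe_to_s3(s3_client, df, "my-bucket", "folder/data.parquet", "parquet")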

- For anyone wondering what input_dataframe.to_parquet is: https://stackoverflow.com/questions/41066582/python-save-pandas-data-frame-to-parquet-file – JOHN Feb 18 '20 at 06:43
- For data containing timestamps: https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error – JOHN Feb 18 '20 at 07:08
- I followed this and got garbage values written in the file. :( What could be going wrong? – ShwetaJ Nov 16 '20 at 16:39
- @gurjarprateek, it seems some of the data is being lost even though I'm not seeing any errors. At first I believed it to be a lack of memory (the DFs are somewhat large), but I'd expect an error message – Lucas Abreu Feb 02 '22 at 15:43
- @LucasAbreu this could happen if the size of the data is greater than system memory – gurjarprateek Jul 25 '22 at 23:07
- @khan you can modify the function to add compression. By default it uses snappy compression: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html#pandas-dataframe-to-parquet – gurjarprateek Apr 24 '23 at 19:45
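For the compression point in the last comment, a minimal sketch of that modification, exposing compression as a parameter on the parquet path (gzip here is only an example, snappy is the pandas default, and the helper name is only illustrative):
from io import BytesIO

def dataframe_to_parquet_bytes(input_dataframe, compression='gzip'):
    # same idea as the parquet branch of the function above,
    # with compression forwarded to DataFrame.to_parquet
    out_buffer = BytesIO()
    input_dataframe.to_parquet(out_buffer, index=False, compression=compression)
    return out_buffer.getvalue()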
First ensure that you have pyarrow or fastparquet installed with pandas.
Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files, located in the .aws folder.
Here is a simple script using pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.
Sample code including imports:
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")

def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # write the dataframe to a local temporary parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to s3 (parquet is binary, so read it back in "rb" mode)
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
    s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)

- Your example would have looked cleaner with the imports. I also think you will get more points if you add a second example using BytesIO as a buffer. – pitchblack408 Apr 22 '21 at 00:45
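Following up on the comment above, a sketch of the same upload without the temporary file, writing the Arrow table into an in-memory buffer instead (same assumptions as the answer: pyarrow and boto3 installed, credentials configured; the function name is only illustrative):
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO

def write_pandas_parquet_to_s3_buffered(df, bucket_name, key_name):
    # serialize the DataFrame to parquet bytes in memory
    table = pa.Table.from_pandas(df)
    buffer = BytesIO()
    pq.write_table(table, buffer)
    # upload the raw bytes to S3
    s3 = boto3.client("s3")
    s3.put_object(Body=buffer.getvalue(), Bucket=bucket_name, Key=key_name)

write_pandas_parquet_to_s3_buffered(
    pd.DataFrame({"data1": ["value1"]}), "bucket", "folder/test/file.parquet")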
For Python 3.6+, AWS has a library called aws-data-wrangler (package name awswrangler) that helps with the integration between pandas, S3, and Parquet.
To install it, run:
pip install awswrangler
If you want to write your pandas dataframe to S3 as a parquet file, do:
import awswrangler as wr
wr.s3.to_parquet(
    dataframe=df,
    path="s3://my-bucket/key/my-file.parquet"
)
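Note that the keyword has changed in awswrangler 1.x+ (the project is now also published as AWS SDK for pandas): the dataframe is passed as df rather than dataframe. A sketch against the newer API, with bucket and key as placeholders:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)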

- Caveat: unlike `pandas.DataFrame.to_parquet()`, wrangler has no option to pass kwargs to the underlying parquet library. This means that you can't set lower-level options if you need to. I ran into this issue when PyArrow failed to infer the table schema; in pandas, you can work around this by [explicitly defining](https://stackoverflow.com/a/66805787/4212158) a PyArrow schema – crypdick Jul 13 '21 at 15:10
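The pandas-side workaround referenced in that comment looks roughly like this: to_parquet forwards extra keyword arguments to the pyarrow engine, so an explicit schema can be supplied instead of relying on inference (the column names and types here are only illustrative):
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# an explicit schema avoids relying on pyarrow's type inference
schema = pa.schema([
    ("id", pa.int64()),
    ("name", pa.string()),
])
df.to_parquet("s3://my-bucket/key/my-file.parquet", engine="pyarrow", schema=schema)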
Just to provide a further example using kwargs to force an overwrite.
My use case is that the partition structure ensures that if I reprocess an input file the output parquet should overwrite whatever is in the partition. To do that I am using kwargs passed through to pyarrow:
s3_url = "s3://<your-bucketname>/<your-folderpath>/"
df.to_parquet(s3_url,
compression='snappy',
engine = 'pyarrow',
partition_cols = ["GSDate","LogSource", "SourceDate"],
existing_data_behavior = 'delete_matching')
That last argument (existing_data_behavior) is part of **kwargs passed through to the underlying pyarrow write_dataset (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset).
Without it, a rerun would create duplicate data. As noted above, this requires s3fs.
