pandas write dataframe to parquet format with append

Question

I am trying to write a pandas DataFrame to the Parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with the new data. What am I missing?

The write syntax is

df.to_parquet(path, mode='append')

The read syntax is

pd.read_parquet(path)

[try opening the file in append mode](https://stackoverflow.com/a/17531025/1278112) — Shihe Zhang, Nov 09 '17 at 05:57
This does not work (it makes no difference from the previous situation) — Siraj S., Nov 09 '17 at 18:16
From this link "https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file" it looks like append is not supported in the parquet client API — Siraj S., Nov 09 '17 at 18:19
[In the doc](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.to_parquet.html#pandas-dataframe-to-parquet) there is no `append` mode for the `to_parquet()` API. If you want to append to a file, the `append` mode is for the file. That's what I tried to express earlier. — Shihe Zhang, Nov 10 '17 at 01:04
see here https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file — Andrey, May 24 '21 at 10:17
In case you want to append to the SAME file, then forget my comment, but sometimes it could be useful to write the new parquet file to the same directory under another name. Then, next time, you can read the directory instead of a specific file and you will get the data from every parquet file in that directory — Ariel Catala Valencia, Jul 15 '22 at 14:14

score 19 · Answer 1 · answered Oct 26 '22 at 14:48

It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.

Below is from the pandas docs:

DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)

We have to pass in both engine and **kwargs.

engine{‘auto’, ‘pyarrow’, ‘fastparquet’}

**kwargs - Additional arguments passed to the parquet library.

**kwargs - here we need to pass append=True (a fastparquet option)

import os.path

import pandas as pd

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    # First run: create the file
    df.to_parquet(file_path, engine='fastparquet')
else:
    # Later runs: append the rows to the existing file
    df.to_parquet(file_path, engine='fastparquet', append=True)

If append is set to True and the file does not exist, you will see the error below:

AttributeError: 'ParquetFile' object has no attribute 'fmd'

After running the above script 3 times, the parquet file contains the two rows from df three times (6 rows in total).

If I inspect the metadata, I can see that this resulted in 3 row groups.

Note:

Append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups. Compression will work better, since compression operates within a row group only. There will also be less overhead spent on storing statistics, since each row group stores its own statistics.
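
To avoid many tiny row groups, one option is to buffer incoming rows and append only once enough have accumulated. This is a minimal sketch, assuming rows arrive as dicts; `RowBuffer`, its `min_rows` threshold, the `write_batch` callback and `append_to_parquet` are hypothetical names, not part of pandas or fastparquet.

```python
class RowBuffer:
    """Collect rows and hand them off in batches of at least `min_rows`."""

    def __init__(self, write_batch, min_rows=100_000):
        self.write_batch = write_batch  # callable taking a list of row dicts
        self.min_rows = min_rows
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.min_rows:
            self.flush()

    def flush(self):
        """Write whatever is buffered (call once more at shutdown)."""
        if self.rows:
            self.write_batch(self.rows)
            self.rows = []


def append_to_parquet(rows, file_path="output.parquet"):
    """Example sink (assumed): append one batch to a file via fastparquet."""
    import os.path
    import pandas as pd  # imported lazily so RowBuffer has no dependencies

    df = pd.DataFrame(rows)
    # append=True needs an existing file, so the first batch creates it
    df.to_parquet(file_path, engine="fastparquet",
                  append=os.path.isfile(file_path))


# Demo with an in-memory sink instead of a real file
batches = []
buf = RowBuffer(batches.append, min_rows=3)
for i in range(7):
    buf.add({"col1": i, "col2": i * 2})
buf.flush()  # writes the final partial batch
print([len(b) for b in batches])
```

In real use, `RowBuffer(append_to_parquet)` would write a batch only every 100,000 rows, plus one final partial batch on `flush()`.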

I tried this with pyarrow and it failed, so it seems to only work with fastparquet as the author suggests — DarkHark, Jun 12 '23 at 21:22

score 6 · Answer 2 · edited Sep 27 '19 at 11:35

To append, do this:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# Create a pyarrow table from your dataframe
table = pa.Table.from_pandas(dataframe)

# Write to the dataset rooted at `output`
pq.write_to_dataset(table, root_path=output)

This will automatically append to your table.

answered Sep 26 '19 at 17:16 by Victor Faro

It will create a directory with a few parquet files, as a pyarrow dataset – banderlog013 Mar 24 '21 at 10:21

score 3 · Answer 3 · edited Mar 11 '23 at 20:28

I used the awswrangler library. It works like a charm.

Below are the reference docs:

https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html

I read from a Kinesis stream, used the kinesis-python library to consume the messages, and wrote them to S3. I have not included the JSON processing logic, since this post deals with the problem of being unable to append data to S3. The code was executed in an AWS SageMaker Jupyter notebook.

Below is the sample code I used:

!pip install awswrangler
import awswrangler as wr
import pandas as pd

# a..g hold values from the JSON processing logic (not shown)
event_data = pd.DataFrame(
    {'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
)
# print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True,  # Optional
    )
    print("write successful")
except Exception as exc:  # renamed so it does not shadow the value `e`
    print(str(exc))

Happy to clarify anything. A few other posts suggest reading the existing data and overwriting it, but that slows down as the data grows, so it is inefficient.

Hey, thanks for this - it seems to create snappy.parquet files. Is there a way to create singular parquet files, or at least non-snappy files? — ethereumbrella, Feb 11 '21 at 20:31

score 1 · Answer 4 · answered Mar 10 '18 at 12:02

There is no append mode in pandas.DataFrame.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting the original.

answered Mar 10 '18 at 12:02 by ben26941

score 0 · Answer 5 · answered Sep 19 '22 at 14:35

Use the fastparquet write function

from fastparquet import write

write(file_name, df, append=True)

The file must already exist, as I understand it.

API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write

score 0 · Answer 6 · answered Jul 07 '23 at 11:10

If you are considering the use of partitions:

As per the PyArrow docs (this is the function called behind the scenes when using partitions), you might want to combine partition_cols with a unique basename_template name, i.e. something like the following:

df.to_parquet(root_path, partition_cols=["..."], basename_template="{i}")

You can omit basename_template if df does not overlap existing data. If it does overlap, omitting it would create duplicate .parquet files.

This is very handy if your partition column consists of timestamps. This way you can actually have a "rolling" DataFrame and no duplicates would be written; only new files corresponding to new times would get created.

score -1 · Answer 7 · edited Aug 02 '22 at 12:04

Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it is already there. To append to a parquet object, just add a new file to the same parquet directory.

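import datetime
import os

import pandas as pd

# `path` is the dataset directory and `df` the DataFrame to add
os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.now(datetime.timezone.utc).timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)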

edited Aug 02 '22 at 12:04 by jtlz2
answered Dec 13 '21 at 15:30 by natbusa

It strikes me that this would scale linearly in time, i.e. better than does `append` mode as suggested by @Naveen in https://stackoverflow.com/a/64814917/1021819 - am I right? And `to_parquet()` supports S3, correct? – jtlz2 Aug 02 '22 at 12:06
