28

I'm using the following code to create a ParquetWriter and to write records to it.

ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<>(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);

final GenericRecord record = new GenericData.Record(avroSchema);
// ... populate the record's fields ...

parquetWriter.write(record);

But it only allows me to create new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the parquetWriter is not feasible in my case.

Devas

3 Answers

22

There is a Spark SaveMode called Append: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html which I believe solves your problem.

Example of use:

df.write.mode('append').parquet('parquet_data_file')
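
The same call in Java looks like this (a minimal sketch, assuming an existing SparkSession and a Dataset<Row> named df; the output path is illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// SaveMode.Append adds new part files under the target directory
// instead of overwriting it; it does not modify existing files in place.
df.write().mode(SaveMode.Append).parquet("parquet_data_file");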
bluszcz
7

It's tricky to append data to an existing Parquet file; at least, there is no easy way to do it (most well-known libraries don't support it).

The Parquet design does support appending: one way to append data is to write a new row group, then recalculate the statistics and rewrite the footer. This would, however, be terrible for small updates (it would result in poor compression and too many small row groups).

However, this is not implemented by most libraries. Here is an interesting discussion I found regarding the same:

I'm closing as Won't Fix. Trying to modify existing files (overwriting the existing file footer) is a pretty big can of worms, and would add a bunch of complication to the codebase to initialize various classes with a partially written file

Here is a feature request for Spark as well, which will not be implemented:

I'm closing this as invalid. It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.

The other answer on this thread (using Spark's append mode) simply creates a new file under the same directory. However, from what I can see, this might be the only feasible option for most people; a sketch of doing the same with the plain Java writer follows.
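
For illustration, a minimal sketch of that per-batch "new file in the same directory" approach using the parquet-avro API from the question (the directory path, naming scheme, avroSchema, and batch are assumptions for the example, not prescribed by any library):

import java.util.UUID;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Instead of appending to one file, write each batch of records to a
// fresh, uniquely named part file under the same directory.
Path part = new Path("/data/events/part-" + UUID.randomUUID() + ".parquet");
try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(part)
        .withSchema(avroSchema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericRecord record : batch) {
        writer.write(record);
    }
}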

What other options do we have?

  • Delete and recreate the entire parquet file every time there is a need to update/append data. Best to batch the data beforehand to reduce the frequency of file recreation.
  • Write multiple parquet files. Then combine them at a later stage.
  • Write multiple parquet files. The tool you are using to read the parquet files may support reading multiple files in a directory as a single file; lots of big-data tools support this (see the sketch after this list). Be careful not to write too many small files, which will result in terrible read performance.
  • Switch to open table formats like Iceberg/Delta that support appends/updates/deletes. However, be wary of making too many small updates/appends/deletes here as well.
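
To illustrate the directory-as-a-dataset option, a minimal sketch in Java with Spark (assuming an existing SparkSession named spark; the path is illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Spark treats all part files under the directory as one logical dataset,
// so each newly written file effectively "appends" to the dataset.
Dataset<Row> events = spark.read().parquet("/data/events");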

EDIT: I did come across a Python-based library, fastparquet, that allows appending. The same might be implemented in the future by libraries in other languages, such as Java.

ns15
-7

Parquet is a columnar file format; it is optimized for writing all columns together. Any edit requires rewriting the whole file.

From Wikipedia:

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:

10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;

Some links

https://en.wikipedia.org/wiki/Column-oriented_DBMS

https://parquet.apache.org/

vgunnu
  • The above answer is inaccurate. Parquet slices columns into chunks and allows parts of a column to be stored in several chunks within a single file, thus append is possible. If you read the design philosophy behind Parquet, it is quite clear that the format was designed for appending, judging by how the block footer is structured. – travnik Feb 07 '17 at 13:46
  • I think append is not supported in the Parquet client API. I know it was there in Spark, but I have doubts about the column storage, which supports reading only the required chunks; in that case, how would append work? There may be a chance of appending to the existing column chunk. Do you have a link with the architectural details? – Devas Feb 14 '17 at 07:22