It's tricky to append data to an existing Parquet file; there is no easy way to do it, and most well-known libraries don't support it.
The Parquet format does support appends in principle: one way is to write a new row group and then rewrite the footer with updated statistics. This will be terrible for small updates, though, since it results in poor compression and too many small row groups.
However, most libraries do not implement this. Here is an interesting discussion I found on the topic:
> I'm closing as Won't Fix. Trying to modify existing files (overwriting the existing file footer) is a pretty big can of worms, and would add a bunch of complication to the codebase to initialize various classes with a partially written file.
There is a feature request for Spark as well, which will not be implemented:
> I'm closing this as invalid. It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
The other answer on this thread simply creates a new file under the same directory. From what I see, though, this might be the only feasible option for most people.
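Here is a minimal sketch of that approach using pyarrow (the `events` directory and the `part-<uuid>` naming scheme are my own illustration, not from the original answer):

```python
import os
import uuid

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def append_as_new_file(df: pd.DataFrame, dir_path: str) -> None:
    """Each 'append' writes a brand-new Parquet file into the shared directory."""
    os.makedirs(dir_path, exist_ok=True)
    pq.write_table(
        pa.Table.from_pandas(df),
        f"{dir_path}/part-{uuid.uuid4().hex}.parquet",  # unique name avoids clobbering
    )

append_as_new_file(pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}), "events")
append_as_new_file(pd.DataFrame({"id": [3], "value": ["c"]}), "events")

# Readers that understand directories treat all the part files as one dataset.
combined = ds.dataset("events", format="parquet").to_table()
print(combined.num_rows)  # 3
```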
What other options do we have?
- Delete and recreate the entire Parquet file every time there is a need to update/append data. It is best to batch the data beforehand to reduce the frequency of file recreation (see the first sketch after this list).
- Write multiple Parquet files, then combine them at a later stage.
- Write multiple Parquet files. The tool you use to read them may support reading multiple files in a directory as a single dataset, as in the sketch above; many big data tools do. Be careful not to write too many small files, which results in terrible read performance.
- Switch to an open table format like Iceberg or Delta Lake, which support appends, updates, and deletes (see the second sketch after this list). However, be wary of making too many small updates/appends/deletes here as well.
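For the delete-and-recreate option, a minimal sketch with pandas. The write-to-temp-then-rename step is my own detail to keep readers from ever seeing a half-written file; the path and helper name are illustrative:

```python
import os

import pandas as pd

def rewrite_with_new_rows(path: str, new_rows: pd.DataFrame) -> None:
    """Rebuild the whole file with the new rows included; batch new_rows beforehand."""
    if os.path.exists(path):
        combined = pd.concat([pd.read_parquet(path), new_rows], ignore_index=True)
    else:
        combined = new_rows
    tmp_path = path + ".tmp"
    combined.to_parquet(tmp_path, index=False)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

rewrite_with_new_rows("data.parquet", pd.DataFrame({"id": [1], "value": ["a"]}))
```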
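And for the open-table-format option, a minimal sketch using the `deltalake` package (the delta-rs Python bindings); the `my_table` path is illustrative:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Creates the table on the first call; each later call appends a new transaction.
write_deltalake("my_table", pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}), mode="append")
write_deltalake("my_table", pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

print(DeltaTable("my_table").to_pandas())
```

Note that each append still lands as new files plus a transaction-log entry, which is why frequent tiny appends eventually need compaction.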
EDIT: I did come across a Python-based library, fastparquet, that allows appending to a single file. Other libraries in other languages, such as Java, might implement the same in the future.
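A minimal sketch of fastparquet's append (the file name is illustrative; `append=True` requires the file to already exist):

```python
import pandas as pd
from fastparquet import write

write("data.parquet", pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))
# Adds a new row group to the existing file and rewrites the footer metadata.
write("data.parquet", pd.DataFrame({"id": [3], "value": ["c"]}), append=True)
```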