I have a quite hefty parquet file where I need to change the values of one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering whether there is a less expensive and overall easier solution to this.
-
No. You have to recreate the file. – Dan Osipov Mar 03 '15 at 22:35
-
@DanOsipov Thanks. I assume this limitation is due to the various compression algorithms used, which would make it hard or even impossible to update column values in place. – marcin_koss Sep 06 '16 at 00:34
-
I would say this is a far more fundamental question rather than a Parquet-specific one. In the world of high data volumes, where Parquet is used a lot, immutability is something you want to care about. From this perspective you would load the data, transform it, and then write it again. You might consider writing only the columns that you need, which makes it more efficient since it is a column-wise format. – Fokko Driesprong Dec 15 '16 at 12:10
-
I understood that you'd like to update a field already written in a previous run. Maybe this article could help. I'm not promoting any product. Please focus on the concepts involved, not on products advertised. https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html – Richard Gomes Jun 29 '20 at 09:28
4 Answers
Let's start with the basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
- Does parquet support `append` operations?
- Does the file system (namely, HDFS) allow `append` on files?
- Can the job framework (Spark) implement `append` operations?
Answers:
- `parquet.hadoop.ParquetFileWriter` only supports `CREATE` and `OVERWRITE`; there is no `append` mode. (Not sure, but this could potentially change in other implementations -- the parquet design does support `append`.)
- HDFS allows `append` on files via the `dfs.support.append` property.
- The Spark framework does not support `append` to existing parquet files, and has no plans to; see this JIRA.
It is not a good idea to append to an existing file in distributed systems, especially given that we might have two writers at the same time.
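A minimal PySpark sketch of that recreate-rather-than-append pattern; the file paths, column name, and replacement rule below are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rewrite-parquet-column").getOrCreate()

# Read the existing parquet data (hypothetical path).
df = spark.read.parquet("/data/events.parquet")

# Change the values of one column; here every "ERR" status becomes "ERROR".
df_fixed = df.withColumn(
    "status",
    F.when(F.col("status") == "ERR", "ERROR").otherwise(F.col("status")),
)

# Write a NEW set of parquet files and swap them in afterwards; overwriting
# the same path you are still reading from within one job is not safe.
df_fixed.write.mode("overwrite").parquet("/data/events_fixed.parquet")
```

If downstream consumers only need a subset of columns, writing just those columns (as one of the comments on the question suggests) makes the rewrite cheaper, since parquet is column-oriented.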
-
This is a great explanation. A few points here: 1) If Parquet by design supports append, then isn't it wrong to say that it is immutable? 2) Could you please help me understand why appending to an existing file is not good in distributed systems? – Ritesh Mar 30 '23 at 12:34
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending the unmodified row groups and the modified row groups (generated by reading in one parquet file per row group).
It's surprisingly fast to reassemble a parquet file using row groups.
In theory it should be easy to append to an existing parquet file if you just strip the footer (stats info), append new row groups, and add a new footer with updated stats, but there isn't an API / library that supports it.
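A rough PyArrow sketch of that row-group workflow, assuming the update is a simple per-value substitution; the file names and column are hypothetical (see the finer-grained reading/writing docs linked in the comment below):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

src = pq.ParquetFile("events.parquet")      # hypothetical input file
schema = src.schema_arrow

with pq.ParquetWriter("events_fixed.parquet", schema) as writer:
    for i in range(src.num_row_groups):
        # Work with one row group's worth of data at a time.
        table = src.read_row_group(i)

        # Only amend row groups that actually contain bad values.
        if pc.any(pc.equal(table["status"], "ERR")).as_py():
            fixed = pc.if_else(
                pc.equal(table["status"], "ERR"),
                pa.scalar("ERROR"),
                table["status"],
            )
            table = table.set_column(
                schema.get_field_index("status"), "status", fixed
            )

        # Unmodified row groups are copied through untouched; each call
        # appends the data as new row group(s) in the output file.
        writer.write_table(table)
```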

-
I’ve used this method to update parquet files. When writing parquet files I create a second parquet file which acts like a primary index which tracks what parquet file / row group a keyed record lives in. I’m able to quickly extract the data, modify it and then reassemble the parquet file using its original row groups, minus the extracted row group, plus the modified row group. Here’s some basic info to help with working with row groups. https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing – David Lee Jan 28 '23 at 21:54
Look at this nice blog, which can answer your question and provides a method to perform updates using Spark (Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
when we need to edit the data in our data structures (Parquet), which are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
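As a hedged illustration of that recreate-with-a-UDF idea, here is a small PySpark sketch; the path, column, and correction rule are made up for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("fix-bad-data").getOrCreate()

# A UDF that corrects the bad values (the rule here is purely illustrative).
@F.udf(returnType=StringType())
def fix_country(value):
    return "United Kingdom" if value == "UK" else value

df = spark.read.parquet("/data/users.parquet")      # hypothetical path
df_fixed = df.withColumn("country", fix_country(F.col("country")))

# The parquet files are recreated at a new location rather than edited in place.
df_fixed.write.mode("overwrite").parquet("/data/users_fixed.parquet")
```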
If you want to incrementally append the data in Parquet (you didn't ask this question, but it may still be useful for other readers), refer to this well-written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I haven't written those blogs, I just read them and found they might be useful for others.
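A minimal sketch of what that kind of incremental append typically looks like in PySpark, assuming each run loads only a new batch into a partitioned parquet directory; the paths and partition column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Each run reads only the newly arrived source data (hypothetical path).
new_batch = spark.read.json("/landing/2017-03-14/")

# "append" adds new parquet files under the target directory; it does not
# modify or append to the files that are already there.
(new_batch.write
    .mode("append")
    .partitionBy("ingest_date")      # hypothetical partition column
    .parquet("/data/events_parquet/"))
```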

- 1
- 1

- 1,338
- 13
- 15
-
I understood that the question involves something a bit more complicated than simply appending partitions, as the blog post explains. Suppose a scenario where some data is valid until some more data arrives; that could easily be implemented on a SQL database, but it requires creativity when you cannot simply update a field of a record, which is the case with parquet files. – Richard Gomes Jun 29 '20 at 08:41
You must re-create the file; this is the Hadoop way, especially if the file is compressed.
Another approach (very common in big data) is to do the update on another Parquet (or ORC) file, then JOIN / UNION at query time.
Well, in 2022 I strongly recommend using a lakehouse solution such as Delta Lake or Apache Iceberg. They will take care of that for you.
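For reference, a hedged sketch of what such an update looks like with Delta Lake (this assumes the delta-spark package is installed and configured; the table path, condition, and new value are made up):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-update")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog",
    )
    .getOrCreate()
)

delta_table = DeltaTable.forPath(spark, "/data/events_delta")   # hypothetical path

# Behaves like a SQL UPDATE: Delta rewrites only the affected parquet files
# and records the change in its transaction log.
delta_table.update(
    condition="status = 'ERR'",
    set={"status": "'ERROR'"},
)
```

Under the hood the parquet files are still immutable; the transaction log is what makes this look like an in-place update.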
