I have a quite hefty parquet file where I need to change the values of one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering whether there is a less expensive and overall easier solution to this.
-
No. You have to recreate the file. – Dan Osipov Mar 03 '15 at 22:35
-
@DanOsipov Thanks. I assume this limitation is due to the various compression algorithms used, which would make it hard or even impossible to update column values in place. – marcin_koss Sep 06 '16 at 00:34
-
I would say this is a far more fundamental question rather than a Parquet-specific one. In the world of high data volumes, where Parquet is used a lot, immutability is something you want to care about. From this perspective you would load the data, transform it, and then write it again. You might consider writing only the columns that you need, which makes it more efficient since it is a column-wise format. – Fokko Driesprong Dec 15 '16 at 12:10
-
I understood that you'd like to update a field already written in a previous run. Maybe this article could help. I'm not promoting any product. Please focus on the concepts involved, not on products advertised. https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html – Richard Gomes Jun 29 '20 at 09:28
4 Answers
Let's start with the basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
- Does parquet support `append` operations?
- Does the file system (namely, HDFS) allow `append` on files?
- Can the job framework (Spark) implement `append` operations?
Answers:
- `parquet.hadoop.ParquetFileWriter` only supports `CREATE` and `OVERWRITE`; there is no `append` mode. (Not sure, but this could potentially change in other implementations -- the parquet design does support `append`.)
- HDFS allows `append` on files via the `dfs.support.append` property.
- The Spark framework does not support `append` to existing parquet files, and has no plans to; see this JIRA.
It is not a good idea to append to an existing file in distributed systems, especially given that we might have two writers at the same time.
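A minimal PySpark sketch of that recreate-rather-than-append pattern; the file paths, column name, and replacement rule below are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rewrite-parquet-column").getOrCreate()

# Read the existing parquet data (hypothetical path).
df = spark.read.parquet("/data/events.parquet")

# Change the values of one column; here every "ERR" status becomes "ERROR".
df_fixed = df.withColumn(
    "status",
    F.when(F.col("status") == "ERR", "ERROR").otherwise(F.col("status")),
)

# Write a NEW set of parquet files and swap them in afterwards; overwriting
# the same path you are still reading from within one job is not safe.
df_fixed.write.mode("overwrite").parquet("/data/events_fixed.parquet")
```

If downstream consumers only need a subset of columns, writing just those columns (as one of the comments on the question suggests) makes the rewrite cheaper, since parquet is column-oriented.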
-
This is a great explanation. A few points here: 1) If Parquet by design supports append, then isn't it wrong to say that it is immutable? 2) Could you please help me understand why appending to an existing file is not good in distributed systems? – Ritesh Mar 30 '23 at 12:34
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending the unmodified row groups and the modified row groups (generated by reading in one parquet file per row group).
It's surprisingly fast to reassemble a parquet file using row groups.
In theory it should be easy to append to an existing parquet file if you just strip the footer (stats info), append new row groups, and add a new footer with updated stats, but there isn't an API / library that supports it.
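A rough PyArrow sketch of that row-group workflow, assuming the update is a simple per-value substitution; the file names and column are hypothetical (see the finer-grained reading/writing docs linked in the comment below):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

src = pq.ParquetFile("events.parquet")      # hypothetical input file
schema = src.schema_arrow

with pq.ParquetWriter("events_fixed.parquet", schema) as writer:
    for i in range(src.num_row_groups):
        # Work with one row group's worth of data at a time.
        table = src.read_row_group(i)

        # Only amend row groups that actually contain bad values.
        if pc.any(pc.equal(table["status"], "ERR")).as_py():
            fixed = pc.if_else(
                pc.equal(table["status"], "ERR"),
                pa.scalar("ERROR"),
                table["status"],
            )
            table = table.set_column(
                schema.get_field_index("status"), "status", fixed
            )

        # Unmodified row groups are copied through untouched; each call
        # appends the data as new row group(s) in the output file.
        writer.write_table(table)
```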

-
I’ve used this method to update parquet files. When writing parquet files I create a second parquet file which acts like a primary index which tracks what parquet file / row group a keyed record lives in. I’m able to quickly extract the data, modify it and then reassemble the parquet file using its original row groups, minus the extracted row group, plus the modified row group. Here’s some basic info to help with working with row groups. https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing – David Lee Jan 28 '23 at 21:54
Look at this nice blog, which can answer your question and provides a method to perform updates using Spark (Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
when we need to edit the data in our data structures (Parquet), which are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
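As a hedged illustration of that recreate-with-a-UDF idea, here is a small PySpark sketch; the path, column, and correction rule are made up for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("fix-bad-data").getOrCreate()

# A UDF that corrects the bad values (the rule here is purely illustrative).
@F.udf(returnType=StringType())
def fix_country(value):
    return "United Kingdom" if value == "UK" else value

df = spark.read.parquet("/data/users.parquet")      # hypothetical path
df_fixed = df.withColumn("country", fix_country(F.col("country")))

# The parquet files are recreated at a new location rather than edited in place.
df_fixed.write.mode("overwrite").parquet("/data/users_fixed.parquet")
```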
If you want to incrementally append the data in Parquet (you didn't ask this question, but it may still be useful for other readers), refer to this well-written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I haven't written those blogs, I just read them and found they might be useful for others.
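A minimal sketch of what that kind of incremental append typically looks like in PySpark, assuming each run loads only a new batch into a partitioned parquet directory; the paths and partition column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Each run reads only the newly arrived source data (hypothetical path).
new_batch = spark.read.json("/landing/2017-03-14/")

# "append" adds new parquet files under the target directory; it does not
# modify or append to the files that are already there.
(new_batch.write
    .mode("append")
    .partitionBy("ingest_date")      # hypothetical partition column
    .parquet("/data/events_parquet/"))
```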

- 1
- 1

- 1,338
- 13
- 15
-
I understood that the question involves something a bit more complicated than simply appending partitions, as the blog post explains. Suppose a scenario where some data is valid until some more data arrives; that could easily be implemented on a SQL database, but it requires creativity when you cannot simply update a field of a record, which is the case with parquet files. – Richard Gomes Jun 29 '20 at 08:41
You must re-create the file; this is the Hadoop way, especially if the file is compressed.
Another approach (very common in big data) is to do the update on another Parquet (or ORC) file, then JOIN / UNION at query time.
Well, in 2022 I strongly recommend using a lakehouse solution such as Delta Lake or Apache Iceberg. They will take care of that for you.
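For reference, a hedged sketch of what such an update looks like with Delta Lake (this assumes the delta-spark package is installed and configured; the table path, condition, and new value are made up):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-update")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog",
    )
    .getOrCreate()
)

delta_table = DeltaTable.forPath(spark, "/data/events_delta")   # hypothetical path

# Behaves like a SQL UPDATE: Delta rewrites only the affected parquet files
# and records the change in its transaction log.
delta_table.update(
    condition="status = 'ERR'",
    set={"status": "'ERROR'"},
)
```

Under the hood the parquet files are still immutable; the transaction log is what makes this look like an in-place update.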
