
Does parquet allow appending to a parquet file periodically?

How does appending relate to partitioning, if at all? For example, if I identified a low-cardinality column and partitioned by that column, and then appended more data, would parquet automatically preserve the partitioning, or would one have to repartition the file?

Abhishek Malik
  • parquet files cannot be modified, is my understanding. pyarrow does not allow appending to parquet files either. The only option to 'append' to a partitioned parquet file is using the Spark API: https://stackoverflow.com/a/42140475/1157754. I'd point you to the comments under that answer though, as this is not truly an 'append'. – TDrabas Sep 09 '21 at 21:40

2 Answers


Does parquet allow appending to a parquet file periodically ?

Yes and No. The parquet spec describes a format that could be appended to by reading the existing footer, writing a row group, and then writing out a modified footer. This process is described a bit here.

Not all implementations support this operation. The only implementation I am aware of at the moment is fastparquet (see this answer). It is usually acceptable, less complex, and potentially better for performance to cache and batch instead: either buffer the data in memory, or write the small files and combine them into larger files at some point later.
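As a rough sketch of the batching approach (not from the original answer), here is one way to combine several small parquet files into a single larger file with pyarrow; the file names are hypothetical.

import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical small files produced by periodic writes
small_files = ["part-0001.parquet", "part-0002.parquet", "part-0003.parquet"]

# read each small file and concatenate them into a single table
tables = [pq.read_table(path) for path in small_files]
combined = pa.concat_tables(tables)

# write the batched result out as one parquet file
pq.write_table(combined, "combined.parquet")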

How does appending relate to partitioning if any?

Parquet does not have any concept of partitioning.

Many tools that support parquet implement partitioning. For example, pyarrow has a datasets feature which supports partitioning. If you were to append new data using this feature a new file would be created in the appropriate partition directory.
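As an illustration (assuming pyarrow; the column and path names are placeholders), a minimal sketch of appending to a partitioned dataset with write_to_dataset might look like this. Each call simply adds new files under the matching partition directories, so the partitioning is preserved without repartitioning existing files.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical data with a low-cardinality 'country' column to partition on
df = pd.DataFrame({"country": ["US", "US", "DE"], "value": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# running this again with new data writes additional files into
# country=US/, country=DE/, etc. rather than rewriting existing ones
pq.write_to_dataset(table, root_path="my_dataset", partition_cols=["country"])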

Pace
  • appending row groups to an existing parquet file is possible using the fastparquet library – ns15 Oct 29 '22 at 12:21
  • Hmm...I think fastparquet's append feature is used to add files to data sets and not to add row groups to existing files. – Pace Nov 01 '22 at 09:12
  • It does append a new row group. I have posted the answer below. It's a very useful feature that I didn't know was possible. – ns15 Nov 01 '22 at 09:26
  • @shadow0359 I've updated my answer to reflect that fastparquet supports this operation. Thanks for the help! – Pace Nov 02 '22 at 21:40

It's possible to append row groups to an already existing parquet file using fastparquet.

Here is my SO answer on the same topic.

From the fastparquet docs:

append: bool (False) or ‘overwrite’ If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.

from fastparquet import write
write('output.parquet', df, append=True)

EXAMPLE UPDATE:

Here is a Python script. On the first run it will create a file with one row group; subsequent runs will append row groups to the same parquet file.

import os.path
import pandas as pd
from fastparquet import write

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
file_path = "C:\\Users\\nsuser\\dev\\write_parq_row_group.parquet"
if not os.path.isfile(file_path):
    # first run: create the file with a single row group
    write(file_path, df)
else:
    # subsequent runs: append a new row group to the existing file
    write(file_path, df, append=True)
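To check that each run really adds another row group (this check is not part of the original answer and assumes pyarrow is also installed), you can inspect the file metadata:

import pyarrow.parquet as pq

# the row-group count should increase by one on every run of the script above
print(pq.ParquetFile(file_path).num_row_groups)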
ns15
  • Can you maybe include a complete example? I have been unable to figure out how to use fastparquet to actually append to existing files. [This gist](https://gist.github.com/westonpace/edf7a3cc7532f9b4d7d3a689ade25478) is my current test. – Pace Nov 02 '22 at 16:32
  • @Pace could you check this answer? I have added the example here. https://stackoverflow.com/a/74209756/6563567 – ns15 Nov 02 '22 at 19:02
  • @Pace I have updated the gist with example. – ns15 Nov 02 '22 at 19:21
  • That works great, thanks. I was getting the 'fmd' error and hadn't realized I needed to leave off `append` on the first call. – Pace Nov 02 '22 at 21:33