
I have multiple small parquet files generated as the output of a Hive QL job, and I would like to merge those output files into a single parquet file.

What is the best way to do it using HDFS or Linux commands?

We used to merge text files using the cat command, but will that work for parquet as well? Can we do it in HiveQL itself when writing the output files, like we do with the repartition or coalesce methods in Spark?

Shankar
  • Using "parquet-tools merge" is not recommended. Parquet cuts its file into row_groups that correspond to HDFS blocks. "parquet-tools merge" only places row_groups one after another without actually merging them, so in the end you still have the same problem. You can find more explanation in [this ticket](https://issues.apache.org/jira/browse/PARQUET-1115), and more background on parquet "row_groups" in this [blog](http://ingest.tips/2015/01/31/parquet-row-group-size/). – Nastasia Aug 22 '18 at 13:50
  • Following the ticket mentioned by @Nastasia, this issue will not be solved (at least for now). Anyhow, the solution adopted by parquet-tools is now to emit a warning (https://github.com/apache/parquet-mr/pull/433). – Markus Jun 07 '19 at 11:31

3 Answers


According to https://issues.apache.org/jira/browse/PARQUET-460, you can now download the source code and compile parquet-tools, which has a built-in merge command.

java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_idr/file_name

Or use a tool like https://github.com/stripe/herringbone

Gray
giaosudau

Using duckdb:

import duckdb

duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")

dridk

You can also do it using HiveQL itself, if your execution engine is MapReduce.

You can set a flag for your query, which causes Hive to merge small files at the end of your job:

SET hive.merge.mapredfiles=true;

or

SET hive.merge.mapfiles=true;

if your job is a map-only job.

This will cause the Hive job to automatically merge many small parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want to end up with just one file, make sure you set it to a value that is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly; set it to a very low value if you want to make sure that Hive always merges files. You can read more about these settings in the Hive documentation.
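Putting the flags above together, a sketch of what the query session might look like (the size values are illustrative, not recommendations, and the table names merged_table and small_files_table are hypothetical):

```sql
-- Merge small files at the end of a map-reduce job
SET hive.merge.mapredfiles=true;
-- ...and at the end of a map-only job
SET hive.merge.mapfiles=true;
-- Target size for merged files (256 MB here, purely illustrative)
SET hive.merge.size.per.task=268435456;
-- Trigger the merge step whenever the average output file is smaller than this
SET hive.merge.smallfiles.avgsize=16777216;

-- Hypothetical query: output files below avgsize get merged automatically
INSERT OVERWRITE TABLE merged_table
SELECT * FROM small_files_table;
```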

Jakub Kukul