
How can I concatenate small Parquet files in Hive when the following are in place?

  • Partitions are created dynamically on the Hive table.
  • The table is EXTERNAL.

What I have tried so far works only for ORC files, and it hits a bug: for ORC files I ran the command below in a loop over all partition values, and the concatenation itself works fine. However, data is lost after the concatenation completes, which is a known Hive bug: https://issues.apache.org/jira/browse/HIVE-17280

I am running the Hive query on an EMR cluster that uses Hive 2.3.3, and this bug was not fixed until Hive 3.0.0.

This is the command I used, which only applies to ORC files; I need the equivalent for Parquet files:

ALTER TABLE HIVE_DB.HIVE_TABLE_NM PARTITION(partition_field_nm ='${partition_value}') CONCATENATE;

I need a similar way to concatenate the small Parquet files in each partition into bigger files.

Deep
  • Try merge and distribute by solution described here: https://stackoverflow.com/a/45266244/2700344 – leftjoin Jun 24 '19 at 08:48
  • The link given as a solution mainly covers jobs running on the MapReduce engine with ORC files. I am looking to concatenate small Parquet files present in partitions. The execution engine is Tez in my case. – Deep Jun 24 '19 at 10:09
  • There is also a solution for Tez described there. Do an insert overwrite, selecting from the table itself, using those settings; it should work. – leftjoin Jun 24 '19 at 10:15
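The "insert overwrite from itself" approach leftjoin refers to can be sketched as follows: rewrite each partition onto itself with Hive's small-file merge settings for Tez enabled. The table and column names below are the question's placeholders, and the size thresholds are illustrative assumptions to tune for your cluster; this is a sketch, not a tested recipe.

```sql
-- Enable the extra file-merge step for Tez output; threshold values
-- (in bytes) are example assumptions, not recommendations.
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=256000000;
-- Dynamic partitioning so every partition is rewritten in one statement.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite the table onto itself; small files in each partition are
-- merged into larger ones by the settings above.
INSERT OVERWRITE TABLE HIVE_DB.HIVE_TABLE_NM PARTITION (partition_field_nm)
SELECT * FROM HIVE_DB.HIVE_TABLE_NM;
```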

2 Answers


As far as I know, merging Parquet files through Hive is not available at the moment. Also, "parquet-tools merge" is not recommended, since it might create issues with the data/schema.

However, this can be achieved with the help of Impala.

If you want to merge the files in a table partition by partition, you can create an intermediate table, copy the data of one partition into it (which rewrites the many small files into fewer, larger ones), drop that partition from the original table, and then insert the merged data back. The steps would look like below.

For example,

CREATE TABLE tabB LIKE tabA;

INSERT INTO tabB PARTITION (partition_key) SELECT * FROM tabA WHERE partition_key=1;

ALTER TABLE tabA DROP PARTITION (partition_key=1);

INSERT INTO tabA PARTITION (partition_key) SELECT * FROM tabB;

DROP TABLE tabB;

where tabA is the original table with many small Parquet files in HDFS, and tabB is the intermediate table created to merge the Parquet files of tabA.

Gomz
    Thanks for your help in this regard. I was thinking of using parquet-tools merge but will not go ahead with it, as there are disadvantages. I will see how to use your second way of merging with dynamic partitions. – Deep Jun 25 '19 at 06:54

I schedule something like this once a week on tables that get fragmented due to frequent small data additions.

CREATE TABLE tabB LIKE tabA;
INSERT INTO tabB PARTITION (colX) SELECT col1, col2, colX FROM tabA;
ALTER TABLE tabA RENAME TO tabA_tmp;
ALTER TABLE tabB RENAME TO tabA;
DROP TABLE tabA_tmp;

Note that the tabA_tmp operations could be replaced by a single DROP TABLE tabA once you build confidence in your defragmentation script.
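If the INSERT above is run from Hive rather than Impala, dynamic partitioning has to be enabled first. A minimal sketch, reusing this answer's placeholder names (tabA, tabB, col1, col2, colX):

```sql
-- Required in Hive before a dynamic-partition INSERT
-- (Impala does not need these settings).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition column (colX) must come last in the SELECT list.
INSERT INTO tabB PARTITION (colX)
SELECT col1, col2, colX FROM tabA;
```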

formixian