I have a hive table which contains call data records(CDRs). I have the table partitioned on the phone number and bucketed on call_date. Now when I am inserting data into hive the back dated call_date are creating small files in my buckets which is creating name node metadata increase and performance slowdown. Is there a way to merge these small files into one.
Asked
Active
Viewed 1.6k times
2
-
3What is the file format? Just an opinion: Partition it by call_date and bucket by phone number, you will benefit from date filtering then.Partition by phone_number creats too many partitions and too many files – leftjoin Sep 16 '19 at 05:57
-
The file format is ORC. It’s not possible to change the partitioning and bucketing of the table now – Abhimanyu Mitra Sep 16 '19 at 18:04
-
2You can merge ORC files using ALTER TABLE concatenate: https://stackoverflow.com/a/50042837/2700344 – leftjoin Sep 16 '19 at 18:07
1 Answers
11
One way to control the size of files when inserting into a table using Hive, is to set the below parameters:
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=128000000;
set hive.merge.smallfiles.avgsize=128000000;
This will work for both M/R and Tez engine and will ensure that all files created are at or below 128 MB in size (you can alter that size number according to your use case. Additional reading here: https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963).
The easiest way to merge the files of the table is to remake it, while having ran the above hive commands at runtime:
CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table select * from old_table;
In your case, for ORC tables you can concatenate the files after creation:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value')] CONCATENATE;

Theofanis
- 303
- 2
- 5
-
1Thank you for the answer, we are using the table remake process for now. – Abhimanyu Mitra Sep 25 '19 at 07:10
-
Excellent. Let me know if the answer can be improved, and please select it as correct if it was useful :) – Theofanis Sep 26 '19 at 10:18
-
Just 2 points: does this work with external tables? Compaction whilst writing to table or partition, can that cause an error to the writer? – thebluephantom Sep 09 '20 at 07:26