2

I have a hive table which contains call data records(CDRs). I have the table partitioned on the phone number and bucketed on call_date. Now when I am inserting data into hive the back dated call_date are creating small files in my buckets which is creating name node metadata increase and performance slowdown. Is there a way to merge these small files into one.

Abhimanyu Mitra
  • 63
  • 1
  • 2
  • 8
  • 3
    What is the file format? Just an opinion: Partition it by call_date and bucket by phone number, you will benefit from date filtering then.Partition by phone_number creats too many partitions and too many files – leftjoin Sep 16 '19 at 05:57
  • The file format is ORC. It’s not possible to change the partitioning and bucketing of the table now – Abhimanyu Mitra Sep 16 '19 at 18:04
  • 2
    You can merge ORC files using ALTER TABLE concatenate: https://stackoverflow.com/a/50042837/2700344 – leftjoin Sep 16 '19 at 18:07

1 Answers1

11

One way to control the size of files when inserting into a table using Hive, is to set the below parameters:

set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=128000000;
set hive.merge.smallfiles.avgsize=128000000;

This will work for both M/R and Tez engine and will ensure that all files created are at or below 128 MB in size (you can alter that size number according to your use case. Additional reading here: https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963).

The easiest way to merge the files of the table is to remake it, while having ran the above hive commands at runtime:

CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table select * from old_table;

In your case, for ORC tables you can concatenate the files after creation:

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value')] CONCATENATE;
Theofanis
  • 303
  • 2
  • 5