Every 15 minutes I need to insert data into different tables stored as ORC and aggregate the values. These INSERTs use dynamic partitions. Each INSERT creates a new file in the partition, which slows down my aggregation queries. I've searched the web and found some threads about this case, like this one.
So I've added these settings to hive-site.xml:
hive.merge.mapfiles=true
hive.merge.mapredfiles=true
hive.merge.tezfiles=true
hive.merge.smallfiles.avgsize=256000000
But even with these settings, each insert creates a new file in each partition, and the files are not merged.
Does anyone have an idea how I can solve this issue?
My cluster is an Azure HDInsight 3.2 cluster with Hive 0.14 and Tez 0.5.2. My insert query looks like this:
INSERT INTO TABLE measures PARTITION (year, month, day)
SELECT area,
       device,
       date,
       val,
       year,
       month,
       day
FROM stagingmeasures
DISTRIBUTE BY year, month, day;
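In case it helps to reproduce the problem, here is a minimal sketch of how the same merge settings could be applied per session right before the INSERT, and how to check what the session actually picked up. I'm assuming here that session-level SET statements are allowed on the cluster and behave the same as the hive-site.xml entries; the property names and values are the ones listed above.

```sql
-- Session-level equivalent of the hive-site.xml entries above
-- (assumption: session overrides are permitted and equivalent).
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=256000000;

-- SET with a property name and no value prints the value in effect,
-- which shows whether the setting reached the session at all.
SET hive.merge.tezfiles;
SET hive.merge.smallfiles.avgsize;
```

Running the two value-less SET statements in the same session as the INSERT should at least rule out the case where the hive-site.xml changes were never loaded.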
Thanks in advance