1

Every 15 minutes I need to insert datas to different tables stored as ORC and aggregates the values. Those INSERT use dynamic partitions. Each INSERT create a new file in the partition, which is slow down my aggregation queries. I've search on the web, found some subjects about this case like this one.

So I've added on the hive-site.xml those settings :

hive.merge.mapfiles =true;
hive.merge.mapredfiles =true;
hive.merge.tezfiles = true
hive.merge.smallfiles.avgsize=256000000;

But even with those settings, each insert create a new file on each partitions and files are not merged.

Is someone have an idea on how I can solve this issue ?

My cluster is an Azure HDInsight cluster 3.2, with Hive 0.14, Tez 0.5.2. My insert query is like this one :

INSERT INTO TABLE measures PARTITION(year, month, day)
SELECT  area,
    device,
    date,
    val,
    year,
    month,
    day
FROM stagingmeasures
DISTRIBUTE BY year, month, day;

Thanks in advance

Community
  • 1
  • 1
mklotz
  • 103
  • 1
  • 1
  • 5

2 Answers2

2

Create a swap table for your target table(like below) so that all the small files will be merged into a single file. Drop the original table and rename the swap table to original table.

Do this once in a day.

CREATE TABLE swap_measures SELECT * FROM measures;
DROP TABLE measures;
ALTER TABLE swap_measures RENAME TO measures;
goks
  • 1,196
  • 3
  • 18
  • 37
0

If your goal is to stream directly into Hive using the ORC format, look at the link below to read more on the Hive Streaming API

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

Hope this helps.

Rashim Gupta
  • 108
  • 5