I have 2 hive table as source. Say
- DEV.INPUT_01
- DEV.INPUT_02
I have 1 more table as DEV.TARGET. I want to load data into this table for above 2 input tables. The HQL which I have is:
insert overwrite table DEV.TARGET partition(c30)
select
c1
,c2
,c3
,c4
,c5
,c6
,c7
,c8
,c9
,c10
,c11
,c12
,c13
,c14
,c15
,c16
,c17
,c18
,c19
,c20
,c21
,c22
,c23
,c24
,c25
,c26
,c27
,c28
,c29
,c30
from
DEV.SOURCE_01 t1 left join
DEV.SOURCE_02 t2 on
t1.tab_id = t2.tab_id;
The query is working fine. Number of mapper are 700 and reducers are 400.
The problem is above query is generating 400 files per partition and size of every file is around 200K.
I have tried multiple parameter combinations:
Setting 1:
set hive.exec.reducers.bytes.per.reducer=256000000;
Result 1 Number of reducers decreased to 100 and hence 100 files generated per partition.
Setting 2
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
Result 2 Above setting launched 2 MR steps and result is same.
Setting 3
set mapred.reduce.tasks=40;
Result 3
- Number of files are reduced to 40 (which is expected)
- Query performance degraded by 3 folds (original query to 20 mins and with this setting it took 55 mins).
- The another problem is data size with this setting. As data grows this setting start degrading more and hence will be tough to manage.
Question How can I generate files of size 128M?