I have 2 Hive tables as source, say:

  • DEV.INPUT_01
  • DEV.INPUT_02

I have 1 more table, DEV.TARGET. I want to load data into this table from the above 2 input tables. The HQL I have is:

insert overwrite table DEV.TARGET partition(c30)
select
   c1
  ,c2
  ,c3
  ,c4
  ,c5
  ,c6
  ,c7
  ,c8
  ,c9
  ,c10
  ,c11
  ,c12
  ,c13
  ,c14
  ,c15
  ,c16
  ,c17
  ,c18
  ,c19
  ,c20
  ,c21
  ,c22
  ,c23
  ,c24
  ,c25
  ,c26
  ,c27
  ,c28
  ,c29
  ,c30
from
  DEV.INPUT_01 t1 left join
  DEV.INPUT_02 t2 on
  t1.tab_id = t2.tab_id;

The query works fine. The number of mappers is 700 and the number of reducers is 400.

The problem is that the above query generates 400 files per partition, and every file is only around 200 KB.

I have tried multiple parameter combinations:

Setting 1:

set hive.exec.reducers.bytes.per.reducer=256000000;

Result 1: The number of reducers decreased to 100, and hence 100 files were generated per partition.
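Note that this knob controls reducer input, not output file size, so it can only approximate a target file size. A sketch of aiming it at ~128 MB (assumption: reducer output size roughly tracks reducer input size, which compression and join fan-out can break):

```sql
-- hive.exec.reducers.bytes.per.reducer sets the *input* bytes per reducer,
-- so this only approximates ~128 MB output files per reducer.
set hive.exec.reducers.bytes.per.reducer=134217728;  -- 128 * 1024 * 1024
```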

Setting 2:

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;

Result 2: The above settings launched 2 MR steps, but the result was the same.
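One possible reason the merge step does not help with dynamically partitioned output (behavior varies by Hive version and execution engine) is that merging must be enabled for both map-only and map-reduce jobs, with the average-size trigger set below the per-task merge size. A hedged sketch of that combination:

```sql
-- Sketch: exact merge behavior depends on Hive version and engine.
set hive.merge.mapfiles=true;                 -- merge output of map-only jobs
set hive.merge.mapredfiles=true;              -- merge output of map-reduce jobs
set hive.merge.size.per.task=268435456;       -- target ~256 MB per merged file
set hive.merge.smallfiles.avgsize=134217728;  -- merge when avg file < 128 MB
```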

Setting 3:

set mapred.reduce.tasks=40;

Result 3:

  • The number of files is reduced to 40 (which is expected).
  • Query performance degraded threefold (the original query took 20 mins; with this setting it took 55 mins).
  • Another problem is that this setting does not scale with data size: as the data grows, performance degrades further, and a fixed reducer count becomes tough to manage.

Question: How can I generate files of around 128 MB each?

Ambrish

3 Answers

I don't think you can generate files of a specific size as Hive output. However, you can achieve part of it with partitioning.

This SO question has an answer explaining how to split data across files:

Hive -- split data across files
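Applied to the query in the question, that approach amounts to adding a DISTRIBUTE BY on the dynamic partition column, so all rows of one partition land on the same reducer and each partition gets as few files as possible. A sketch using the INPUT table and column names from the question (untested; large partitions may cause reducer skew):

```sql
-- Sketch: routing rows by the partition column reduces files per partition,
-- at the cost of possible reducer skew for large partitions.
insert overwrite table DEV.TARGET partition(c30)
select
   c1, c2, c3, c4, c5, c6, c7, c8, c9, c10,
   c11, c12, c13, c14, c15, c16, c17, c18, c19, c20,
   c21, c22, c23, c24, c25, c26, c27, c28, c29, c30
from DEV.INPUT_01 t1
left join DEV.INPUT_02 t2
  on t1.tab_id = t2.tab_id
distribute by c30;
```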

hadooper

Please set the following properties:

set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.exec.compress.output=true;

I tried to find exactly which combination of settings worked for me, but only all of them together worked.
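Of the list above, the setting most directly related to the number of files per partition is likely hive.optimize.sort.dynamic.partition; as I understand it, it sorts rows by the dynamic-partition column so each reducer writes partitions one at a time instead of keeping a writer open per partition, which cuts down the number of small files:

```sql
-- Sorts rows by the dynamic-partition column before the write, so each
-- reducer writes one partition at a time (behavior varies by Hive version).
set hive.optimize.sort.dynamic.partition=true;
```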

maxmithun

If you want to reduce the number of partition files in HDFS, you need to limit the block size with Hive parameters. For instance, if the block size in the cluster is configured to 128 MB:

SET dfs.blocksize=134217728;

(The number above is 128 MB expressed in bytes: 128 × 1024 × 1024 = 134217728.) With that you should sort out the small partition file issue.