I have 2 Hive tables as source, say:

  • DEV.INPUT_01
  • DEV.INPUT_02

I have 1 more table, DEV.TARGET. I want to load data into this table from the above 2 input tables. The HQL I have is:

insert overwrite table DEV.TARGET partition(c30)
select
   c1
  ,c2
  ,c3
  ,c4
  ,c5
  ,c6
  ,c7
  ,c8
  ,c9
  ,c10
  ,c11
  ,c12
  ,c13
  ,c14
  ,c15
  ,c16
  ,c17
  ,c18
  ,c19
  ,c20
  ,c21
  ,c22
  ,c23
  ,c24
  ,c25
  ,c26
  ,c27
  ,c28
  ,c29
  ,c30
from
  DEV.INPUT_01 t1 left join
  DEV.INPUT_02 t2 on
  t1.tab_id = t2.tab_id;

The query works fine. The number of mappers is 700 and the number of reducers is 400.

The problem is that the above query generates 400 files per partition, and every file is only around 200 KB.

I have tried multiple parameter combinations:

Setting 1:

set hive.exec.reducers.bytes.per.reducer=256000000;

Result 1: The number of reducers decreased to 100, and hence 100 files were generated per partition.
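Note that this knob controls reducer input, not output file size, so it can only approximate a target file size. A sketch of aiming it at ~128 MB (assumption: reducer output size roughly tracks reducer input size, which compression and join fan-out can break):

```sql
-- hive.exec.reducers.bytes.per.reducer sets the *input* bytes per reducer,
-- so this only approximates ~128 MB output files per reducer.
set hive.exec.reducers.bytes.per.reducer=134217728;  -- 128 * 1024 * 1024
```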

Setting 2:

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;

Result 2: The above settings launched 2 MR steps, but the result was the same.
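One possible reason the merge step does not help with dynamically partitioned output (behavior varies by Hive version and execution engine) is that merging must be enabled for both map-only and map-reduce jobs, with the average-size trigger set below the per-task merge size. A hedged sketch of that combination:

```sql
-- Sketch: exact merge behavior depends on Hive version and engine.
set hive.merge.mapfiles=true;                 -- merge output of map-only jobs
set hive.merge.mapredfiles=true;              -- merge output of map-reduce jobs
set hive.merge.size.per.task=268435456;       -- target ~256 MB per merged file
set hive.merge.smallfiles.avgsize=134217728;  -- merge when avg file < 128 MB
```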

Setting 3:

set mapred.reduce.tasks=40;

Result 3:

  • The number of files is reduced to 40 (which is expected).
  • Query performance degraded threefold (the original query took 20 mins; with this setting it took 55 mins).
  • Another problem is that this setting does not scale with data size: as the data grows, performance degrades further, and a fixed reducer count becomes tough to manage.

Question: How can I generate files of around 128 MB each?

Ambrish

3 Answers

I don't think you can generate files of a specific size as Hive output. However, you can achieve part of it with partitioning.

This SO question has an answer explaining how to split data across files:

Hive -- split data across files
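Applied to the query in the question, that approach amounts to adding a DISTRIBUTE BY on the dynamic partition column, so all rows of one partition land on the same reducer and each partition gets as few files as possible. A sketch using the INPUT table and column names from the question (untested; large partitions may cause reducer skew):

```sql
-- Sketch: routing rows by the partition column reduces files per partition,
-- at the cost of possible reducer skew for large partitions.
insert overwrite table DEV.TARGET partition(c30)
select
   c1, c2, c3, c4, c5, c6, c7, c8, c9, c10,
   c11, c12, c13, c14, c15, c16, c17, c18, c19, c20,
   c21, c22, c23, c24, c25, c26, c27, c28, c29, c30
from DEV.INPUT_01 t1
left join DEV.INPUT_02 t2
  on t1.tab_id = t2.tab_id
distribute by c30;
```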

hadooper

Please set the following properties:

set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.exec.compress.output=true;

I tried to find exactly which combination of settings worked for me, but only all of them together worked.
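Of the list above, the setting most directly related to the number of files per partition is likely hive.optimize.sort.dynamic.partition; as I understand it, it sorts rows by the dynamic-partition column so each reducer writes partitions one at a time instead of keeping a writer open per partition, which cuts down the number of small files:

```sql
-- Sorts rows by the dynamic-partition column before the write, so each
-- reducer writes one partition at a time (behavior varies by Hive version).
set hive.optimize.sort.dynamic.partition=true;
```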

maxmithun

If you want to reduce the number of partition files in HDFS, you need to limit the block size with Hive parameters. For instance, if the block size in the cluster is configured to 128 MB:

SET dfs.blocksize=134217728;

(The number above is 128 MB expressed in bytes: 128 × 1024 × 1024 = 134217728.) With that you should sort out the small partition file issue.