
I was reading an article about how small files degrade the performance of Hive queries: https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1

I understand the first part regarding overloading the NameNode.

However, what he says regarding MapReduce doesn't seem to happen, for either MapReduce or Tez.

When a MapReduce job launches, it schedules one map task per block of data being processed

I don't see a mapper task created per file. Maybe the reason is that he is referring to version 1 of MapReduce, and a lot has changed since then.

Hive Version: Hive 1.2.1000.2.6.4.0-91

My table:

create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;

Data: the following loop will create 100 small files, each containing only a few KB of data.

 for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
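
To sanity-check the layout, you can list the table directory on HDFS (the warehouse path below is an assumption based on the default HDP sandbox layout; adjust it for your cluster). Each insert writes its own tiny ORC file, so roughly 100 files should show up:

# path is an assumption (default HDP warehouse dir); adjust for your cluster
hdfs dfs -count /apps/hive/warehouse/temp.db/emp_orc_small_files
hdfs dfs -ls -h /apps/hive/warehouse/temp.db/emp_orc_small_files | head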

However, I see only one mapper and one reducer task being created for the following query.

[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)

Same result with MapReduce.

hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job  -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%,  reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.36 sec   HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
Gaurang Shah
  • If you use ACID tables w/ ORC + Hive Streaming, then it'll compact the small files – OneCricketeer Sep 12 '18 at 22:45
  • These are normal ORC tables, without partitioning or bucketing. And I can see 100 files created after the insert statements. – Gaurang Shah Sep 12 '18 at 23:15
  • The number of files isn't really important, more the size of them... Might want to see https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/ Other than that, Tez needs to have a warm-up period, and running the same query multiple times in a row on the same table can yield different results – OneCricketeer Sep 13 '18 at 01:13
  • @cricket_007 the link you posted explains transactional tables, and this is not one. – Gaurang Shah Sep 13 '18 at 01:41
  • I am just trying to understand why it isn't if you are going to complain about the performance difference of the small files, that is all – OneCricketeer Sep 13 '18 at 01:43
  • I am trying to understand whether the statement `1 mapper task per block` is true in newer versions of Hive with either Tez or MR, as I don't see it. I have 100 files (385 B each) occupying one block each. – Gaurang Shah Sep 13 '18 at 02:54
  • Hive will internally use a `CombineFileInputFormat`, like the link you mentioned describes (see parts 2 and 3), for dealing with small files. See the `hive.tez.input.format` or `hive.input.format` properties – OneCricketeer Sep 13 '18 at 03:03
  • @cricket_007 Thanks, `hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat`. Now I understand what is happening. So small files are not that big an issue if the numbers are not huge. – Gaurang Shah Sep 13 '18 at 14:21
  • Please read also this answer about the number of mappers: https://stackoverflow.com/a/42842117/2700344 – leftjoin Nov 12 '18 at 09:53

1 Answer


This is because the following configuration is taking effect:

hive.hadoop.supports.splittable.combineinputformat

From the documentation:

Whether to combine small input files so that fewer mappers are spawned.

So essentially Hive can infer that the input is a group of files smaller than the block size and combine them, reducing the required number of mappers.
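
To see this on your own cluster, a quick check is to print the relevant properties from the Hive shell and then, as an experiment, switch to the non-combining input format. The property and class names are the standard Hive ones mentioned above and in the comments; whether you then get exactly one mapper per file depends on your Hive version and execution engine, so treat this as a sketch rather than guaranteed behaviour:

-- show the current values of the properties discussed above
set hive.hadoop.supports.splittable.combineinputformat;
set hive.input.format;

-- experiment (assumption: forcing the plain input format on the MR engine
-- gives one split, and hence one mapper, per small file)
set hive.execution.engine=mr;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select max(salary) from temp.emp_orc_small_files;

With `CombineHiveInputFormat` left in place, the many small files collapse into a single split, which is why the query in the question shows only one mapper.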

hlagos