Managing input split sizes in Hive running the tez engine

Question

I want to gain a better understanding of how input splits are calculated in the tez engine.

I am aware that the hive.input.format property can be set to either HiveInputFormat (the default) or CombineHiveInputFormat (generally recommended for a large number of files with sizes << HDFS block size).

I was hoping someone could walk me through the differences in how HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file sizes vary from small (less than a block) to large (spanning multiple blocks).

I want to dictate the number of mapper tasks that are spawned for scanning a table. For the MR engine this can be controlled by setting the mapred.min.split.size and mapred.max.split.size properties. I need to know if there are similar configurations for the tez engine.

Also, the properties tez.grouping.max-size, tez.grouping.min-size and tez.grouping.split-waves have been set to 1GB, 16MB and 1.7 respectively. However, I observed that the created input splits do not adhere to these properties.

I had two files of 3MB each for a table. Given these settings, only 1 mapper task should have been spawned, but 2 were spawned instead.
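To make that expectation concrete, here is a rough sketch of how I understand the grouping arithmetic. It is only my simplified mental model, not Tez's actual code: the function name and the `available_slots` parameter (a stand-in for however many task slots the queue offers) are hypothetical, and it ignores locality and any other criteria the real grouper may apply.

```python
MB = 1024 * 1024
GB = 1024 * MB


def estimate_grouped_splits(file_sizes, available_slots,
                            waves=1.7, min_size=16 * MB, max_size=1 * GB):
    """Estimate the number of grouped splits (simplified model, an assumption).

    file_sizes      -- sizes in bytes of the original (ungrouped) splits
    available_slots -- hypothetical count of free task slots in the queue
    waves           -- value of tez.grouping.split-waves
    min_size        -- value of tez.grouping.min-size in bytes
    max_size        -- value of tez.grouping.max-size in bytes
    """
    total = sum(file_sizes)

    # Start from the number of tasks suggested by capacity times waves.
    desired = max(1, int(available_slots * waves))
    per_group = total // desired

    if per_group > max_size:
        # Groups would be too large: use more groups.
        desired = total // max_size + 1
    elif per_group < min_size:
        # Groups would be too small: use fewer groups.
        desired = total // min_size + 1

    # Grouping cannot produce more groups than there are original splits.
    return min(desired, len(file_sizes))


# My case: two 3MB files, with an assumed 10 available slots.
print(estimate_grouped_splits([3 * MB, 3 * MB], available_slots=10))  # prints 1
```

Under this model the 6MB total is far below the 16MB minimum, so both files should end up in a single grouped split, which is why the 2 mappers surprised me.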

Are there other properties in hive/tez that need to be set to enable input split grouping?

I would highly appreciate any input.

Thanks!

I found that when using CombineHiveInputFormat we can dictate the number of mappers by setting the following properties `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`. The confusion still holds for HiveInputFormat. Hoping someone could clear that up! — Nitin Kumar, Apr 27 '16 at 12:15


0 Answers