I want to gain a better understanding of how input splits are calculated in the Tez engine.

I am aware that the hive.input.format property can be set to either HiveInputFormat (default) or CombineHiveInputFormat (generally recommended when there are many files much smaller than the HDFS block size).

I was hoping someone could walk me through the differences in how HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file sizes vary from small (smaller than a block) to large (spanning multiple blocks).

I want to control the number of mapper tasks that are spawned for scanning a table. For the MR engine this can be done by setting the mapred.min.split.size and mapred.max.split.size properties. I need to know whether there are similar configurations for the Tez engine.
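For illustration, this is roughly what I mean on the MR engine. The byte values below are arbitrary examples (128 MB and 256 MB), not my actual settings:

```sql
-- Illustrative values only: 128 MB minimum and 256 MB maximum split size
SET hive.execution.engine=mr;
SET mapred.min.split.size=134217728;
SET mapred.max.split.size=268435456;
```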

Also, the properties tez.grouping.max-size, tez.grouping.min-size and tez.grouping.split-waves are set to 1 GB, 16 MB and 1.7 respectively. However, I observed that the input splits created do not adhere to these properties.
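Expressed as session-level statements, the configuration would look roughly like this. The byte values are my conversion of 1 GB and 16 MB using binary units, and this sketch assumes the properties are set in the Hive session rather than in a site file:

```sql
SET hive.execution.engine=tez;
-- 1 GB upper bound for a grouped split
SET tez.grouping.max-size=1073741824;
-- 16 MB lower bound for a grouped split
SET tez.grouping.min-size=16777216;
SET tez.grouping.split-waves=1.7;
```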

The table had two files of 3 MB each. Since their combined size (6 MB) is below tez.grouping.min-size, I expected only 1 mapper task to be spawned, but 2 were spawned instead.

Are there other properties in hive/tez that need to be set to enable input split grouping?

I would highly appreciate any inputs.

Thanks!

Nitin Kumar
    I found that when using CombineHiveInputFormat we can dictate the number of mappers by setting the following properties `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`. The confusion still holds for HiveInputFormat. Hoping someone could clear that up! – Nitin Kumar Apr 27 '16 at 12:15
