I want to gain a better understanding of how input splits are calculated in the Tez engine.
I am aware that the hive.input.format property can be set to either HiveInputFormat (default) or CombineHiveInputFormat (generally recommended when a table has a large number of files much smaller than the HDFS block size).
I was hoping someone could walk me through the differences in how HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file sizes vary from small (smaller than a block) to large (spanning multiple blocks).
I want to control the number of mapper tasks that are spawned to scan a table. For the MR engine this can be done by setting the mapred.min.split.size and mapred.max.split.size properties. I need to know whether there are similar configurations for the Tez engine.
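For reference, the MR-side behaviour mentioned above can be sketched as follows. This is a minimal sketch, assuming the plain split-size rule from Hadoop's FileInputFormat (new mapreduce API), max(minSize, min(maxSize, blockSize)); it ignores the small slop factor Hadoop applies to the last split, and it does not model CombineHiveInputFormat. The function names and the 128 MB block size are illustrative assumptions.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    # Split-size rule assumed here: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))

def mapper_count(file_size, block_size, min_size, max_size):
    # Approximate number of splits (and hence mappers) for one file.
    # Simplification: ceiling division, ignoring Hadoop's last-split slop.
    split = compute_split_size(block_size, min_size, max_size)
    return -(-file_size // split)

# A 1 GB file on 128 MB blocks with a max split size of 256 MB:
# the block size wins, so the split size stays at 128 MB.
print(mapper_count(1024 * MB, 128 * MB, 1, 256 * MB))

# Raising the min split size to 256 MB forces larger splits, fewer mappers.
print(mapper_count(1024 * MB, 128 * MB, 256 * MB, 512 * MB))
```

Under these assumptions, raising the minimum above the block size is what reduces the mapper count, while lowering the maximum below the block size increases it.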
Also, the properties tez.grouping.max-size, tez.grouping.min-size and tez.grouping.split-waves have been set to 1GB, 16MB and 1.7 respectively. However, I observed that the input splits created do not adhere to these properties.
I had two files of 3MB each for a table. With these settings I expected only 1 mapper task to be spawned, but 2 mapper tasks were spawned instead.
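The expectation of a single mapper can be checked with a rough model. This is only a sketch, not the actual Tez implementation: it assumes the grouper clamps the per-group size to the range between tez.grouping.min-size and tez.grouping.max-size, and `desired_groups` is a hypothetical stand-in for whatever target Tez derives from cluster capacity and tez.grouping.split-waves.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def expected_group_count(file_sizes, min_size, max_size, desired_groups):
    # Simplified model (assumption): start from total / desired_groups,
    # clamp that per-group size into [min_size, max_size], then count groups.
    total = sum(file_sizes)
    if total == 0:
        return 0
    per_group = total / desired_groups
    per_group = min(max(per_group, min_size), max_size)
    return max(1, math.ceil(total / per_group))

# Two 3MB files, min-size 16MB, max-size 1GB, hypothetical target of 2 groups.
# The 3MB per-group size is below min-size, so it is raised to 16MB and both
# files fit into a single group.
print(expected_group_count([3 * MB, 3 * MB], 16 * MB, 1 * GB, 2))
```

Under this model the two files collapse into one group regardless of the target, which is why the 2 observed mappers look like grouping was not applied at all.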
Are there other properties in Hive/Tez that need to be set to enable input split grouping?
I would highly appreciate any input.
Thanks!