For years I have been reading and hearing about the 'performance advantage' that database joins on BIGINT columns have over joins on (VAR)CHAR columns.
Unfortunately, when looking for real answers or advice on similar questions:
- The examples are set in a 'traditional' RDBMS context, such as MySQL, Oracle, or SQL Server. Take for instance this question or this example.
- The answers are quite old, and the resulting difference in runtime is not that great. Again, see this example.
I have not seen an example using a version of Hive (preferably 1.2.1 or higher) where a large, big-data-ish data set (let us say 500 million+ rows) is joined to a similarly sized dataset on:
- a BIGINT column,
- versus a (VAR)CHAR(32) column,
- versus a (VAR)CHAR(255) column.
I am choosing a size of 32 because that is the length of an MD5 hash rendered as hex characters, and 255 because it is in range of the largest natural key I have ever seen.
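To make this concrete, here is a minimal sketch of the kind of tables and join queries I have in mind (all table and column names are made up for illustration):

```sql
-- Hypothetical table; fact_b would be defined identically, 500M+ rows in each.
CREATE TABLE fact_a (
  id_bigint BIGINT,       -- surrogate key
  id_md5    CHAR(32),     -- MD5 hash rendered as hex characters
  id_natkey VARCHAR(255)  -- 'worst case' natural key
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- The three join variants to compare:
SELECT COUNT(*) FROM fact_a a JOIN fact_b b ON a.id_bigint = b.id_bigint;
SELECT COUNT(*) FROM fact_a a JOIN fact_b b ON a.id_md5    = b.id_md5;
SELECT COUNT(*) FROM fact_a a JOIN fact_b b ON a.id_natkey = b.id_natkey;
```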
Furthermore, I would expect Hive:
- to run under the Tez engine,
- to use a compressed file format like ORC with ZLIB or Snappy.
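In other words, something along the lines of these session settings (standard Hive properties, to my knowledge):

```sql
SET hive.execution.engine=tez;            -- run the joins on Tez
SET hive.exec.orc.default.compress=ZLIB;  -- or SNAPPY, for newly created ORC tables
```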
Does anyone know of such an example, substantiated with evidence such as Hive EXPLAIN plans, CPU, file and network resource usage, and query runtimes?
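For instance, output like the following for each of the three join variants would already go a long way (fact_a and fact_b being the hypothetical tables sketched above):

```sql
-- Print the query plan without executing the join
EXPLAIN
SELECT COUNT(*) FROM fact_a a JOIN fact_b b ON a.id_bigint = b.id_bigint;

-- On Tez, this should make Hive print a per-query execution summary
-- (task runtimes and counters) after each run
SET hive.tez.exec.print.summary=true;
```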