
For years I have been reading and hearing about the 'performance advantage' that database joins on BIGINT columns have over joins on (VAR)CHAR columns.

Unfortunately, when looking for real answers or advice on similar questions:

  • The examples used are in a 'traditional' RDBMS context, like MySQL, Oracle, or SQL Server. Take for instance this question or this example
  • The answers are quite old and the resulting difference in runtime is not that great. Again, see this example

I have not seen an example using a version of Hive (preferably version 1.2.1 or higher) where a large (big-data-ish) data set (let us say 500 million+ rows) is joined to a similarly sized dataset on:

  1. a BIGINT column.
  2. VERSUS a (VAR)CHAR(32) column.
  3. VERSUS a (VAR)CHAR(255) column.

I am choosing a size of 32 because it is the length of an MD5 hash rendered as hex characters, and 255 because it is in range of the largest natural key I have ever seen.
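For concreteness, a minimal sketch of the table layouts I have in mind (all table and column names here are hypothetical, just to illustrate the three key variants):

```sql
-- Variant 1: surrogate key as BIGINT
CREATE TABLE fact_bigint  (join_key BIGINT,       payload STRING);
CREATE TABLE dim_bigint   (join_key BIGINT,       payload STRING);

-- Variant 2: MD5 of the natural key, 32 hex characters
CREATE TABLE fact_char32  (join_key CHAR(32),     payload STRING);
CREATE TABLE dim_char32   (join_key CHAR(32),     payload STRING);

-- Variant 3: the natural key itself, up to 255 characters
CREATE TABLE fact_char255 (join_key VARCHAR(255), payload STRING);
CREATE TABLE dim_char255  (join_key VARCHAR(255), payload STRING);
```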

Furthermore, I would expect Hive:

  • to run under the Tez execution engine
  • to use a compressed file format such as ORC with Zlib or Snappy
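In other words, something along these lines (these are the standard Hive/Tez properties; exact behaviour may vary between Hive versions, and the table name is hypothetical):

```sql
SET hive.execution.engine=tez;

-- ORC storage with Zlib compression ('SNAPPY' would be the alternative):
CREATE TABLE fact_bigint (join_key BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```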

Does anyone know of such an example, substantiated with proof in the form of Hive EXPLAIN plans, CPU, file and network resource usage, and query runtimes?
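By 'such an example' I mean something along these lines, run once per key type (table names again hypothetical):

```sql
-- Compare the plans for each key variant...
EXPLAIN
SELECT COUNT(*)
FROM fact_bigint f
JOIN dim_bigint d
  ON f.join_key = d.join_key;

-- ...then time the actual runs (e.g. via `hive -e` under `time`),
-- recording CPU, HDFS, and shuffle counters from the Tez counters/UI.
```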

  • My sense is that you could test this as fast as ask us! A data set of a few million (sparse and dense) values would be large enough to show up any difference. – Strawberry Mar 22 '18 at 18:22
  • Good suggestion, however at the time of asking this question I didn't have access to a system where I could really test this... Now I'm working in BigQuery – Rogier Werschkull Apr 09 '18 at 15:16
  • Possible performance difference is negligible because of parallel execution. Too many other factors like cluster load and mapper/reducer parallelism degree are creating too much measurement uncertainty – leftjoin May 16 '18 at 15:29
  • (Opinion) I/O is the dominant part of the effort. The datatype is less critical. – Rick James Jun 04 '21 at 03:17

0 Answers