
For years I have been reading and hearing about the 'performance advantage' that database joins on BIGINT columns have over joins on (VAR)CHAR columns.

Unfortunately, when looking for real answers or advice on similar questions:

  • The examples used are in a 'traditional' RDBMS context, like MySQL, Oracle, or SQL Server. Take for instance this question or this example
  • The answers are quite old and the resulting difference in runtime is not that great. Again, see this example

I have not seen an example using a version of Hive (preferably version 1.2.1 or higher) where a large (big-data-ish) data set (let us say 500 million+ rows) is joined to a similarly sized dataset on:

  1. a BIGINT column.
  2. VERSUS a (VAR)CHAR(32) column.
  3. VERSUS a (VAR)CHAR(255) column.

I am choosing a size of 32 because it is the length of an MD5 hash rendered as hex characters, and 255 because it is in range of the largest natural key I have ever seen.
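For concreteness, a minimal sketch of the table layouts I have in mind (all table and column names here are hypothetical, just to illustrate the three key variants):

```sql
-- Variant 1: surrogate key as BIGINT
CREATE TABLE fact_bigint  (join_key BIGINT,       payload STRING);
CREATE TABLE dim_bigint   (join_key BIGINT,       payload STRING);

-- Variant 2: MD5 of the natural key, 32 hex characters
CREATE TABLE fact_char32  (join_key CHAR(32),     payload STRING);
CREATE TABLE dim_char32   (join_key CHAR(32),     payload STRING);

-- Variant 3: the natural key itself, up to 255 characters
CREATE TABLE fact_char255 (join_key VARCHAR(255), payload STRING);
CREATE TABLE dim_char255  (join_key VARCHAR(255), payload STRING);
```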

Furthermore, I would expect Hive:

  • to run under the Tez execution engine
  • to use a compressed file format such as ORC with Zlib or Snappy
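In other words, something along these lines (these are the standard Hive/Tez properties; exact behaviour may vary between Hive versions, and the table name is hypothetical):

```sql
SET hive.execution.engine=tez;

-- ORC storage with Zlib compression ('SNAPPY' would be the alternative):
CREATE TABLE fact_bigint (join_key BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```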

Does anyone know of such an example, substantiated with proof in the form of Hive EXPLAIN plans, CPU, file and network resource usage, and query runtimes?
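By 'such an example' I mean something along these lines, run once per key type (table names again hypothetical):

```sql
-- Compare the plans for each key variant...
EXPLAIN
SELECT COUNT(*)
FROM fact_bigint f
JOIN dim_bigint d
  ON f.join_key = d.join_key;

-- ...then time the actual runs (e.g. via `hive -e` under `time`),
-- recording CPU, HDFS, and shuffle counters from the Tez counters/UI.
```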

  • My sense is that you could test this as fast as ask us! A data set of a few million (sparse and dense) values would be large enough to show up any difference. – Strawberry Mar 22 '18 at 18:22
  • Good suggestion, however at the time of asking this question I didn't have access to a system where I could really test this... Now I'm working in BigQuery – Rogier Werschkull Apr 09 '18 at 15:16
  • Possible performance difference is negligible because of parallel execution. Too many other factors like cluster load and mapper/reducer parallelism degree are creating too much measurement uncertainty – leftjoin May 16 '18 at 15:29
  • (Opinion) I/O is the dominant part of the effort. The datatype is less critical. – Rick James Jun 04 '21 at 03:17

0 Answers