8

I understand that when a Hive table is CLUSTERED BY one column, Hive applies a hash function to that bucketed column and places each row of data into one of the buckets. There is one file per bucket, i.e. if there are 32 buckets, there are 32 files in HDFS.

What does it mean to have CLUSTERED BY on more than one column? For example, let's say the table has CLUSTERED BY (continent, country) INTO 32 BUCKETS.

How is the hash function computed when more than one column is involved?

How many files would be generated? Is this still 32?
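The single-column scheme described above can be sketched in Python (hash() here is just a stand-in for Hive's own hash function, which depends on the column type):

```python
NUM_BUCKETS = 32

def bucket_for(value, num_buckets=NUM_BUCKETS):
    # Stand-in for Hive's hash_function(bucketing_column) mod num_buckets.
    # Python's % always returns a non-negative result for a positive modulus.
    return hash(value) % num_buckets

# Every row lands in exactly one of the 32 buckets (one file per bucket).
rows = ["Asia", "Europe", "Africa"]
assert all(0 <= bucket_for(r) < NUM_BUCKETS for r in rows)
```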

Manikandan Kannan

2 Answers

7
  1. Yes, the number of files will still be 32.
  2. The hash function takes both columns together — conceptually "continent,country" as a single combined value — and uses that as its input.
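A rough way to picture the two-column case in Python (a sketch only — Hive folds the per-column hashes together internally rather than literally concatenating strings):

```python
NUM_BUCKETS = 32

def bucket_for(continent, country, num_buckets=NUM_BUCKETS):
    # Combine both column values into a single hash input,
    # then map the combined hash onto one of the 32 buckets.
    return hash((continent, country)) % num_buckets

# The same (continent, country) pair always lands in the same bucket,
# and there are still only 32 buckets -> still 32 files.
assert bucket_for("Asia", "India") == bucket_for("Asia", "India")
assert 0 <= bucket_for("Europe", "France") < NUM_BUCKETS
```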

Hope it helps!!

Maddy RS
  • Thanks @Maddy. So does this mean the join has to be performed on both continent and country? Will it have the same performance benefits if the join is performed on only one column, say country? – Manikandan Kannan Jun 17 '15 at 17:08
  • 2
    If bucketing is applied on two columns as a composite, then a bucket map join will benefit only when you are using both columns as the join criteria. It will not be helpful if you are using only one of them (i.e. country in this case) as a join criterion. – Maddy RS Jun 18 '15 at 08:24
  • Can this behaviour be controlled by us, i.e. can we decide what should happen? – Utkarsh Saraf Apr 25 '17 at 10:41
0

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a bitwise AND with 0x7FFFFFFF in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int and there were 10 buckets, we would expect all user_ids that end in 0 to be in bucket 1, all user_ids that end in 1 to be in bucket 2, etc.

For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT, and the hash of a string or a complex datatype will be some number derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_ids in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution across the buckets.
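The per-type behaviour above can be sketched in Python (hash_java_string here is a hypothetical Java-style string hash for illustration; Hive's internals are similar in spirit but not byte-for-byte identical):

```python
def hash_int(i):
    # For an INT bucketing column, the hash is the value itself.
    return i

def hash_java_string(s):
    # Java-style String.hashCode: h = 31*h + char, as a signed 32-bit int.
    # (Illustrative only; not guaranteed to match Hive's exact string hash.)
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket(hash_value, num_buckets):
    # (hash & 0x7FFFFFFF) mod num_buckets, as described above.
    return (hash_value & 0x7FFFFFFF) % num_buckets

# INT: the value hashes to itself, so user_ids ending in the same digit
# share a bucket when there are 10 buckets.
assert bucket(hash_int(120), 10) == bucket(hash_int(40), 10)

# STRING: the bucket is derived from the hash, not from the trailing digit.
s_bucket = bucket(hash_java_string("user120"), 10)
assert 0 <= s_bucket < 10
```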

ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

  • 1
    My question was more about the situation where two columns are involved in the bucketing. – Manikandan Kannan Jun 16 '15 at 18:23
  • Yes, I understood the question. Since, when hive.enforce.bucketing = true is set, the number of reducers launched equals the number of buckets, in this case yes, it will create 32 files only. – krishna.kadigari Jun 16 '15 at 19:30