I am a bit new to Hadoop. As per my knowledge, buckets are a fixed number of partitions in a Hive table, and Hive uses the same number of reducers as the total number of buckets defined while creating the table. So can anyone tell me how to calculate the total number of buckets in a Hive table? Is there any formula for calculating the total number of buckets?
- Got a formula: #buckets = (x * Average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode, where x (> 1) is the "factor of conservatism". But I am not clear about it. Need a clear formula on this. – Biswa Bandana Nayak Jun 09 '15 at 11:22
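For reference, a minimal sketch of that formula; all the numbers below are assumed sample values, not Hive defaults or anything from this thread:

```python
# A minimal sketch of the "factor of conservatism" formula from the comment
# above. All values here are assumed sample numbers, not Hive defaults.
import math

avg_partition_size_mb = 2300  # assumed average partition size (MB)
task_jvm_memory_mb = 1024     # assumed JVM memory available to a task node (MB)
x = 2.0                       # "factor of conservatism", must be > 1

num_buckets = math.ceil(x * avg_partition_size_mb / task_jvm_memory_mb)
print(num_buckets)  # 5 with these sample values
```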
5 Answers
Let's take a scenario where the table size is 2300 MB and the HDFS block size is 128 MB.
Now divide: 2300 / 128 = 17.96.
Remember that the number of buckets should always be a power of 2.
So we need to find n such that 2^n > 17.96:
n = 5
So I am going to use 2^5 = 32 buckets.
Hope it will help some of you.
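A minimal sketch of this calculation in Python, using the same figures as above:

```python
# Divide the table size by the HDFS block size, then round up to the next
# power of 2, as described in the answer above.
import math

table_size_mb = 2300
block_size_mb = 128

ratio = table_size_mb / block_size_mb   # 17.96875
n = math.ceil(math.log2(ratio))         # smallest n with 2**n >= ratio
num_buckets = 2 ** n
print(n, num_buckets)                   # 5 32
```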

From the documentation link:

> In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a 0x7FFFFFFF in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.
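To make the mod rule concrete, here is an illustration only; Python's built-in `hash()` is not Hive's hash_function, but the bucket-assignment rule is the same idea:

```python
# Illustration only: Python's hash() is NOT Hive's hash_function, but the
# rule hash_function(bucketing_column) mod num_buckets works the same way.
num_buckets = 10

def bucket_for(value):
    # For ints, Hive's hash is the identity, as the quoted docs say.
    h = value if isinstance(value, int) else hash(value)
    return (h & 0x7FFFFFFF) % num_buckets  # the mask keeps the result non-negative

print(bucket_for(1230))       # 0 -- int user_ids ending in 0 share a bucket
print(bucket_for(4567))       # 7
print(bucket_for("user123"))  # some bucket derived from the string's hash
```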

- Thanks. But my question is how we can decide the total number of buckets for a Hive table. – Biswa Bandana Nayak Jun 09 '15 at 17:23
- Thanks. But my question is how we can decide the total number of buckets for a Hive table. For example: `CREATE EXTERNAL TABLE SALES_EXT_BUCKET (store_id STRING, order_no STRING, Order_Date STRING) CLUSTERED BY (order_no) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;` So here, how do we decide to use 4 buckets? I know the total number of buckets is always a power of 2. Is there any formula for that? If yes, I would appreciate any inputs. – Biswa Bandana Nayak Jun 09 '15 at 17:30
- I think you have seen [this](http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3C350967547.114894.1335230199385.JavaMail.root@sms-zimbra-message-store-03.sms.scalar.ca%3E) to get the above formula. In short, it actually depends on your use case and how you want to query at a later stage. [This](https://groups.google.com/forum/#!topic/chennaihug/2rfXrRnnphc) has a practical example which might help you. Happy learning. – Ramzy Jun 09 '15 at 17:34
If you want to know how many buckets you should choose in your CLUSTERED BY clause, I believe it is good to choose a number that results in buckets that are at or just below your HDFS block size. This should help avoid having HDFS allocate space to files that are mostly empty. Also, choose a number that is a power of two.
You can check your HDFS block size with:
`hdfs getconf -confKey dfs.blocksize`
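Putting that together, a rough sizing sketch; the 1 TB table size is taken from the comment below, and the 128 MB block size is an assumed value rather than one read from a real cluster:

```python
# Rough sizing sketch following the advice above: aim for buckets at or just
# below the HDFS block size, rounded up to a power of 2.
import math

block_size = 128 * 1024 ** 2   # assumed; check with `hdfs getconf -confKey dfs.blocksize`
table_size = 1 * 1024 ** 4     # 1 TB, the figure asked about in the comment below

min_buckets = table_size / block_size                 # 8192 buckets of ~one block each
num_buckets = 2 ** math.ceil(math.log2(min_buckets))
print(num_buckets)                                    # 8192 (already a power of 2)
```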

- Can you please elaborate your answer? Let's suppose we have 1 TB of data; then how many buckets can we assign? Please explain. – vijayraj34 Apr 13 '18 at 09:17
- @vijayraj34 unfortunately it depends. You could start by checking your block size, see [here](https://stackoverflow.com/questions/8411180/hadoop-fs-lookup-for-block-size?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa). From there you could naively try `num_buckets = 1TB/block-size`. There may be advantages to further considering the memory requirements of each datum (e.g. in case they are larger than the block size?), not totally sure. – conner.xyz Apr 13 '18 at 19:47
- One more question: how do we choose the bucketing column? What if our column's cardinality is 2? If that's not the right one to select, then what cardinality should we consider? – vijayraj34 Apr 14 '18 at 17:47
- @vijayraj34 it comes down to a hash function. Read more [here](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables): `How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. ...` – conner.xyz Apr 16 '18 at 17:03
The optimal bucket number is (B * hash table size of the table) / total memory of the node, where B = 1.01 is a small safety factor.
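A minimal sketch of this formula, with assumed sample sizes (the 40 GB hash table and 8 GB node memory are illustrative, not from the thread):

```python
# Sketch of: num_buckets = (B * hash_table_size) / node_memory, B = 1.01.
# Each bucket's share of the hash table should then fit in a node's memory.
import math

hash_table_size = 40 * 1024 ** 3  # assumed in-memory hash table size: 40 GB
node_memory = 8 * 1024 ** 3       # assumed memory available on one node: 8 GB
B = 1.01                          # the safety factor from the answer

num_buckets = math.ceil(B * hash_table_size / node_memory)
print(num_buckets)  # 6
```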

Divide the size of the data by the block size, then compare the answer with powers of 2 (2^n). The closest n will be the number of buckets.
