Hive buckets vs Partitioning

Question

Continuing from question: What is the difference between partitioning and bucketing a table in Hive ?

Suppose we have partitioned the employee table by Salary column. If we write a select query from this table with a condition in where clause that is based on salary column, then the query would run fast.. as it would only fetch details from a particular partition only.

However, rather than partitioning, if we bucket the same table based on salary column and set fixed buckets numbers. Then in this case if we write the same query, I would like to know how would that query benefit from buckets?? Can anyone please explain?

score 3 · Answer 1 · answered Nov 04 '15 at 14:50

One use case of bucketing is fast sampling of the data.

Imagine a table

CREATE TABLE employee(employeeId String, salary Double) 
CLUSTERED BY(user_id) INTO 10 BUCKETS;

We have bucketed not by salary but by employeeId, such that each employee is placed somewhat randomly into a bucket. We can say it is statistically likely that salaries are distributed fairly evenly throughout the buckets.

If we wanted to find out the average salary, we could run this query on a subset of the buckets, which is faster than running the query over the whole data set, but gives a good approximation.

Select Average(salary) FROM employee TABLESAMPLE(BUCKET 1 OUT OF 10 ON employeeId);

score 1 · Answer 2 · answered Dec 17 '15 at 07:34

as said by mattinbits, bucketing will be more useful if you bucket on employee id rather than salary.

And the number of buckets can be kept in a power of 2. like 2,4,8,16,32...

To decide how many buckets, you should consider the amount of data in one bucket=(total size of data/number of buckets) < (should be smaller than) the size of your memory.

and when you use a join with another table having buckets on the same column , with a number of buckets which is same or multiple of the number of buckets of this table,then the joins happen in the memory, or you can say map side join, which is much faster ..

You should choose partition when you will be using a lot of where clause, on the same column (partitioned column) , a good candidate for partitioned column can be : state,city , country, year, month, department,

Hive buckets vs Partitioning

2 Answers2