Can we define a methodology for deciding whether to use bucketing or partitioning?
Possible duplicate of [What is the difference between partitioning and bucketing a table in Hive ?](http://stackoverflow.com/questions/19128940/what-is-the-difference-between-partitioning-and-bucketing-a-table-in-hive) – Ankur Alankar Biswal Sep 16 '16 at 09:59
1 Answer
Partitioning in Hive is a way of segregating a table's data into multiple files/directories. But partitioning gives effective results only when:
- There is a limited number of partitions
- The partitions are of comparatively equal size
This may not be possible in all scenarios. For example, when we partition a table by geographic location such as country, a few large countries will produce large partitions (e.g. 4-5 countries alone contributing 70-80% of the total data), whereas small countries will produce small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In such cases partitioning is not ideal.
To overcome the problem of over-partitioning, Hive provides bucketing, another technique for decomposing table data sets into more manageable parts.
Bucketing is based on (hash function of the bucketed column) mod (total number of buckets). The hash function depends on the type of the bucketing column.
Records with the same value in the bucketed column are always stored in the same bucket. Physically, each bucket is just a file in the table directory, and bucket numbering is 1-based.
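The hash-mod rule can be sketched in Python. This is only an illustration under stated assumptions: `toy_hash` and `bucket_for` are hypothetical names, and the hash is a simple stand-in, not Hive's actual implementation.

```python
from collections import Counter


def toy_hash(value):
    """Illustrative type-dependent hash (an assumption, not Hive's real one)."""
    if isinstance(value, int):
        return value  # integers hash to themselves in this sketch
    h = 0
    for byte in str(value).encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF  # keep the running hash within 32 bits
    return h


def bucket_for(value, num_buckets):
    """Return a 0-based bucket index: hash(value) mod num_buckets.

    Add 1 to the result if a 1-based bucket number is wanted.
    """
    return (toy_hash(value) & 0x7FFFFFFF) % num_buckets


# Hypothetical data: 100 consecutive user ids spread over 4 buckets.
user_ids = range(1000, 1100)
counts = Counter(bucket_for(uid, 4) for uid in user_ids)
print(sorted(counts.items()))
```

Because the bucket depends only on the column value, equal values always land in the same bucket, and a high-cardinality, evenly spread column fills the buckets roughly equally.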
Bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
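As a rough way to turn these criteria into a check, one could profile a candidate column for cardinality and skew. The sketch below is an assumption-laden heuristic, not a Hive feature: the function names and the thresholds `max_partitions` and `max_share` are hypothetical and should be tuned to the cluster and data.

```python
from collections import Counter


def column_profile(values):
    """Return the number of distinct values and the share of the most common one."""
    counts = Counter(values)
    if not counts:
        raise ValueError("column sample is empty")
    total = sum(counts.values())
    return {
        "distinct": len(counts),
        "largest_share": max(counts.values()) / total,
    }


def suggest_layout(values, max_partitions=1000, max_share=0.5):
    """Suggest a layout for one column; both thresholds are illustrative guesses."""
    profile = column_profile(values)
    if profile["distinct"] > max_partitions:
        return "bucket"  # high cardinality: too many partitions otherwise
    if profile["largest_share"] > max_share:
        return "skewed"  # few values, but a handful dominate the data
    return "partition"  # limited number of comparably sized partitions


# Hypothetical samples of a country column and a user id column.
print(suggest_layout(["US"] * 70 + ["IN"] * 20 + ["FR"] * 10))
print(suggest_layout(range(5000)))
```

A "skewed" result corresponds to the country example: the column has few distinct values, but the partitions would be very unequal, so a different, more evenly distributed column may be a better choice.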
