2

In Apache Hive, what does the directory structure look like after a huge dataset is partitioned and then bucketed?

For example, I have a customer dataset for a country; the data is partitioned by state and then bucketed by city. How can we find out how many files will be present in a city bucket?

Mahesh Khatai

2 Answers

1

A partition is a directory, and each partition corresponds to a specific value of the partition column (in your case, one directory per state, named like state=<value>).

Within a table or a partition directory, buckets are stored as files. The number of buckets is fixed when the table is created with CLUSTERED BY (col) INTO K BUCKETS. There will be ONE file for each individual bucket. Hive assigns a record to a bucket by computing a hash of the bucketing column and taking it modulo the number of buckets K, so a bucket is not tied to a single city: several cities can hash into the same bucket file.
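To make the layout concrete, here is a minimal Python sketch of how rows could map to files. It is an illustration under stated assumptions, not Hive's actual code: the table name `customers`, the value K = 4, and the sample rows are made up, the hash is a Java `String.hashCode`-style stand-in (the real hash function depends on the Hive version and column type), and the `000000_0` file naming follows what Hive commonly writes. The sketch lists only non-empty buckets.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # K in "CLUSTERED BY (city) INTO K BUCKETS"; illustrative value


def java_style_hash(text: str) -> int:
    """Stand-in hash with Java String.hashCode semantics (signed 32-bit)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Clear the sign bit, then take the hash modulo the number of buckets."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


def layout(rows, table_dir="customers"):
    """Map each (state, city) row to <table>/state=<state>/<bucket file>."""
    files = defaultdict(list)
    for state, city in rows:
        path = f"{table_dir}/state={state}/{bucket_for(city):06d}_0"
        files[path].append(city)
    return dict(files)


# Hypothetical sample rows: (state, city)
rows = [
    ("CA", "Fresno"), ("CA", "Oakland"), ("CA", "Fresno"),
    ("TX", "Austin"), ("TX", "Dallas"), ("TX", "Houston"),
]

for path, cities in sorted(layout(rows).items()):
    print(path, cities)
```

Each state directory ends up with at most K bucket files, and every row for a given city lands in the same file within its state directory.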

damientseng
0

The maximum number of buckets is 256. For more details, kindly refer to the link below:

What is the difference between partitioning and bucketing a table in Hive?

saravanatn