I have a Hive table partitioned by a column that cannot be changed. As a result, the partitions are skewed: some hold about 1 TB of data, whereas others hold less than 50 GB. The smaller partitions are unlikely to be queried often. We plan to bucket this table into 128 buckets, which will leave the smaller partitions with too many small files (under 400 MB each for a partition below 50 GB). That is counterproductive for the smaller partitions but beneficial for the larger ones.
Is there a way to choose the number of buckets based on the partition size, so that smaller partitions get fewer buckets?
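To make the sizing I have in mind concrete, here is a rough Python sketch of the arithmetic, assuming a target of roughly 1 GB per bucket and a cap of 128 buckets. The function name `suggest_bucket_count` and the rounding up to a power of two (so that every count divides 128) are my own assumptions for illustration. This is not something Hive provides, and it only computes a number; it does not apply anything to a table.

```python
GB = 1024 ** 3  # bytes in one gibibyte

def suggest_bucket_count(partition_bytes, target_bucket_bytes=GB, max_buckets=128):
    """Hypothetical helper: pick a power-of-two bucket count so that each
    bucket holds at most roughly target_bucket_bytes, capped at max_buckets."""
    if partition_bytes <= 0:
        return 1
    # Ceiling division: how many target-sized buckets the partition would fill.
    needed = -(-partition_bytes // target_bucket_bytes)
    # Round up to the next power of two using integer arithmetic only.
    buckets = 1 << (needed - 1).bit_length()
    return min(buckets, max_buckets)

print(suggest_bucket_count(50 * GB))    # 64
print(suggest_bucket_count(1024 * GB))  # 128 (capped)
```

With these assumptions, a 50 GB partition would get 64 buckets of under 1 GB each, while a 1 TB partition would hit the 128 cap at about 8 GB per bucket.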
Concatenating the bucket files is one option for the smaller partitions, but for certain partitions we still need bucketing with a reasonable bucket size, say 1 GB, rather than concatenating everything and turning off bucketing altogether. Please advise.
Reference on concatenating: "How do I Combine or Merge Small ORC files into Larger ORC file?"