We have a clustered transactional table (10k buckets) that turns out to be inefficient for the following two use cases (a sketch of the layout follows the list):
- merges with daily deltas
- queries based on date range.
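For reference, a minimal sketch of what such a table looks like; the table name `events` and the columns `id`, `tx_date`, `payload` are placeholders, only the bucketing/ACID layout matters:

```sql
-- Current layout (sketch): non-partitioned ACID table, heavily bucketed.
CREATE TABLE events (
  id      BIGINT,
  tx_date STRING,   -- 'yyyy-MM-dd'
  payload STRING
)
CLUSTERED BY (id) INTO 10000 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```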
What we want to do is partition the table by date, i.e. create a partitioned clustered transactional table. The daily volume suggests something like 1-3 buckets, but inserting into the newly created table produces number_of_buckets reduce tasks, which is too slow and also causes problems in the reducers' merge phase due to limited local disk space.
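A minimal sketch of the intended target table and the load we tried, assuming the same placeholder names as above and a dynamic-partition insert (the 2-bucket count is just an example):

```sql
-- Target layout (sketch): partitioned by date, few buckets per partition.
CREATE TABLE events_part (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (tx_date STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.enforce.bucketing = true;

-- This is the insert that ends up with only number_of_buckets reduce tasks.
INSERT INTO TABLE events_part PARTITION (tx_date)
SELECT id, payload, tx_date
FROM events;
```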
Both issues are solvable. For instance, we could split the data into several chunks and start separate jobs inserting into the target table in parallel, using n_jobs * n_buckets reduce tasks in total (sketched below), although that would mean reading the source table several times. Still, I believe there should be a proper way to do this, so the question is: what is that right way?
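To illustrate that workaround (not what we want to end up with): each job covers a disjoint date range, so the jobs run in parallel with their own reducers, at the cost of one source scan per job; the date boundaries are made up:

```sql
-- Job 1 (submitted as a separate session/script)
INSERT INTO TABLE events_part PARTITION (tx_date)
SELECT id, payload, tx_date
FROM events
WHERE tx_date >= '2018-01-01' AND tx_date < '2018-04-01';

-- Job 2 (another session, run concurrently)
INSERT INTO TABLE events_part PARTITION (tx_date)
SELECT id, payload, tx_date
FROM events
WHERE tx_date >= '2018-04-01' AND tx_date < '2018-07-01';
```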
P.S. Hive version: 1.2.1000.2.6.4.0-91