
We have a clustered transactional table (10k buckets) that seems to be inefficient for the following two use cases:

  • merges with daily deltas
  • queries based on date range.

What we want to do is partition the table by date, i.e. create a partitioned clustered transactional table. Daily volume suggests the number of buckets should be around 1-3, but inserting into the newly created table produces only number_of_buckets reduce tasks, which is too slow and causes issues with the merge on the reducers due to limited hard drive space.
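
A sketch of the kind of target DDL we have in mind is below; the table and column names (`target_table`, `id`, `payload`, `dt`) are illustrative placeholders rather than the actual schema, and 3 buckets is just the daily-volume estimate mentioned above:

```sql
-- Illustrative target table: partitioned by date, bucketed so ACID transactions work.
-- Names are placeholders, not the real schema.
CREATE TABLE target_table (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```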

Both issues are solvable (for instance, we could split the data into several chunks and start separate jobs that insert into the target table in parallel, using n_jobs * n_buckets reduce tasks in total, though it would result in several reads of the source table), but I believe there should be a right way to do this, so the question is: what is this right way?
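
For illustration, the chunked workaround could look like the sketch below, assuming dynamic-partition inserts are enabled and each parallel session handles a disjoint date range (table names and date ranges are placeholders):

```sql
-- Workaround sketch: N sessions run statements like this concurrently, each over a
-- disjoint date range, giving roughly n_jobs * n_buckets reduce tasks in total,
-- at the cost of each session scanning the source table again.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE target_table PARTITION (dt)
SELECT id, payload, dt
FROM   source_table
WHERE  dt >= '2018-01-01' AND dt < '2018-04-01';  -- each job gets its own range
```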

P.S. Hive version: 1.2.1000.2.6.4.0-91

Slava
  • Please have a look at this: https://stackoverflow.com/a/42842117/2700344 and this https://stackoverflow.com/a/51061613/2700344 and https://stackoverflow.com/a/48296562/2700344 – leftjoin Sep 10 '18 at 15:24
  • @leftjoin, thanks, but it looks like `hive.enforce.bucketing=true` overrides `hive.exec.reducers.bytes.per.reducer`, which is already set to ~64 MB. I'm not sure that I should change `mapreduce.job.reduces` either, due to bucketing, are you? The keys should not be skewed, because the tuple is unique, and all reducers are working anyway. So that doesn't seem to be it (the settings in question are recapped in the sketch after this comment thread). – Slava Sep 10 '18 at 17:09
  • Yes, the number of reducers = the number of buckets. Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables Have you tried doing it without buckets? You can use ORC files instead and increase the number of reducers. – leftjoin Sep 10 '18 at 17:19
  • I haven't tried to do that without buckets, but [tables must be bucketed](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Limitations) for transactions to work which is kind of a primary requirement. The underlying file format is already `orc`. – Slava Sep 10 '18 at 18:27
  • It's never a good idea to start with `10k buckets`. Could you shrink it down to `100`? – Gaurang Shah Sep 10 '18 at 19:02
  • Could you explain this statement: `but inserting into the newly created table produces number_of_buckets reduce jobs`? – Gaurang Shah Sep 10 '18 at 19:04
  • @gaurang, the number of reduce tasks seems to be equal to the number of buckets (sorry, it should have been *tasks* instead of *jobs*). The 10k buckets are a kind of legacy I have. Yes, I could shrink it down to 100, but I don't see how that would help me increase the parallelism of the final reduce stage. – Slava Sep 10 '18 at 19:33
  • @Slava too many buckets will hamper your performance. – Gaurang Shah Sep 10 '18 at 23:09
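
For reference, the settings touched on in the comments can be set per session as sketched below; per the discussion above, with a bucketed transactional table the reducer count still ends up pinned to the bucket count, so these alone do not raise parallelism (values are illustrative):

```sql
-- Settings referenced in the comment thread (illustrative values).
SET hive.enforce.bucketing=true;                    -- required for inserts into bucketed ACID tables
SET hive.exec.reducers.bytes.per.reducer=67108864;  -- ~64 MB per reducer; per the discussion, overridden when bucketing pins the count
SET mapreduce.job.reduces=100;                      -- likewise pinned to the bucket count here
```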

0 Answers