Understanding the # of buckets for my SnappyData table?

Question

The default # of buckets is 113. Why? Why not 110? Does the bucket logic perform better with a certain "divisible by" value?

There are a lot of examples in SnappyData with fewer buckets. Why is that? What logic went into deciding to use fewer buckets than the default 113?

What are the implications of choosing fewer? What about more buckets? I see a lot of logging in my Spark SQL queries looking for data in each bucket. Is query performance worse with more buckets?

score 2 · Answer 1 · edited May 23 '17 at 12:08


Follow these guidelines to calculate the total number of buckets for the partitioned table:

Use a prime number. We use a hashing function internally, and a prime bucket count provides the most even distribution. See this post for more details: Why use a prime number in hashCode?
Make it at least four times as large as the number of data stores you expect to have for the table. The larger the ratio of buckets to data stores, the more evenly the load can be spread across the members.
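
To illustrate the first guideline, here is a minimal sketch that compares 110 and 113 buckets. It assumes a simplified model in which an integer key is its own hash and the bucket is `key % num_buckets`; SnappyData's actual hash function is not shown here, and the keys are hypothetical.

```python
from collections import Counter


def bucket_spread(keys, num_buckets):
    """Return (buckets used, size of the largest bucket) for key % num_buckets.

    Simplified model: each integer key is treated as its own hash value.
    """
    counts = Counter(key % num_buckets for key in keys)
    return len(counts), max(counts.values())


# Hypothetical keys that share a common factor (all multiples of 10).
keys = range(0, 10_000, 10)

used_110, largest_110 = bucket_spread(keys, 110)
used_113, largest_113 = bucket_spread(keys, 113)

print(f"110 buckets: {used_110} used, largest holds {largest_110} keys")
print(f"113 buckets: {used_113} used, largest holds {largest_113} keys")
```

With these keys, only 11 of the 110 buckets receive any data, because 10 and 110 share a common factor, while all 113 buckets are used when the count is prime.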

Note that there is a trade-off between load balancing and overhead, however. Managing a bucket introduces significant overhead, especially with higher levels of redundancy.
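
As a rough sketch of both guidelines and the trade-off, the helper below picks the smallest prime that is at least four times the server count and then estimates how many bucket copies the cluster would manage. The function names are hypothetical, and the sketch assumes that a redundancy of N means N extra copies of every bucket.

```python
def is_prime(n):
    """Trial-division primality test; adequate for small bucket counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def suggest_buckets(num_servers, ratio=4):
    """Smallest prime >= ratio * num_servers (hypothetical helper)."""
    candidate = max(2, num_servers * ratio)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def total_bucket_copies(num_buckets, redundancy):
    """Primary buckets plus redundant copies (assumes N extra copies each)."""
    return num_buckets * (1 + redundancy)


servers = 8  # hypothetical cluster size
buckets = suggest_buckets(servers)
print(buckets, total_bucket_copies(buckets, redundancy=1))
```

For 28 expected servers the same helper returns 113, so the default is consistent with a cluster of roughly that size under these guidelines.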

edited May 23 '17 at 12:08

Community


answered Aug 25 '16 at 07:51

Yogesh Mahajan


Can you clarify what you mean by "number of data stores"? Are you talking about total SnappyData Store servers or the REDUNDANCY property when I define the table DDL? – Jason Aug 25 '16 at 17:46
Yes, it means the total servers configured (or expected to be configured if expanding the cluster in the future). – Sumedh Aug 26 '16 at 14:54

score 1 · Answer 2 · edited Aug 25 '16 at 15:20

We chose a prime number because it is the most efficient at distributing data under hash-based partitioning. The number of buckets has some impact on query performance: because buckets are translated to Spark tasks, a higher number of buckets adds task-scheduling overhead.

But if your cluster has more capacity in terms of the number of CPUs, you should certainly try to match the number of buckets to a nearby prime number.
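
One way to read this advice is to pick the prime closest to the total number of CPU cores in the cluster. The sketch below follows that interpretation, which is an assumption rather than a documented rule; the helper names and cluster sizes are hypothetical.

```python
from itertools import count


def is_prime(n):
    """Trial-division primality test; adequate for small bucket counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def nearest_prime(target):
    """Return the prime closest to target; ties go to the smaller prime."""
    for offset in count(0):
        if is_prime(target - offset):
            return target - offset
        if is_prime(target + offset):
            return target + offset


# Hypothetical cluster: 4 servers with 16 cores each.
total_cores = 4 * 16
print(nearest_prime(total_cores))
```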
