How yandex implemented 2 layered sharding

Question

In the clickhouse documentation, there is a mention of Yandex.Metrica, implementing Bi-Level sharding.

"Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them."

Is there a detailed implementation for this sharding scheme, documented some place.

score 4 · Answer 1 · answered Jun 21 '19 at 07:16

Logically Yandex.Metrica has only one high-cardinality ID column that serves as main sharding key.

By default SELECTs from table with Distributed engine requests partial results from one replica of each shard. If you have like hundreds servers or more, it's a lot of network communication to query all shards (probably 1/2 or 1/3 of all servers) which might introduce more latency than the actual query execution. The reason for this behavior is that ClickHouse allows to write data directly to shards (bypassing Distributed engine and it's configured sharding key) and the application that does it is not forced to comply with sharding key of Distributed table (it can chose differently to spread data more evenly or by whatever other reason).

So the idea of that bi-level sharding is to split large cluster into smaller sub-clusters (10-20 servers each) and make most SELECT queries go through a Distributed tables that are configured against sub-clusters, thus making less network communication necessary and lowering the impact of possible stragglers. Global Distributed tables for whole large cluster is also configured for some ad-hoc or overview style queries, but they are not so frequent and have lower latency requirements. This still leaves freedom for the application that writes data to balance it arbitrarily between shards forming sub-cluster (by writing directly to them). But to make this all work together applications that write and read data need to have a consistent mapping from whatever high-cardinality ID is used (CounterID in case of Metrica) to sub-cluster ID and hostnames it consists of. Metrica stores this mapping in MySQL, but in other cases something else might look more applicable.

Alternative approach is to use "optimize_skip_unused_shards" setting that makes SELECT queries which have a condition on sharding key of Distributed table to skip shards that are not supposed to have data. It introduces the requirement for data to be distributed between shards exactly as if it was written through this Distributed table or the report will not include some misplaced data.

How yandex implemented 2 layered sharding

1 Answers1

Linked