In our deployment, there are one thousand shards. The insertions are done via a distributed table with sharding jumpConsistentHash(colX, 1000)
. When I query for rows with colX=...
and turn on send_logs_level='trace'
, I see the query is sent to all shards and is executed on each shard. This is limiting our QPS (queries per second). Checking with Clickhouse document, it states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for such small queries as in our case (the second bullet above), but I am not clear about the point. Does it mean when querying for a specific query with predicate colX=...
, I need to find the corresponding "layer" that contains its rows and then query on the corresponding distributed table for this layer?
Is there a way to query on the global distributed table for these small queries?