0

In our deployment, there are one thousand shards. The insertions are done via a distributed table with sharding jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see the query is sent to all shards and is executed on each shard. This is limiting our QPS (queries per second). Checking with Clickhouse document, it states:

SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly). 
When you add a new shard, you don’t have to transfer the old data to it. 
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.

You should be concerned about the sharding scheme in the following cases:

* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners). 
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard. 
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards. 
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them. 
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.

It seems there is a solution for such small queries as in our case (the second bullet above), but I am not clear about the point. Does it mean when querying for a specific query with predicate colX=..., I need to find the corresponding "layer" that contains its rows and then query on the corresponding distributed table for this layer?

Is there a way to query on the global distributed table for these small queries?

Hieu Nguyen
  • 382
  • 2
  • 15
  • did you consider using the setting [optimize_skip_unused_shards](https://clickhouse.tech/docs/en/operations/settings/settings/#optimize-skip-unused-shards) and others shards-related ones? – vladimir Feb 18 '21 at 22:58
  • look at https://stackoverflow.com/questions/56589290/how-yandex-implemented-2-layered-sharding – vladimir Feb 18 '21 at 23:06
  • Use `optimize_skip_unused_shards` works. Thank you very much! It would be best if the document refers to this setting in the above paragraph. – Hieu Nguyen Feb 19 '21 at 07:10
  • 1
    feel free to edit the official documentation of ClickHouse - it is open to doing it – vladimir Feb 19 '21 at 18:01

0 Answers0