
I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10000 hits limit.

I have read the changelog and the documentation, and I also found a single blog post from a company that tried this new feature (track_total_hits) and measured their performance gains.

But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?

I especially can't get my head around it when sorting is involved, because (at least in my world) sorting a collection means visiting every document, and that's exactly what they are trying to avoid according to the documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."

Hopefully someone can explain how things work under the hood and which important point I am missing.

Benjamin M
  • The whole history of that feature is described [in this GitHub issue](https://github.com/elastic/elasticsearch/issues/33028) – Val Mar 01 '21 at 05:09
  • @Val Thanks for the link. The first sentence of the issue sounds like this feature only works for "sort by score", but I'm not entirely sure. However, there's no explanation of how it actually works. – Benjamin M Mar 01 '21 at 09:37

1 Answer


There are at least two different contexts in which not all documents need to be sorted:

A. When index sorting is configured, the documents are already stored in sorted order within the index segment files. So whenever a query specifies the same sort as the one with which the index was pre-sorted, only the top N documents of each segment file need to be visited and returned. In this case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization since there's no need to visit all the documents of the index.
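To make case A concrete, here is a minimal sketch of the two request bodies involved, written as Python dicts so they can be inspected without a cluster. The field name `timestamp` is a hypothetical example, not something from the question; the setting names `index.sort.field` and `index.sort.order` and the `track_total_hits` search parameter are the ones Elasticsearch 7 uses, but treat the rest as an illustration under those assumptions.

```python
import json

# Hypothetical field name, chosen only for illustration.
SORT_FIELD = "timestamp"

# Body for index creation. Index sorting has to be declared when the
# index is created; it cannot be added to an existing index.
create_index_body = {
    "settings": {
        "index": {
            "sort.field": SORT_FIELD,
            "sort.order": "desc",
        }
    },
    "mappings": {
        "properties": {SORT_FIELD: {"type": "date"}}
    },
}

# Search body that asks for the same sort as the index sort and
# disables hit counting, so each segment can stop after its top N docs.
search_body = {
    "size": 10,
    "sort": [{SORT_FIELD: "desc"}],
    "track_total_hits": False,
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

The important part is that the `sort` in the search body matches the index sort exactly (same field, same order); with a different sort, the pre-sorted segments do not help.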

B. When querying in the filter context (i.e. bool/filter), no scores are calculated. The index is simply checked for documents that match a yes/no question, and that process is usually very fast. Since there is no scoring, only the top N matching documents are returned per shard.
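For case B, a filter-context query could look roughly like the sketch below, again as a Python dict. The field names `status` and `created_at` and their values are made up for the example; only the `bool`/`filter` structure is the point.

```python
import json

# Hypothetical fields and values; every clause sits under bool/filter,
# so each one is a yes/no match and no relevance score is computed.
filter_search_body = {
    "size": 10,
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"created_at": {"gte": "2021-01-01"}}},
            ]
        }
    },
}

print(json.dumps(filter_search_body, indent=2))
```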

If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.

If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
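The three settings of track_total_hits can be illustrated with a toy model of one shard's counting loop. This is not Lucene's or Elasticsearch's actual code, just a sketch under the assumption that matches arrive one at a time from an iterator; `count_hits` and `matching_docs` are hypothetical names. The returned dict mimics the `value`/`relation` shape of `hits.total` in Elasticsearch 7 responses.

```python
from itertools import islice


def count_hits(matching_docs, track_total_hits):
    """Toy model of hit counting on a single shard.

    track_total_hits may be False (do not count), True (count exactly)
    or an integer N (count accurately only up to N).
    """
    # Check the booleans first: in Python, bool is a subclass of int.
    if track_total_hits is False:
        return None  # nothing is counted, nothing is visited for counting
    if track_total_hits is True:
        return {"value": sum(1 for _ in matching_docs), "relation": "eq"}

    limit = track_total_hits
    # Pull at most limit + 1 matches: one extra is enough to know
    # that there are more matches than the limit.
    seen = sum(1 for _ in islice(matching_docs, limit + 1))
    if seen > limit:
        return {"value": limit, "relation": "gte"}
    return {"value": seen, "relation": "eq"}


visited = []


def docs(n):
    # Generator that records every document the counter actually touches.
    for i in range(n):
        visited.append(i)
        yield i


total = count_hits(docs(1_000_000), 100)
print(total, "visited:", len(visited))
```

In this toy run only 101 of the one million matches are touched before counting stops, which is the saving described above. In a real query the saving only materializes if nothing else (sorting on a non-indexed order, aggregations, and so on) forces all matches to be visited anyway.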


Val
  • Thanks for the explanations. Still lots of question marks in my mind... Point `A` seems reasonable if there's no filtering involved, because without filtering every document in the index is a match, so there's no need to count. Point `B` I (maybe?) don't fully understand: you mean that the index simply returns the first `n` matching documents? But I think that's basically the same as Point `A`, because in both cases ES simply returns the first `n` matching documents, without any sorting involved. – Benjamin M Mar 03 '21 at 06:31
  • In case A, the index is already sorted as documents get indexed, so if the query sort is the same as the index sort, then there's no need to sort at query time, ES already knows the right order. – Val Mar 03 '21 at 06:36
  • What you need to understand is that `track_total_hits` is just one optimization in very specific cases. It's not a feature that can be applied on a broad set of cases thinking that it will speed up all kinds of queries. For instance, as soon as you have an aggregation, `track_total_hits` has no effect (only on the hits count) since all matching documents must be visited anyway to compute the aggregations – Val Mar 03 '21 at 06:38
  • What still keeps me wondering is the first paragraph of issue 33028: "skipping documents that do not produce competitive scores". How does this work? And most importantly: in which cases does this feature give **no** benefit? According to your explanation I'd say: query-time sorting (non-pre-sorted index), filtering with scripts (using `doc['foo']`), and maybe even more(?). Basically everything that involves query-time sorting or doc value access. – Benjamin M Mar 03 '21 at 06:40
  • The second part is correct: there is no benefit when all documents need to be visited anyway. Regarding "competitive score", you might give [this link](https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand) a shot to see how the scoring can be done without visiting all docs. It also happens to explain in detail how that relates to `track_total_hits` – Val Mar 03 '21 at 06:49
  • Wow. Sometimes you can't see the forest for the trees :D That MAXSCORE algorithm is super easy and makes perfect sense. Now I have a better understanding of how `track_total_hits` can improve performance, even when using filter. Thank you! – Benjamin M Mar 03 '21 at 07:12