I'm using the following ngram tokenizer to process 15,000 documents (expected to grow to up to a million), each with up to 6,000 characters of text (avg. 100-200). I use 2-8 grams as a catch-all because the index needs to support all languages. 1 QPS should be sufficient (not many concurrent users), so performance is not a priority as long as each search takes ~200 ms on average.
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 8,
"token_chars": [
"letter",
"digit"
]
}
}
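For completeness, here is a sketch of how this sits in the full index settings, with the tokenizer wrapped in a custom analyzer (the index name, analyzer name and lowercase filter are just illustrative). One thing worth noting from the ngram tokenizer docs: the index-level max_ngram_diff setting defaults to 1, so it has to be raised to at least 6 for this min_gram/max_gram combination, otherwise index creation is rejected.

PUT /my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 6
    },
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}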
The tokenizer needs to work with all languages, including CJK, hence the ngram approach. The alternative is to use analyzer plugins for the CJK languages (and maybe others), which would produce fewer tokens, but I'd prefer a one-size-fits-all approach if at all possible.
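To get a feel for the difference, token counts can be compared with the _analyze API, e.g. running the same CJK sample through the ngram analyzer above and through the built-in cjk analyzer (the sample text is arbitrary):

POST /my_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "全文検索のサンプルテキスト"
}

POST /my_index/_analyze
{
  "analyzer": "cjk",
  "text": "全文検索のサンプルテキスト"
}

The ngram version returns far more tokens for the same input than the bigram-based cjk analyzer, which is where the size difference comes from.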
The largest sample document produces around 10,000 tokens with the ngram tokenizer above, a bit over a megabyte in size. If that is an issue, I can probably cap the length of the text the tokens are based on. With only around 15,000 documents, search is currently fast enough, but I don't know how this scales with the number of documents. Is this a reasonable amount? Does Elasticsearch have any documented recommendations or limits for the maximum number of tokens?
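If a cap does become necessary, one option I'm considering (just a sketch, the max_token_count value is a guess) would be a limit token filter in the analyzer chain instead of trimming the source text:

"filter": {
  "token_limit_filter": {
    "type": "limit",
    "max_token_count": 10000
  }
},
"analyzer": {
  "ngram_analyzer": {
    "type": "custom",
    "tokenizer": "ngram_tokenizer",
    "filter": ["lowercase", "token_limit_filter"]
  }
}

As I understand it, this silently drops tokens past the limit rather than failing indexing.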
Some more info: memory-optimized deployment (ES Cloud), 2 zones, 4 GB storage and 2 GB RAM per zone, 162 shards. Memory pressure is around 30%.