I'm using the following ngram tokenizer to process 15,000 documents (expected to grow to up to a million), each with up to 6,000 characters of text (avg. 100-200). I use 2-8 grams as a catch-all because the index needs to support all languages. 1 QPS should be sufficient (not many concurrent users), so performance is not a priority as long as each search takes ~200 ms on average.
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 8,
"token_chars": [
"letter",
"digit"
]
}
}
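For completeness, here is a sketch of how this sits in the full index settings, with the tokenizer wrapped in a custom analyzer (the index name, analyzer name and lowercase filter are just illustrative). One thing worth noting from the ngram tokenizer docs: the index-level max_ngram_diff setting defaults to 1, so it has to be raised to at least 6 for this min_gram/max_gram combination, otherwise index creation is rejected.

PUT /my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 6
    },
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}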
The tokenizer needs to work with all languages, including CJK, hence the ngram approach. The alternative is to use analyzer plugins for the CJK languages (and maybe others), which would produce fewer tokens, but I'd prefer a one-size-fits-all approach if at all possible.
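To get a feel for the difference, token counts can be compared with the _analyze API, e.g. running the same CJK sample through the ngram analyzer above and through the built-in cjk analyzer (the sample text is arbitrary):

POST /my_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "全文検索のサンプルテキスト"
}

POST /my_index/_analyze
{
  "analyzer": "cjk",
  "text": "全文検索のサンプルテキスト"
}

The ngram version returns far more tokens for the same input than the bigram-based cjk analyzer, which is where the size difference comes from.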
The largest sample document produces around 10,000 tokens with the ngram tokenizer above, a bit over a megabyte in size. If that is an issue, I can probably cap the length of the text the tokens are based on. With only around 15,000 documents, search is currently fast enough, but I don't know how this scales with the number of documents. Is this a reasonable amount? Does Elasticsearch have any documented recommendations or limits for the maximum number of tokens?
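If a cap does become necessary, one option I'm considering (just a sketch, the max_token_count value is a guess) would be a limit token filter in the analyzer chain instead of trimming the source text:

"filter": {
  "token_limit_filter": {
    "type": "limit",
    "max_token_count": 10000
  }
},
"analyzer": {
  "ngram_analyzer": {
    "type": "custom",
    "tokenizer": "ngram_tokenizer",
    "filter": ["lowercase", "token_limit_filter"]
  }
}

As I understand it, this silently drops tokens past the limit rather than failing indexing.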
Some more info: memory-optimized deployment (ES Cloud), 2 zones, 4 GB storage and 2 GB RAM per zone, 162 shards. Memory pressure is around 30%.