
We have a cluster of workers that send indexing requests to a 4-node Elasticsearch cluster. The documents are indexed as they are generated, and since the workers have a high degree of concurrency, Elasticsearch is having trouble handling all the requests. To give some numbers, the workers process up to 3,200 tasks at the same time, and each task usually generates about 13 indexing requests. This generates an instantaneous rate that is between 60 and 250 indexing requests per second.

From the start, Elasticsearch had problems and requests were timing out or returning 429. To get around this, we increased the timeout on our workers to 200 seconds and increased the write thread pool queue size on our nodes to 700.

That's not a satisfactory long-term solution though, so I was looking for alternatives. I noticed that when I copied an index within the same cluster with elasticdump, the write thread pool was almost empty, and I attribute that to the fact that elasticdump batches indexing requests and (probably) uses the bulk API to communicate with Elasticsearch.

That gave me the idea of writing a buffer that receives requests from the workers, batches them into groups of 200-300, and then sends each group to Elasticsearch as a single bulk request.
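For illustration, here is a rough sketch of what I have in mind, in Python with the official `elasticsearch` client and its `bulk` helper (the index name, batch size, and document shape are placeholders, and error handling is minimal):

```python
import threading

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # adjust hosts/auth for your cluster


class BulkBuffer:
    """Collects single documents and flushes them as one _bulk request."""

    def __init__(self, index, batch_size=250):
        self.index = index
        self.batch_size = batch_size
        self._docs = []
        self._lock = threading.Lock()

    def add(self, doc):
        # Workers call this instead of issuing one index request per document.
        with self._lock:
            self._docs.append(doc)
            if len(self._docs) < self.batch_size:
                return
            batch, self._docs = self._docs, []
        self._flush(batch)

    def _flush(self, batch):
        # One _bulk call replaces `batch_size` individual index requests.
        actions = ({"_index": self.index, "_source": d} for d in batch)
        success, errors = bulk(es, actions, raise_on_error=False)
        if errors:
            print(f"{len(errors)} documents failed, e.g. {errors[:3]}")


buffer = BulkBuffer("my-index")    # hypothetical index name
buffer.add({"field": "value"})     # called from each worker
```

In practice the buffer would also need a time-based flush (so a half-full batch doesn't sit around forever) and retry/backoff on 429s; the client's `streaming_bulk`/`parallel_bulk` helpers already cover a lot of this.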

Does such a thing already exist, and does it sound like a good idea?

rubik
  • https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html – hamid bayat Aug 03 '19 at 07:19
  • Maybe the write thread_pool is empty but the search thread_pool is full? They share the same resources. – hamid bayat Aug 03 '19 at 07:20
  • You have four nodes, and you should not have more than 2 shards per index if you are using a replica. – hamid bayat Aug 03 '19 at 07:22
  • @hamidbayat My 4 nodes have quite a few indices on them. But these workers only write to two of them. Here are the numbers: the first one has 1 shard with 1 replica, the second one has 4 shards with 1 replica each. I suspect the second one is the one causing problems. The second one is quite big, at 5.2TB in total. Do you suggest adding more nodes? – rubik Aug 03 '19 at 07:42
  • In my second index each shard is more than 500GB, I think that's part of the problem. – rubik Aug 03 '19 at 08:12
  • @rubik Did you find any solution? How can I troubleshoot this issue? Is there a common algorithm, way to do it? – Dmytro Chasovskyi Nov 18 '20 at 14:18
  • @DmytroChasovskyi I ended up batching requests as I outlined in the post and it alleviated the issue. Then I had to scale the cluster. – rubik Nov 18 '20 at 15:19
  • @DmytroChasovskyi @rubik Yes, elasticdump (much like e.g. Logstash) leverages the [`_bulk` API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) in order to play nice with ES and optimize the resources. By cleverly leveraging the bulk API, it's rarely necessary to do any tweaking on the thread pools. – Val Nov 18 '20 at 15:43

1 Answer


First of all, to troubleshoot the issue or find the root cause, it's important to understand what happens behind the scenes when you send an indexing request to Elasticsearch.

Elasticsearch has several thread pools, but indexing requests (single and bulk) go through the write thread pool. Please check this against your Elasticsearch version, as Elastic keeps changing the thread pools (earlier releases had separate thread pools for single and bulk requests, with different queue capacities).
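For example, you can watch the write thread pool's queue depth and rejections with the `_cat/thread_pool` API; a quick sketch over plain HTTP (adjust the host and authentication for your cluster):

```python
import requests

# Show, per node, how many write threads are active, how deep the queue is,
# and how many requests have been rejected so far.
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
)
print(resp.text)
```

A steadily growing `rejected` count is the clearest sign that the queue capacity is being exceeded.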

The write thread pool's queue capacity was increased significantly in ES 7.9, from 200 (the value in earlier releases) to 10,000. There are likely two reasons for this:

  1. Elasticsearch now prefers to buffer more indexing requests instead of rejecting them.
  2. A larger queue means more latency, but it's a trade-off that reduces data loss when the client doesn't have a retry mechanism.

I assume you have not yet moved to ES 7.9, where the capacity was increased, but you can increase the size of this queue gradually and allocate more processors (if you have spare CPU capacity) through the config change mentioned in this official example. This is a debatable topic, and a lot of people consider it a band-aid rather than a proper fix, but now that Elastic themselves have increased the queue size, you can try it too, and it makes even more sense if the increased traffic only comes in short bursts.
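As a rough sketch of what such a config change looks like (the values are illustrative, not recommendations), the relevant static settings go into `elasticsearch.yml` on each data node and take effect only after a restart:

```yaml
# Illustrative values only -- size them according to your own heap and CPU headroom.
thread_pool.write.queue_size: 1000   # default was 200 before 7.9, 10000 from 7.9 onwards
node.processors: 8                   # named `processors` in older releases; only raise it
                                     # if the node really has that many cores available
```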

Another critical thing is to find out the root cause of why your ES nodes are queuing up so many requests. It can be legitimate, e.g. indexing traffic has increased and the infrastructure has reached its limit; but if it's not, you can have a look at my short tips to improve one-time indexing performance and overall indexing performance. By implementing these tips you will get a better indexing rate, which will reduce the pressure on the write thread pool queue.
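For reference, two of the index-level settings that typically show up in such indexing-speed tips (they also appear in the official tune-for-indexing-speed guide linked in the comments); whether they are acceptable depends on how fresh your search results need to be, and `my-index` is a placeholder:

```python
import requests

# Dynamic index settings, applied without a restart.
settings = {
    "index": {
        # Default is 1s; refreshing less often makes indexing noticeably cheaper.
        "refresh_interval": "30s",
        # For one-time/initial loads only -- drop replicas during the load
        # and restore them afterwards.
        # "number_of_replicas": 0,
    }
}
resp = requests.put("http://localhost:9200/my-index/_settings", json=settings)
print(resp.json())
```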

Edit: As mentioned by @Val in the comments, if you are indexing docs one by one, then moving to the bulk index API will give you the biggest boost.
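If you prefer plain HTTP over a client helper, the `_bulk` request body is just newline-delimited JSON, alternating an action line and a document line. A minimal sketch (index name and documents are placeholders):

```python
import json

import requests

docs = [{"field": "value1"}, {"field": "value2"}]

# Each document is preceded by an action/metadata line.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "my-index"}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"  # the trailing newline is required

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # True if any item in the batch failed
```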

Amit
  • I think the key point was that the OP was not using the bulk API, but was indexing every single document one by one, which is not optimal in terms of threading and networking. By cleverly leveraging the bulk API, it's rarely necessary to tweak the thread pools. – Val Nov 18 '20 at 15:45
  • @Val Thanks for your comment, I agree, and using the bulk API is suggested in the tips I shared. If you notice, this question was revived by another user who didn't mention how he is indexing the data, so I wrote a general answer and also mentioned that it's debatable; but if you have exhausted all other options, this is also a trade-off you can make. – Amit Nov 18 '20 at 15:50
  • @Val, btw Elastic also increased the size of the queue directly from 200 to 10k :D – Amit Nov 18 '20 at 15:50
  • Ok, fair enough, I was just mentioning this because the original OP (rubik) answered the new OP (DmytroChasovskyi) in the general comments :-) – Val Nov 18 '20 at 15:52
  • @Val, also added a section on the bulk API to the answer – Amit Nov 18 '20 at 15:52
  • No worries, everything is great :-) – Val Nov 18 '20 at 15:53
  • @Val, yeah I noticed that, but the new OP still has to confirm how he is indexing the documents :). Anyway, do you have some info on why Elastic drastically changed the size from 200 to 10k? – Amit Nov 18 '20 at 15:55
  • Yes, that was done [here](https://github.com/elastic/elasticsearch/pull/59559), the reason was to allow a greater number of pending indexing requests since they added [additional memory limits](https://github.com/elastic/elasticsearch/issues/59263) – Val Nov 18 '20 at 16:50
  • Only a note: the queue size of the write thread_pool was already increased in version 7.9 of Elasticsearch (not 7.10). – Briomkez Nov 18 '20 at 19:38
  • @ElasticsearchNinja Thanks for the reply. In my case it is Elasticsearch 7.4, which I have no control over. I have already utilized the bulk API, and the issue is that it still throttles from time to time when the number of elements exceeds 500k in total size. It fails randomly, and I empirically found that the best batch size is 5 elements. I am wondering whether I'd better ask this question separately, as your answer is helpful but doesn't solve my issue 100%? – Dmytro Chasovskyi Nov 18 '20 at 20:59
  • @Briomkez, yeah, I know, and that's why I added "(exist in earlier release)" just after that line. I guess it was not clear, so thanks for explicitly mentioning it; I will rephrase it. – Amit Nov 19 '20 at 01:49
  • @DmytroChasovskyi, if you have already utilized the bulk API and are still facing issues, there are some more performance improvements you can make, which are mentioned in my other links. Did you get a chance to try them? – Amit Nov 19 '20 at 06:56
  • @ElasticsearchNinja, yes. Thx, you helped a lot! – Dmytro Chasovskyi Nov 19 '20 at 20:44
  • @DmytroChasovskyi, glad I was helpful :) – Amit Nov 20 '20 at 00:36