
We have ElasticSearch (1.5) on AWS (t2.micro, 2 instances with 10GB SSD storage each) and a MySQL database with ~450K fairly big/complex entities.

I'm using Python to read from MySQL, serialize to JSON, and PUT to ElasticSearch. There are 10 threads working simultaneously, each PUTting a bulk of 1,000 documents at a time.

In total there are ~450K documents (1.3GB), and it takes around 20 minutes to process them and send them to ElasticSearch.
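For reference, here is a simplified sketch of the indexer (assuming the official elasticsearch Python client; the index/type names, the endpoint, and the fetch_batches() helper that reads from MySQL are placeholders, not the real code):

# Simplified sketch only: index/type names, endpoint and fetch_batches() are placeholders.
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://xxxxx.es.amazonaws.com"])

def index_batch(docs):
    # docs: a list of up to 1000 dicts already read from MySQL and serialized
    actions = ({"_index": "entities", "_type": "entity", "_id": d["id"], "_source": d}
               for d in docs)
    return helpers.bulk(es, actions)

# 10 worker threads, each sending bulks of 1000 documents
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(index_batch, fetch_batches(batch_size=1000)))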

The problem is that only around 85% of them get indexed and the rest are lost. When I reduce the number of documents to ~100K, they all get indexed.

Looking at the AWS ElasticSearch monitoring I can see CPU going up to 100% while indexing, but it doesn't report any errors.

What is the best way to find the bottleneck here? I want it fast, but I can't afford to lose any documents.

EDIT: I've run it again, checking the output of /_cat/thread_pool?v every few minutes. It indexed 390,805 out of 441,400 documents. Output of thread_pool below:

host             ip            bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected 
<host> x.x.x.x           1         22            84            0           0              0             0            0               0 
<host> x.x.x.x           1         11            84            0           0              0             0            0               0 
<host> x.x.x.x           1         29            84            0           0              0             0            0               0 
<host> x.x.x.x           1         13            84            0           0              0             0            0               0 
<host> x.x.x.x           0          0            84            0           0              0             0            0               0 
<host> x.x.x.x           1         17            84            0           0              0             0            0               0 
<host> x.x.x.x           0          0            84            0           0              0             0            0               0 

EDIT 2

host             ip            bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected 
<host> x.x.x.x           0          0            84            0           0              0             0            0               0 

EDIT 3

$ curl https://xxxxx.es.amazonaws.com/_cat/thread_pool?v&h=id,host,ba,bs,bq,bqs,br,bl,bc,bmi,bma
[1] 15896
host             ip            bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected 
<host> x.x.x.x           0          0            84            0           0              0             0            0               0 

^^ copy/paste of what I'm getting back

EDIT 4

$ curl 'https://xxxxx.es.amazonaws.com/_cat/thread_pool?v&h=id,host,ba,bs,bq,bqs,br,bl,bc,bmi,bma'
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>

still nothing

EDIT 5

id   host             ba bs bq bqs br bmi bma bl br    bc 
n6Ad <host>  0  1  0  50 84   1   1  1 84 25821 

In some mysterious way it worked when I changed the order of the params.

Arek S
  • Please update your question with the output you get from: `curl localhost:9200/_cat/thread_pool?v` If you see that the `bulk.rejected` column is > 0 then some of your bulk calls were rejected because you were sending them faster than your server could process. – Val Oct 11 '16 at 07:58
  • Thanks for comment Val, please check my Edit^^ – Arek S Oct 11 '16 at 08:22
  • Please add the following headers `curl localhost:9200/_cat/thread_pool?v&h=id,host,ba,bs,bq,bqs,br,bl,bc,bmi,bma` – Val Oct 11 '16 at 08:24
  • Can you help me to interpret those values? – Arek S Oct 11 '16 at 08:38
  • You don't have the right columns in your second edit. try with quotes: `curl 'localhost:9200/_cat/thread_pool?v&h=id,host,ba,bs,bq,bqs,br,bl,bc,bmi,bma'` – Val Oct 11 '16 at 08:40
  • with quotes around the whole URL – Val Oct 11 '16 at 08:44
  • 84 bulk requests have been rejected, from what I can read, and that's also visible in your very first table, just that the alignment was not good. – Val Oct 11 '16 at 09:05
  • You could take a look at Kinesis Firehose for ingesting data into ES – Karl Laurentius Roos Oct 11 '16 at 12:38
  • Each document is 1.3GB? Or the entire collection (~450k) is 1.3GB? – Peter Dixon-Moses Oct 12 '16 at 14:48
  • @KarlLaurentiusRoos, when Kinesis feeds Amazon Elasticsearch, does it have the ability throttle indexing based on cluster load? Brief look at the documentation didn't seem to indicate this. (And Amazon probably assumes most people want to scale the downstream to handle upstream volume vs throttling the upstream.) – Peter Dixon-Moses Oct 14 '16 at 13:40
  • @PeterDixon-Moses to be honest, I have no idea. I ran some tests about half a year ago to prepare for a production scenario fairly similar to this, then we indexed about 200k documents in a few minutes which was about 10x the rate we were expecting. No issues at all and the ES cluster never became unresponsive. I think it might be a viable option for you. – Karl Laurentius Roos Oct 14 '16 at 16:12

1 Answer


I suspect EC2 is your bottleneck. Based on the way burstable instances are allocated CPU, a t2.micro accrues 6 CPU credits per hour.

So in your first hour up, your Elasticsearch nodes will each be able to run one vCPU at 100% for a maximum of 6 minutes before being "capped" at a much lower resource allocation (at or below the t2.micro baseline of 10% of a vCPU).

Elasticsearch is most likely to be CPU-bound during indexing. If the indexing process sends bulk requests faster than an instance can ingest them (because it is CPU-bound, whether within its baseline quota or after those first 6 minutes of bursting are used up), those requests will queue. Once the queue is saturated, Elasticsearch will begin rejecting requests.

This may help explain why you're able to index 100k documents without issue (under 6min?), while your full collection (~450k) encounters difficulty.
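Note that when this happens, the bulk request itself can still return HTTP 200; the rejections only show up as per-item errors inside the response body (typically status 429 with an EsRejectedExecutionException message), which may be why you're not seeing any errors reported. A rough sketch of checking for them, assuming a plain-HTTP indexer (bulk_body is a placeholder for the newline-delimited bulk payload):

# Sketch: detect per-item bulk rejections instead of relying on the HTTP status.
# bulk_body is a placeholder for the newline-delimited bulk payload.
import requests

resp = requests.post("https://xxxxx.es.amazonaws.com/_bulk", data=bulk_body).json()

if resp.get("errors"):
    failed = [item for action in resp["items"]
              for item in action.values()
              if item.get("status", 200) >= 400]
    # these documents were never indexed and need to be retried (ideally with backoff)
    print("%d of %d items failed" % (len(failed), len(resp["items"])))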


If your cluster is CPU-bound during indexing, and bulk requests are being rejected, you'll want to either:

  1. Increase the compute resources available to your cluster nodes during indexing

    OR

  2. Throttle your indexer so it keeps pace with your cluster's ingestion capacity.


You could build an indexer that is more resilient to smaller node types by running the thread_pool request above to check how many bulk requests are in the queue (perhaps as a percentage of the total queue size) before deciding to fire the next bulk indexing request.
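A rough sketch of that kind of backpressure check (the 50% threshold and polling interval are illustrative; the queue size of 50 matches the bqs value in your EDIT 5 output):

# Sketch: poll the bulk thread pool queue and wait for headroom before
# sending the next bulk request. Threshold and poll interval are illustrative.
import time
import requests

ES_URL = "https://xxxxx.es.amazonaws.com"

def wait_for_bulk_queue_headroom(queue_size=50, max_fill=0.5, poll_secs=2):
    while True:
        # bq = bulk.queue: bulk requests currently queued, one line per node
        out = requests.get(ES_URL + "/_cat/thread_pool?h=host,bq").text
        queued = [int(line.split()[-1]) for line in out.splitlines() if line.strip()]
        if max(queued or [0]) <= queue_size * max_fill:
            return
        time.sleep(poll_secs)

# before each bulk request (send_bulk/next_batch are placeholders):
#     wait_for_bulk_queue_headroom()
#     send_bulk(next_batch)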

Peter Dixon-Moses