
I'm benchmarking Elasticsearch for very high indexing throughput.

My current goal is to be able to index 3 billion (3,000,000,000) documents in a matter of hours. For that purpose I currently have three Windows Server machines, each with 16GB of RAM and 8 processors. The documents being inserted have a very simple mapping containing only a handful of numerical, non-analyzed fields (_all is disabled).

I am able to reach roughly 120,000 index requests per second (monitored using BigDesk) with this relatively modest rig, and I'm confident that the throughput can be increased further. I'm using a number of .NET NEST clients to send the bulk index requests, with 1,500 index operations per bulk.
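For illustration, here is a minimal sketch of that bulk request shape. The real clients are .NET NEST; this uses the official Python client only to show the idea, and the index name, type, and field names are placeholders rather than the actual mapping:

```python
# Rough sketch of the bulk indexing loop: ~1,500 index operations per request.
# Index/type/field names are placeholders; the real clients are .NET NEST.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["node1:9200", "node2:9200", "node3:9200"])

docs = ((i, {"value": i, "ts": 1418083200 + i}) for i in range(1000000))

def actions(doc_stream):
    for doc_id, doc in doc_stream:
        yield {
            "_index": "metrics",   # placeholder index name
            "_type": "event",      # mapping types still exist in 1.x
            "_id": doc_id,         # explicit int64 id, as described above
            "_source": doc,        # a handful of numeric fields
        }

# chunk_size mirrors the 1,500 operations per bulk request mentioned above
helpers.bulk(es, actions(docs), chunk_size=1500)
```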

Unfortunately, the throughput of 120k requests per second does not last very long, and the rate diminishes gradually, dropping to ~15k after a couple of hours.

Monitoring the machines reveals that the CPUs are not the bottleneck. However, physical disk (not SSD) idle time seems to be dropping on all machines, reaching less than 15% average idleness.

Setting refresh_interval to 60s, then to 300s, and finally to 15m didn't seem to help much. Spying on a single translog in a single shard showed that the translog is flushed every 30 minutes, before reaching 200MB.
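For context, the refresh interval was changed via the index settings API; a rough sketch follows (the index name is a placeholder, and the commented translog settings are the 1.x defaults, which match the observed 30-minute/200MB flush behavior):

```python
# Sketch: relaxing the refresh interval on an existing index (name is a placeholder).
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])
es.indices.put_settings(index="metrics", body={
    "index": {
        "refresh_interval": "15m",
        # 1.x defaults, shown for reference only:
        # "translog.flush_threshold_size": "200mb",
        # "translog.flush_threshold_period": "30m",
    }
})
```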

I have tried two sharding strategies (index creation sketched below the list):

  1. 1 index, with 60 shards (no replicas).
  2. 3 indices, with 20 shards each (no replicas).
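
A minimal sketch of how strategy 1 was set up (the index name is assumed; strategy 2 is the same call repeated per index with number_of_shards set to 20):

```python
# Sketch for strategy 1: a single index with 60 shards and no replicas.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])
es.indices.create(index="metrics", body={
    "settings": {
        "number_of_shards": 60,
        "number_of_replicas": 0,
    }
})
```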

Both attempts result in a rather similar experience, which I guess makes sense since it's the same total number of shards.

Looking at the segments, I can see that most shards have ~30 committed segments and a similar number of searchable segments. Segment size varies. At one point, an attempt to optimize the index with max_num_segments=1 seemed to help a little after it finished (it took a long while).
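The optimize call in question, roughly (this is the 1.x API name; it was later renamed to force merge, and the index name is again a placeholder):

```python
# Sketch of the optimize call; it is blocking and very I/O-heavy on spinning disks.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])
es.indices.optimize(index="metrics", max_num_segments=1)
```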

Restarting the whole ingestion process from scratch at any time, after deleting the used indices and creating new ones, results in the same behavior: initially high index throughput that gradually diminishes, long before reaching the goal of 3 billion documents. The index size at that point is about 120GB.

I'm using Elasticsearch version 1.4. Xms and Xmx are configured to 8192MB, 50% of the available memory. The indexing buffer is set to 30%.

My questions are as follows:

  1. Assuming that the disk is currently the bottleneck of this rig, is this phenomenon of gradually increasing disk utilization normal? If not, what can be done to counter it?
  2. Is there any fine-tuning I can do to increase indexing throughput, or should I just scale out?
  • What's the process memory footprint over time? Does throughput stabilize at 15k/s or does it keep falling? What is going to/from disk? (On Linux, some of this is available with ps or top, some with strace.) – Andras Dec 09 '14 at 22:06
  • I don't remember the exact memory footprint, will update tomorrow. However, I do remember a rather healthy sawtooth heap graph. The rate of indexing seems to stabilize at 15k/s, but it would take hours to verify that. On every machine, the Elasticsearch service performs about 2 MB/s of writes (initially - it's much less when the rate fades), and, when the disk is busy, 50-80 MB/s of reads. – Roman Dec 09 '14 at 22:22
  • Are you specifying the keys for the documents or are you allowing Elasticsearch to generate IDs automatically? Have you tried using fewer shards? – Christian Dahlqvist Dec 09 '14 at 23:10
  • I'm specifying the keys, which are int64 IDs. I did not try fewer shards yet. I chose 50 million documents as the heuristically best fit for a shard of that type. I could try fewer shards; maybe it'll reduce IO. – Roman Dec 09 '14 at 23:14
  • If you are specifying the document IDs, each insert will in reality be an update attempt, which results in an index lookup as well as a write. This will generally get slower as the index size grows. If you can let Elasticsearch automatically generate the IDs, it will be able to write the documents directly without a lookup, resulting in less disk IO and improved indexing throughput. – Christian Dahlqvist Dec 10 '14 at 12:10
  • Another reason you are seeing increasing disk IO as indexing progresses is that Lucene will be merging segments in the background. This can consume a lot of disk IO, which is why nodes that perform a lot of indexing generally benefit from SSDs. If this is not an option, your best bet will most likely be to scale out. – Christian Dahlqvist Dec 10 '14 at 14:09
  • That's a great suggestion. Sadly, I must specify my own IDs. But, out of curiosity, I'll be sure to test performance using Elasticsearch's autogenerated IDs. I'll also test fewer shards before scaling out. The index-update issue should also be less of a problem with fewer shards, because then Lucene would be able to fit more segments (of a particular shard) in memory, wouldn't you agree? – Roman Dec 10 '14 at 22:38
  • How big are your documents on average? – John Petrone Dec 10 '14 at 23:59
  • Only a few numerical fields, no nesting, no objects. 100 - 200 bytes. – Roman Dec 11 '14 at 06:06
  • @Christian Dahlqvist, thanks for the tip about what happens when you specify ids. It seems like the current implementation goes beyond what is necessary, and that indexing with ids could be done much more efficiently. Is this understanding mistaken? – Eric Walker Sep 01 '15 at 20:20

1 Answer


Long story short, I ended up with 5 virtual Linux machines (8 CPUs, 16GB RAM each), using Puppet to deploy Elasticsearch. My documents got a little bigger, but so did the throughput rate (slightly). I was able to reach 150k index requests per second on average, indexing 1 billion documents in 2 hours. Throughput is not constant, and I observed similar diminishing-throughput behavior as before, but to a lesser extent. Since I will be using daily indices for the same amount of data, I would expect these performance metrics to be roughly similar every day.
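Since the daily indices need identical settings every day, an index template keeps that repeatable. A rough sketch under assumed values (the metrics-YYYY.MM.DD naming pattern and the shard count are placeholders, not my actual configuration):

```python
# Sketch: a template so each daily index is created with the same settings.
# The "metrics-*" pattern and shard count are assumptions, not the real values.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])
es.indices.put_template(name="metrics_daily", body={
    "template": "metrics-*",   # 1.x template pattern key
    "settings": {
        "number_of_shards": 20,
        "number_of_replicas": 0,
    },
})
```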

The transition from Windows machines to Linux was primarily due to convenience and compliance with IT conventions. Though I don't know for sure, I suspect the same results could be achieved on Windows as well.

In several of my trials I attempted indexing without specifying document IDs, as Christian Dahlqvist suggested. The results were astonishing. I observed a significant throughput increase, reaching 300k and higher in some cases. The conclusion is obvious: do not specify document IDs unless you absolutely have to.
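For comparison with the bulk sketch in the question, the auto-ID variant simply omits _id from each bulk action (index and type names are still placeholders):

```python
# Sketch: omitting "_id" lets Elasticsearch auto-generate ids, so each operation
# is a pure append instead of an update-or-insert that must first look up the id.
def actions(doc_stream):
    for doc in doc_stream:
        yield {
            "_index": "metrics",   # placeholder
            "_type": "event",
            "_source": doc,        # no "_id" field
        }
```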

Also, I'm using fewer shards per machine, which also contributed to the throughput increase.

  • Thanks for sharing your research, @Roman. I think it's worth a re-check with 2.0 versions, since according to the updated "Performance Considerations for Elasticsearch 2.0 Indexing", the auto-ID optimization was removed in 2.0. https://www.elastic.co/blog/performance-indexing-2-0 – mork Jan 14 '16 at 09:13
  • Though your choice of doc ID can still affect performance: http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html (this is quoted in the article above) – JCoster22 Jul 23 '16 at 13:12