I'm benchmarking Elasticsearch for very high indexing throughput.
My current goal is to be able to index 3 billion (3,000,000,000) documents in a matter of hours.
For that purpose, I currently have 3 Windows Server machines, each with 16 GB of RAM and 8 processors.
The documents being inserted have a very simple mapping, containing only a handful of numerical, non-analyzed fields (`_all` is disabled).
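For reference, the mapping is roughly of this shape (a sketch only; the index, type and field names here are placeholders, not the real ones):

```
curl -XPUT 'http://localhost:9200/myindex/_mapping/doc' -d '{
  "doc": {
    "_all": { "enabled": false },
    "properties": {
      "field_a": { "type": "long" },
      "field_b": { "type": "integer" },
      "field_c": { "type": "double" }
    }
  }
}'
```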
I am able to reach roughly 120,000 index requests per second (monitored using BigDesk) on this relatively modest rig, and I'm confident that the throughput can be increased further. I'm using a number of .NET NEST clients to send the bulk index requests, with 1,500 index operations per bulk request.
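At the REST level, each NEST bulk call corresponds to something like the following (illustrative only; the index/type/field names and values are made up), except with 1,500 index actions per request:

```
curl -XPOST 'http://localhost:9200/myindex/doc/_bulk' --data-binary '{"index":{}}
{"field_a":123,"field_b":45,"field_c":6.7}
{"index":{}}
{"field_a":124,"field_b":46,"field_c":6.8}
'
```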
Unfortunately, the throughput of 120k requests per second does not last very long; the rate diminishes gradually, dropping to ~15k after a couple of hours.
Monitoring the machines reveals that the CPUs are not the bottleneck. However, idle time on the physical disks (not SSDs) keeps dropping on all machines, to less than 15% average idleness.
Setting refresh_interval to 60s, then to 300s, and finally to 15m didn't seem to help much.
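Those changes were applied as dynamic settings updates along these lines (index name is a placeholder); as far as I understand, a value of -1 would disable refresh entirely for the duration of the load:

```
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index": { "refresh_interval": "60s" }
}'
```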
Spying on a single translog in a single shard showed that the translog is flushed every 30 minutes, before reaching 200 MB.
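That looks like the 1.x translog defaults (a flush_threshold_period of 30m and a flush_threshold_size of 200mb). If it helps, these thresholds can apparently be raised with a dynamic settings update; the values below are only illustrative, not something I've settled on:

```
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index.translog.flush_threshold_size": "1gb",
  "index.translog.flush_threshold_period": "60m"
}'
```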
I have tried using two sharding strategies:
- 1 index, with 60 shards (no replicas).
- 3 indices, with 20 shards each (no replicas).
Both attempts resulted in a rather similar experience, which I guess makes sense since the total number of shards is the same.
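For completeness, the indices were created roughly like this (placeholder name; the second strategy is the same with number_of_shards set to 20, repeated for three indices):

```
curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "number_of_shards": 60,
    "number_of_replicas": 0
  }
}'
```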
Looking at the segments, I can see that most shards have ~30 committed segments and a similar number of searchable segments. Segment size varies. At one point, an attempt to optimize the index with max_num_segments=1 seemed to help a little after it finished (though it took a long while).
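The segment counts come from the segments API, and the one-off merge used the 1.x optimize endpoint (index name is a placeholder):

```
# per-shard segment listing
curl -XGET 'http://localhost:9200/myindex/_segments?pretty'

# one-off merge down to a single segment per shard
curl -XPOST 'http://localhost:9200/myindex/_optimize?max_num_segments=1'
```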
Restarting the whole ingestion process from scratch, after deleting the used indices and creating new ones, results in the same behavior: initially high index throughput that gradually diminishes, long before reaching the goal of 3 billion documents. The index size at that point is about 120 GB.
I'm using Elasticsearch 1.4. Xms and Xmx are configured to 8192 MB, 50% of the available memory. The indexing buffer is set to 30%.
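Concretely, the heap is set through the standard ES_HEAP_SIZE environment variable (8192m on each node), and the indexing buffer in elasticsearch.yml; a sketch of the relevant line:

```
# elasticsearch.yml on each node
indices.memory.index_buffer_size: 30%
```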
My questions are as follows:
- Assuming that the disk is currently the bottleneck of this rig, is this phenomenon of gradually increasing disk utilization normal? If not, what can be done to negate these effects?
- Is there any fine-tuning I can do to increase indexing throughput? Should I, or should I just scale out?