Elasticsearch failed to recover after crash

Question

I ran out of disk space, and that broke the Elasticsearch shards. Three nodes are now red; two have recovered and their state is yellow. ES is running at 150% CPU and using a lot of memory while trying to recover them, but it looks like there is a version type mismatch (see the log below).

I freed up disk space and deleted the translog for one shard to stop it from being replayed. But surprisingly, the translog gets created again!

How can I stop this attempt to recover from the translog and resume normal index operations? I do not want to delete the shard data.

[2014-10-31 03:11:43,742][WARN ][cluster.action.shard     ] [Angela Cairn] [western_europe][4] sending failed shard for [western_europe][4], node[x5M73qVXS5eZIBdz40boEg], [P], s[INITIALIZING], indexUUID [wy-tIJqdQiynz5SGQ2IrGA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[western_europe][4] failed to recover shard]; nested: ElasticsearchException[failed to read [tweet][527924645014818817]]; nested: ElasticsearchIllegalArgumentException[No version type match [101]]; ]]
[2014-10-31 03:11:43,742][WARN ][cluster.action.shard     ] [Angela Cairn] [western_europe][4] received shard failed for [western_europe][4], node[x5M73qVXS5eZIBdz40boEg], [P], s[INITIALIZING], indexUUID [wy-tIJqdQiynz5SGQ2IrGA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[western_europe][4] failed to recover shard]; nested: ElasticsearchException[failed to read [tweet][527924645014818817]]; nested: ElasticsearchIllegalArgumentException[No version type match [101]]; ]]
[2014-10-31 03:11:43,859][WARN ][indices.cluster          ] [Angela Cairn] [western_europe][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [western_europe][2] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:269)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.ElasticsearchException: failed to read [tweet][527936245440065536]
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:511)
    at org.elasticsearch.index.translog.TranslogStreams.readTranslogOperation(TranslogStreams.java:52)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:241)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:508)

I don't think you can recover without losing any data from the translog. Try deleting the translog from both shards specified in the log: shards 4 and 2 of index western_europe. — Andrei Stefan, Oct 31 '14 at 06:24

score 6 · Answer 1 · answered Aug 10 '15 at 02:52

First, check that there really are no issues with the shards themselves. cd to your /usr/share/elasticsearch/lib directory or equivalent, and use Lucene's CheckIndex like so:

java -cp "*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>/indices/<INDEX-NAME>/<SHARD-NUMBER>/index/

This will check a shard for problems, and will take a while if your shards are large.

Be aware that if you get the Java classpath wrong, some required jar files will be missing and CheckIndex may throw errors and wrongly claim all of the segments in the shard are broken, so read the output carefully.
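If several shards are affected, the check can be repeated per shard. The following is only a sketch: it assumes the directory layout shown above, and the function names and the indices_dir argument (standing for /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>/indices) are hypothetical. It runs CheckIndex in its default read-only mode, and I would run it with Elasticsearch stopped on that node, which is a precaution of mine rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and arguments are illustrative assumptions.

# Print the Lucene index directory of every shard of one index.
# $1 = indices directory of the node, $2 = index name
list_shard_index_dirs() {
    local indices_dir="$1" index_name="$2" dir
    for dir in "$indices_dir/$index_name"/*/index; do
        # Skip the literal pattern when the glob matches nothing.
        [ -d "$dir" ] && printf '%s\n' "$dir"
    done
    return 0
}

# Run CheckIndex (read-only) on each shard directory found above.
# Must be called from the Elasticsearch lib directory so that "*"
# on the classpath picks up the Lucene jars.
check_all_shards() {
    local indices_dir="$1" index_name="$2" dir
    list_shard_index_dirs "$indices_dir" "$index_name" | while IFS= read -r dir; do
        echo "== $dir"
        java -cp "*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex "$dir"
    done
}
```

After sourcing the file from the lib directory, something like check_all_shards "$INDICES_DIR" western_europe would print one CheckIndex report per shard, which still needs to be read carefully for classpath errors.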

If there are problems with a shard, and you have no other way to restore it, running the same command with the -fix argument will make the shard usable again by dropping the segments it cannot read, so you will lose the documents in those segments. CheckIndex will warn you how many documents (if any) you stand to lose from the shard.

If CheckIndex reports all is well with the shard, then hopefully your problem is only in the translog. The transaction log is a write-ahead log which Elasticsearch uses for durability. After a crash, ES will attempt to restore a shard, including writes which had not yet been flushed to the shard index itself. Those writes exist only in the translog, so you will lose them if you delete it. That, however, is much better than losing the shard. In your case, the translog already appears to be corrupt (the "No version type match" errors in your log are thrown while reading translog entries), and I don't know of any way to recover it.

To remove the corrupted transaction log being used for recovery, delete the translog files in /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>/indices/<INDEX-NAME>/<SHARD-NUMBER>/translog/ for each relevant shard on each affected node. Doing this on every node is important, because you may be seeing the cluster regenerate a shard's translog from another node after you delete it from one.
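If you would rather keep a copy of the files than delete them outright, a minimal sketch along these lines moves a shard's translog files into a backup directory. The function name, its arguments and the backup location are all hypothetical, and it assumes Elasticsearch is stopped on the node while the files are moved (my assumption, as a precaution).

```shell
#!/usr/bin/env bash
# Sketch only: moves translog files aside instead of deleting them.
# $1 = indices directory of the node (hypothetical shorthand for
#      /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>/indices)
# $2 = index name, $3 = shard number, $4 = backup directory
set_aside_translog() {
    local indices_dir="$1" index_name="$2" shard="$3" backup_dir="$4"
    local src="$indices_dir/$index_name/$shard/translog"
    local dest="$backup_dir/$index_name/$shard"
    local f moved=0

    [ -d "$src" ] || { echo "no translog directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1

    for f in "$src"/*; do
        [ -e "$f" ] || continue          # translog directory was empty
        mv -- "$f" "$dest"/ || return 1  # the shard's index/ directory is untouched
        moved=$((moved + 1))
    done
    echo "moved $moved file(s) from $src to $dest"
}
```

For the log in the question this would be called once for shard 4 and once for shard 2 of western_europe, on every node that holds a copy of those shards.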

The shards should then initialise correctly, although as usual that may take a while to complete.
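To watch the progress, one option is to poll the _cat/shards API and list only the shards that are not yet started. This is a sketch under assumptions: it presumes a node listening on localhost:9200, a version that has the _cat API (1.0 or later), and the default column order in which the state is the fourth column. The helper name is made up.

```shell
#!/usr/bin/env bash
# Sketch only: reads _cat/shards output on stdin and prints
# index, shard, primary/replica flag and state for every shard
# whose state (4th column by default) is not STARTED.
unstarted_shards() {
    awk 'NF >= 4 && $4 != "STARTED" { print $1, $2, $3, $4 }'
}

# Example usage (host and port are assumptions):
#   curl -s 'http://localhost:9200/_cat/shards/western_europe' | unstarted_shards
#   curl -s 'http://localhost:9200/_cluster/health?pretty'
```

An empty result from the first command, together with a green or yellow status from the second, would suggest the primaries have initialised.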

Thanks a LOT. This is a very straightforward checklist for beginners! — ivspenna, Jul 14 '17 at 17:09
