
Our Elasticsearch cluster is a mess. The cluster health is permanently red, and I've decided to look into it and salvage it if possible, but I have no idea where to begin. Here is some info regarding our cluster:

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 91,
  "active_shards" : 91,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 201,
  "number_of_pending_tasks" : 0
}
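For reference, I pulled that with the cluster health API (host is just the default on my boxes):

curl -s 'http://localhost:9200/_cluster/health?pretty'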

The 6 nodes:

host               ip         heap.percent ram.percent load node.role master name
es04e.p.comp.net 10.0.22.63            30          22 0.00 d         m      es04e-es
es06e.p.comp.net 10.0.21.98            20          15 0.37 d         m      es06e-es
es08e.p.comp.net 10.0.23.198            9          44 0.07 d         *      es08e-es
es09e.p.comp.net 10.0.32.233           62          45 0.00 d         m      es09e-es
es05e.p.comp.net 10.0.65.140           18          14 0.00 d         m      es05e-es
es07e.p.comp.net 10.0.11.69            52          45 0.13 d         m      es07e-es
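(That table is the _cat nodes output, i.e. something like: curl -s 'http://localhost:9200/_cat/nodes?v')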

Straight away you can see I have a very large number of unassigned shards (201). I came across this answer and tried it and got 'acknowledged: true', but there was no change in either of the sets of info posted above.
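For completeness, this is how I've been listing the shards that are stuck unassigned (again assuming the defaults):

curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED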

Next I logged into one of the nodes, es04, and went through the log files. The first log file has a few lines that caught my attention:

[2015-05-21 19:44:51,561][WARN ][transport.netty          ] [es04e-es] exception caught on transport layer [[id: 0xbceea4eb]], closing connection

and

[2015-05-26 15:14:43,157][INFO ][cluster.service          ] [es04e-es] removed {[es03e-es][R8sz5RWNSoiJ2zm7oZV_xg][es03e.p.sojern.net][inet[/10.0.2.16:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:22:28,721][INFO ][cluster.service          ] [es04e-es] removed {[es02e-es][XZ5TErowQfqP40PbR-qTDg][es02e.p.sojern.net][inet[/10.0.2.229:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:32:00,448][INFO ][discovery.ec2            ] [es04e-es] master_left [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]], reason [shut_down]
[2015-05-26 15:32:00,449][WARN ][discovery.ec2            ] [es04e-es] master left (reason = shut_down), current nodes: {[es07e-es][etJN3eOySAydsIi15sqkSQ][es07e.p.sojern.net][inet[/10.0.2.69:9300]],[es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]],[es05e-es][ZoLnYvAdTcGIhbcFRI3H_A][es05e.p.sojern.net][inet[/10.0.1.140:9300]],[es08e-es][FPa4q07qRg-YA7hAztUj2w][es08e.p.sojern.net][inet[/10.0.2.198:9300]],[es09e-es][4q6eACbOQv-TgEG0-Bye6w][es09e.p.sojern.net][inet[/10.0.2.233:9300]],[es06e-es][zJ17K040Rmiyjf2F8kjIiQ][es06e.p.sojern.net][inet[/10.0.1.98:9300]],}
[2015-05-26 15:32:00,450][INFO ][cluster.service          ] [es04e-es] removed {[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]],}, reason: zen-disco-master_failed ([es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]])
[2015-05-26 15:32:36,741][INFO ][cluster.service          ] [es04e-es] new_master [es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]], reason: zen-disco-join (elected_as_master)

In this section I realized there were a few nodes (es01, es02, es03) which had been removed from the cluster.

After this, all the log files (around 30 of them) have only one line:

[2015-05-26 15:43:49,971][DEBUG][action.bulk              ] [es04e-es] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

I have checked all the nodes and they have the same versions of Elasticsearch and Logstash. I realize this is a big, complicated issue, but if anyone can figure out the problem and nudge me in the right direction it will be a HUGE help.
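This is roughly how I checked the versions, in case it matters (the h= column list is just what I chose):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version'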


2 Answers


I believe this might be because at some point you had a split-brain issue and there were two versions of the same shard in two clusters. One or both might have received different sets of data, so two diverging copies of the same shard came into existence. At some point you might have restarted the whole system, and some shards might have gone into the red state.

First, see if there is data loss; if there is, the case described above could be the reason. Next, make sure you set minimum_master_nodes to N/2+1 (N is the number of master-eligible nodes), so that this issue won't surface again.
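With your six master-eligible nodes that works out to 4. A rough sketch of setting it dynamically (host is a placeholder); you would also want discovery.zen.minimum_master_nodes: 4 in elasticsearch.yml so it survives full restarts:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 4 }
}'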

You can use the shard reroute API on the red shards and see if they move out of the red state. You might lose the shard data here, but that is the only way I have seen to bring the cluster state back to green.
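Something along these lines, per unassigned shard ("my-index", the shard number and the target node are placeholders you'd take from _cat/shards; allow_primary accepts the data loss):

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "allocate": { "index": "my-index", "shard": 0, "node": "es04e-es", "allow_primary": true } }
  ]
}'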

Vineeth Mohan

Please try installing the elasticsearch-head plugin to check the shard status; you will be able to see which shards are corrupted.
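Roughly like this on any node (paths assume a default install), then browse to http://<node>:9200/_plugin/head/ :

bin/plugin --install mobz/elasticsearch-head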

  • Try the flush or optimize APIs (see the commands sketched below).
  • Also, restarting Elasticsearch sometimes works.
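For example (no index names given, so these hit all indices; host is a placeholder):

curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_optimize'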
Amar Tari