Courier Fetch: shards failed

Question

Why do I get these warnings after adding more data to my Elasticsearch node? The warnings are different every time I browse the dashboard.

"Courier Fetch: 30 of 60 shards failed."

More details:

It's a single node running on CentOS 7.1.

/etc/elasticsearch/elasticsearch.yml

index.number_of_shards: 3
index.number_of_replicas: 1

bootstrap.mlockall: true

threadpool.bulk.queue_size: 1000
indices.fielddata.cache.size: 50%
threadpool.index.queue_size: 400
index.refresh_interval: 30s

index.number_of_shards: 5
index.number_of_replicas: 1

/usr/share/elasticsearch/bin/elasticsearch.in.sh

ES_HEAP_SIZE=3G

#I use this Garbage Collector instead of the default one.

JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"

cluster status

{
  "cluster_name" : "my_cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 61,
  "active_shards" : 61,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 61
}

cluster details

{
  "cluster_name" : "my_cluster",
  "nodes" : {
    "some weird number" : {
      "name" : "ES 1",
      "transport_address" : "inet[localhost/127.0.0.1:9300]",
      "host" : "some host",
      "ip" : "150.244.58.112",
      "version" : "1.4.4",
      "build" : "c88f77f",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 7854,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    }
  }
}

I'm curious about the "mlockall" : false, because in the yml file I did set bootstrap.mlockall: true

logs

Lots of lines like:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@a9a34f5

score 26 · Answer 1 · answered Sep 03 '15 at 14:21

For me tuning the threadpool search queue_size solved the issue. I tried a number of other things and this is the one that solved it.

I added this to my elasticsearch.yml

threadpool.search.queue_size: 10000

and then restarted elasticsearch.

Reasoning... (from the docs)

A node holds several thread pools in order to improve how threads memory consumption are managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

and for search in particular...

For count/search operations. Defaults to fixed with a size of int((# of available_processors * 3) / 2) + 1, queue_size of 1000.
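
To make the quoted formula concrete, here is a minimal sketch that evaluates it for a few processor counts. The function name is made up for illustration; only the formula itself comes from the docs excerpt above.

```python
def default_search_pool_size(available_processors: int) -> int:
    """Default size of the fixed search thread pool, per the formula quoted above."""
    return int((available_processors * 3) / 2) + 1


# Hypothetical processor counts, chosen only to illustrate the formula.
for cpus in (2, 4, 8):
    print(cpus, default_search_pool_size(cpus))
```

On a 4-processor machine this gives 7 search threads, so any shard-level search beyond those 7 has to wait in the queue, and anything beyond the queue capacity is rejected.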

For more information you can refer to the elasticsearch docs here...

I had trouble finding this information so I hope this helps others!

thanks it worked for me. The config key is thread_pool.search.queue_size not threadpool.search.queue_size — Arslan Mehboob, Sep 13 '17 at 09:40

score 9 · Answer 2 · answered Feb 22 '18 at 14:43

I got this error when my query was missing a closing quote:

field:"value

In my ElasticSearch logs I see these exceptions:

Caused by: org.elasticsearch.index.query.QueryShardException:
    Failed to parse query [field:"value]
...
Caused by: org.apache.lucene.queryparser.classic.ParseException: 
    Cannot parse 'field:"value': Lexical error at line 1, column 13.  
    Encountered: <EOF> after : "\"value"
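
If queries are built by a script or pasted in by hand, a cheap pre-check can catch this particular mistake before it reaches Elasticsearch. The sketch below only counts unescaped double quotes; it is a naive check written for illustration, not the Lucene query parser, and the function name is hypothetical.

```python
def has_unbalanced_quotes(query: str) -> bool:
    """Return True if the query contains an odd number of unescaped double quotes."""
    count = 0
    escaped = False
    for ch in query:
        if escaped:
            # The previous character was a backslash, so this one is literal.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            count += 1
    return count % 2 == 1


print(has_unbalanced_quotes('field:"value'))   # missing closing quote
print(has_unbalanced_quotes('field:"value"'))  # balanced
```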

answered Feb 22 '18 at 14:43 by spiffytech

Is this a question rather than an answer? Check out the Kibana query you are using, it appears that is not properly "quoted". `Failed to parse query [field:"value]`. Can you give more details? – Carlos Vega Feb 23 '18 at 15:10

This is an answer; this error can occur because of a bad query, not just queue_size etc. like other answers suggest. – spiffytech Feb 25 '18 at 02:01

exactly. any malformed query results in this warning printed on Kibana – asgs Jan 18 '19 at 13:09

score 7 · Answer 3 · answered Jun 12 '17 at 05:16

In Elasticsearch 5.4, thread_pool has an underscore in it.

thread_pool.search.queue_size: 10000

See the Elasticsearch Thread Pool module documentation.
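
For scripts that generate config for more than one Elasticsearch version, the two spellings can be selected by major version. This is a small sketch that assumes the rename applies from the 5.x series onward, as the answers and comments in this thread suggest; the helper name is hypothetical.

```python
def search_queue_size_key(major_version: int) -> str:
    """Pick the setting name for the search queue size by Elasticsearch major version."""
    # Assumption: 5.x and later use "thread_pool", earlier versions use "threadpool".
    prefix = "thread_pool" if major_version >= 5 else "threadpool"
    return f"{prefix}.search.queue_size"


print(search_queue_size_key(1))
print(search_queue_size_key(5))
```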

answered Jun 12 '17 at 05:16 by Todd Cooper

score 4 · Accepted Answer · answered May 05 '15 at 13:26

This is likely an indication that there's a problem with your cluster's health. Without knowing more about your cluster, there's not much more that can be said.

answered May 05 '15 at 13:26 by Alcanzar

I don't know which details of my cluster could be useful for solving this problem. Any ideas? It's just a single node. I'm going to add more details to the question. – Carlos Vega May 05 '15 at 15:53

you are going to need to show cluster status, memory allocated to the cluster, file descriptors available, OS, etc. Look in the elasticsearch log also to see if there's anything obvious there (like out of memory, too many open files, etc) – Alcanzar May 05 '15 at 15:57

I added more details. About those exceptions, maybe I need to increase some of the threadpools or something in the yml file. Thanks for your help. – Carlos Vega May 05 '15 at 16:07

based on what you've posted, your file descriptors should be fine (65535) – Alcanzar May 05 '15 at 16:10

also on a single node system, it's pointless to have replicas because the shards never get assigned, so you probably want to update your index settings to have 0 replicas (that's a setting you can change). Also you have `index.number_of_shards` in there twice which means the second value is going to be used (although it doesn't matter after an index is already created) – Alcanzar May 05 '15 at 16:14

ulimit -n says 1024. Oh thanks, I will leave the configuration as default and also update the index settings. – Carlos Vega May 05 '15 at 16:16

I'm curious about the "mlockall" : false because on the yml file I did write bootstrap.mlockall: true – Carlos Vega May 05 '15 at 16:23

Thanks, I solved it using this:

    #don't use all processors
    processors: 6
    threadpool:
      get:
        type: fixed
        size: 30
        queue_size: 3000
      search:
        type: fixed
        size: 30
        queue_size: 3000
    index.number_of_shards: 2
    index.number_of_replicas: 0

– Carlos Vega May 05 '15 at 16:51
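
The replica count mentioned in the comments can also be changed on existing indices through the index settings API, without editing the yml file. The sketch below only builds the request and does not send it; the base URL is an assumption, `_all` targets every index, and the helper name is hypothetical.

```python
import json
import urllib.request


def build_zero_replicas_request(base_url: str, index: str = "_all") -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets number_of_replicas to 0."""
    body = json.dumps({"index": {"number_of_replicas": 0}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed address of a local single-node cluster.
req = build_zero_replicas_request("http://localhost:9200")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(req)` would apply it. On a single node this should let the unassigned replica shards disappear from the cluster health output, since there is no second node to host them.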

score 1 · Answer 5 · answered Nov 25 '15 at 06:15

I agree with @Philip's opinion, but it's not necessary to restart Elasticsearch, at least on Elasticsearch >= 1.5.2, because you can set threadpool.search.queue_size dynamically.

curl -XPUT 'http://your_es:9200/_cluster/settings' -d '{
    "transient": {
        "threadpool.search.queue_size": 10000
    }
}'

answered Nov 25 '15 at 06:15 by Gary Gauh


With Elasticsearch >= version 5, it's not possible - https://discuss.elastic.co/t/transient-setting-threadpool-search-queue-size-not-dynamically-updateable/72576 . You have to use yaml config file. – Xdg Jun 01 '17 at 05:27

score 0 · Answer 6 · answered Feb 07 '18 at 12:04

From Elasticsearch version 5 onward, it's not possible to update thread_pool.search.queue_size through the _cluster/settings API. In my case, updating the node's yml file is not an option either: if a node fails, the auto-scaling code brings up a replacement ES node with the default yml settings.

I have a cluster of 3 nodes with 400 active primary shards and 7 active search threads for a queue size of 1000. Increasing the number of nodes to 5 with a similar config resolved the issue, since queries are now distributed horizontally across more available nodes.
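
A rough back-of-the-envelope model shows why adding nodes can help. It assumes the 7 threads and the queue of 1000 are per node, that every search touches every primary shard, and that shards are spread evenly; the figure of 10 concurrent searches is made up for illustration. Real clusters will behave less neatly.

```python
import math


def node_search_capacity(threads: int, queue_size: int) -> int:
    """Shard-level searches one node can run or hold in its queue before rejecting more."""
    return threads + queue_size


def shard_tasks_per_node(shards: int, concurrent_searches: int, nodes: int) -> int:
    """Shard-level tasks landing on each node if every search hits every shard."""
    return math.ceil(shards * concurrent_searches / nodes)


capacity = node_search_capacity(threads=7, queue_size=1000)
for nodes in (3, 5):
    # 10 concurrent searches is a hypothetical load, not a measured value.
    load = shard_tasks_per_node(shards=400, concurrent_searches=10, nodes=nodes)
    print(nodes, load, "rejections likely" if load > capacity else "fits")
```

Under these assumptions each of 3 nodes would receive 1334 shard-level tasks against a capacity of 1007, while each of 5 nodes would receive 800.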

score 0 · Answer 7 · answered May 14 '19 at 11:09

This does not work on Elasticsearch 5.6; the dynamic update is rejected:

{
    "error": {
        "root_cause": [
            {
                "type": "remote_transport_exception",
                "reason": "[colmbmiscxx.xx][172.29.xx.xx:9300][cluster:admin/settings/update]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "transient setting [threadpool.search.queue_size], not dynamically updateable"
    },
    "status": 400
}
