1

Currently experiencing an issue on a 10 node cluster, whereby after approx a day of running, 3 nodes will drop out (always a random 3).

Riak Version : 2.1.4

10 VM's running with 10GB Ram each, Running Oracle Linux version 7.3

Java Version :

[riak@pp2xria01trd001 riak$] java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Our usual Riak guy is on holiday at the moment, so don't have much resource to look into. Any help or guidance on where to possibly start looking would be greatly appreciated.

Crash dump details :

Slogan: Kernel pid terminated (application_controller) ({application_terminated,yokozuna,shutdown})System version: Erlang 
R16B02_basho10 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:64] [hipe] [kernel-poll:true] [frame-pointer]

Not much in the solr.log to detail why :

2017-04-06 21:04:13,958 [INFO] <qtp1924582348-828>@LogUpdateProcessorFactory.java:198 [marketblueprints_index] webapp=/internal_solr path=/update params={} {} 0 0
2017-04-06 21:04:18,567 [INFO] <qtp1924582348-855>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:23,573 [INFO] <qtp1924582348-1161>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:28,578 [INFO] <qtp1924582348-865>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:33,584 [INFO] <qtp1924582348-848>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:38,589 [INFO] <qtp1924582348-641>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:54,242 [INFO] <Thread-1>@Monitor.java:41 Yokozuna has exited - shutting down Solr
2017-04-06 21:04:55,219 [INFO] <Thread-2>@Server.java:320 Graceful shutdown SocketConnector@0.0.0.0:8093
2017-04-06 21:04:56,027 [INFO] <Thread-2>@Server.java:329 Graceful shutdown o.e.j.w.WebAppContext{/internal_solr,file:/var/lib/riak/yz_temp/solr-webapp/webapp/},/usr/lib64/
riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/webapps/solr.war
2017-04-06 21:04:59,288 [INFO] <Thread-2>@CoreContainer.java:314 Shutting down CoreContainer instance=1916575798
2017-04-06 21:04:59,710 [INFO] <Thread-2>@SolrCore.java:1040 [feed_mapping_index]  CLOSING SolrCore org.apache.solr.core.SolrCore@78acc5b

However, after some of the merge processes in the solr.log, we are getting the following (which I suspect is preventing the supervisor from re-starting it for the 2nd time, and hence stopping Riak

2017-04-06 21:05:13,546 [INFO] <Thread-2>@CachingDirectoryFactory.java:305 Closing directory: /var/lib/riak/yz/endpoint_mappings_index/data
2017-04-06 21:05:13,547 [INFO] <Thread-2>@CachingDirectoryFactory.java:236 looking to close /var/lib/riak/yz/endpoint_mappings_index/data/index [CachedDir<<refCount=0;path=
/var/lib/riak/yz/endpoint_mappings_index/data/index;done=false>>]
2017-04-06 21:05:13,547 [INFO] <Thread-2>@CachingDirectoryFactory.java:305 Closing directory: /var/lib/riak/yz/endpoint_mappings_index/data/index
2017-04-06 21:05:14,657 [INFO] <Thread-2>@ContextHandler.java:832 stopped o.e.j.w.WebAppContext{/internal_solr,file:/var/lib/riak/yz_temp/solr-webapp/webapp/},/usr/lib64/ri
ak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/webapps/solr.war
2017-04-06 21:05:15,298 [WARN] <Thread-2>@QueuedThreadPool.java:145 79 threads could not be stopped

Erlang.log contains :

2017-04-06 21:04:54.193 [error] <0.5934.108> gen_server yz_solr_proc terminated with reason: {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:
8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false
},[]}]}}
2017-04-06 21:04:54.198 [error] <0.5934.108> CRASH REPORT Process yz_solr_proc with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection
,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,h
ostname},100,1,{[],false},[]}]}} in gen_server:terminate/6 line 744
2017-04-06 21:04:54.201 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8
093, 8985) at <0.5934.108> exit with reason {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=j
son","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false},[]}]}} in context child_terminated
2017-04-06 21:04:57.422 [info] <0.1102.0>@riak_ensemble_peer:leading:631 {{kv,1141798154164767904846628775559596109106197299200,3,114179815416476790484662877555959610910619
7299200},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:04:57.422 [info] <0.1090.0>@riak_ensemble_peer:leading:631 {{kv,685078892498860742907977265335757665463718379520,3,6850788924988607429079772653357576654637183
79520},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:04:57.780 [info] <0.1072.0>@riak_ensemble_peer:leading:631 {{kv,0,3,0},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:01.432 [info] <0.8030.232>@yz_solr_proc:init:119 Starting solr: "/usr/bin/riak/java" ["-Djava.awt.headless=true","-Djetty.home=/usr/lib64/riak/lib/yokozuna
-2.1.7-0-g6cf80ad/priv/solr","-Djetty.temp=/var/lib/riak/yz_temp","-Djetty.port=8093","-Dsolr.solr.home=/var/lib/riak/yz","-DhostContext=/internal_solr","-cp","/usr/lib64/r
iak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/start.jar","-Dlog4j.configuration=file:///etc/riak/solr-log4j.properties","-Dyz.lib.dir=/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6c
f80ad/priv/java_lib","-d64","-Xms4g","-Xmx4g","-XX:+UseStringCache","-XX:+UseCompressedOops","-Dcom.sun.management.jmxremote.port=8985","-Dcom.sun.management.jmxremote.auth
enticate=false","-Dcom.sun.management.jmxremote.ssl=false","org.eclipse.jetty.start.Main"]
2017-04-06 21:05:01.483 [info] <0.1108.0>@riak_ensemble_peer:leading:631 {{kv,1370157784997721485815954530671515330927436759040,3,137015778499772148581595453067151533092743
6759040},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:02.032 [info] <0.8030.232>@yz_solr_proc:handle_info:184 solr stdout/err: OpenJDK 64-Bit Server VM warning: ignoring option UseSplitVerifier; support was re
moved in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option UseStringCache; support was removed in 8.0

2017-04-06 21:05:04.212 [info] <0.1110.0>@riak_ensemble_peer:leading:631 {{kv,1415829711164312202009819681693899175291684651008,3,0},'riak@pp2xria01trd001.pp2.williamhill.p
lc'}: Leading
2017-04-06 21:05:10.798 [info] <0.1096.0>@riak_ensemble_peer:leading:631 {{kv,913438523331814323877303020447676887284957839360,3,9134385233318143238773030204476768872849578
39360},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:17.001 [info] <0.8030.232>@yz_solr_proc:handle_info:184 solr stdout/err: Error: Exception thrown by the agent : java.rmi.server.ExportException: Port alrea
dy in use: 8985; nested exception is:
        java.net.BindException: Address already in use (Bind failed)

2017-04-06 21:05:17.964 [error] <0.8030.232> gen_server yz_solr_proc terminated with reason: {"solr OS process exited",1}
2017-04-06 21:05:17.964 [error] <0.8030.232> CRASH REPORT Process yz_solr_proc with 0 neighbours exited with reason: {"solr OS process exited",1} in gen_server:terminate/6 
line 744
2017-04-06 21:05:17.964 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8
093, 8985) at <0.8030.232> exit with reason {"solr OS process exited",1} in context child_terminated
2017-04-06 21:05:17.964 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8
093, 8985) at <0.8030.232> exit with reason reached_max_restart_intensity in context shutdown
2017-04-06 21:05:17.964 [error] <0.1119.0> Supervisor yz_sup had child yz_solr_sup started with yz_solr_sup:start_link() at <0.1150.0> exit with reason shutdown in context 
child_terminated
2017-04-06 21:05:17.964 [error] <0.1119.0> Supervisor yz_sup had child yz_solr_sup started with yz_solr_sup:start_link() at <0.1150.0> exit with reason reached_max_restart_
intensity in context shutdown
2017-04-06 21:05:23.072 [error] <0.1551.0> Supervisor yz_index_hashtree_sup had child ignored started with yz_index_hashtree:start_link() at undefined exit with reason kill
ed in context shutdown_error
2017-04-06 21:05:24.353 [info] <0.745.0>@yz_app:prep_stop:74 Stopping application yokozuna.
2017-04-06 21:05:27.582 [error] <0.745.0>@yz_app:prep_stop:82 Stopping application yokozuna - exit:{noproc,{gen_server,call,[yz_solrq_drain_mgr,{drain,[]},infinity]}}.
2017-04-06 21:05:27.582 [info] <0.745.0>@yz_app:stop:88 Stopped application yokozuna.
2017-04-06 21:05:27.940 [info] <0.7.0> Application yokozuna exited with reason: shutdown
2017-04-06 21:05:28.165 [info] <0.431.0>@riak_kv_app:prep_stop:228 Stopping application riak_kv - marked service down.
2017-04-06 21:05:28.252 [info] <0.431.0>@riak_kv_app:prep_stop:232 Unregistered pb services
2017-04-06 21:05:28.408 [info] <0.431.0>@riak_kv_app:prep_stop:237 unregistered webmachine routes
2017-04-06 21:05:28.459 [info] <0.431.0>@riak_kv_app:prep_stop:239 all active put FSMs completed
2017-04-06 21:05:29.665 [info] <0.540.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_hook) host stopping (<0.540.0>)
2017-04-06 21:05:29.665 [info] <0.539.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_hook) host stopping (<0.539.0>)
2017-04-06 21:05:30.379 [info] <0.532.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.532.0>)
2017-04-06 21:05:31.116 [info] <0.534.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.534.0>)
2017-04-06 21:05:31.362 [info] <0.533.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.533.0>)
2017-04-06 21:05:32.153 [info] <0.536.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.536.0>)
2017-04-06 21:05:32.245 [info] <0.537.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.537.0>)
2017-04-06 21:05:32.676 [info] <0.535.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.535.0>)
2017-04-06 21:05:33.450 [info] <0.431.0>@riak_kv_app:stop:250 Stopped  application riak_kv.
2017-04-06 21:05:41.701 [info] <0.195.0>@riak_core_app:stop:116 Stopped  application riak_core.
2017-04-06 21:05:43.061 [info] <0.93.0> alarm_handler: {clear,system_memory_high_watermark}

We have the extra options added to riak.conf

search = on
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms4g -Xmx4g -XX:+UseStringCache -XX:+UseCompressedOops
search.solr.port = 8093
search.solr.start_timeout = 180s

No sign of any OOM errors, or processes being killed by a oom_killer

J Fryer
  • 11
  • 1
  • It looks like you already have a process bound to port 8985 (JMX). You can use the netstat command to identify the process ID. Change the value of search.solr.jmx_port if needed. – vempo May 24 '17 at 07:15

0 Answers0