Datastax 5.0.3 Solr node does not reindex just goes down

Question

I am trying to reindex one of my indexes deleting old data and full-index. When i try to do this. The node immediately goes down, i cannot run nodetool commands such as nodetool status, nodetool tpstats.Also, i cannot see CPU, disk, network. The node is idle. When i look at the processes on my machine, I can see that is cassandra process.

my schema.xml like that

<field indexed="true" name="key1"   stored="true" type="StrField"/>
<field indexed="true" name="key2"   stored="true" type="StrField"/>
<field indexed="true" name="key3"   stored="true" type="StrField"/>
<field indexed="true" name="key4"   stored="true" type="TrieIntField"/>
<field indexed="true" name="key5"   stored="true" type="StrField"/>
<field indexed="true" name="key6"   stored="true" type="TrieIntField"/>
<field indexed="true" name="key7"   stored="true" type="StrField"/>
<field indexed="true" name="key8"   stored="true" type="StrField"/>
<field indexed="true" name="key9"   stored="true" type="StrField"/>
<field indexed="true" name="key10"  stored="true" type="StrField"/>
<field indexed="true" name="key11"  stored="true" type="StrField"/>
<field indexed="true" name="key12"  stored="true" type="TrieDateField"/>
<field indexed="true" name="key13"  stored="true" type="StrField"/>
<field indexed="true" name="key14"  stored="true" type="StrField"/>
<field indexed="true" name="key15"  stored="true" type="StrField"/>
<field indexed="true" name="key16"  stored="true" type="StrField"/>
<field indexed="true" name="key17"  stored="true" type="StrField"/>
<field indexed="true" name="key18"  stored="true" type="StrField"/>
<field indexed="true" name="key19"  stored="true" type="StrField"/>
<field indexed="true" name="key20"  stored="true" type="TrieIntField"/>
<field indexed="true" name="key21"  stored="true" type="TrieIntField"/>
<field indexed="true" name="col1"   stored="true" type="TrieDoubleField"/>
<field indexed="true" name="col2"   stored="true" type="TrieDoubleField"/>
<field indexed="true" name="col3"   stored="true" type="TrieLongField"/>
<uniqueKey>(key1 ... key21)</uniqueKey>

java settings

<pre>
-javaagent:/usr/share/dse/cassandra/lib/jamm-0.3.0.jar
-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-Xms32G
-Xmx32G
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/disk1/cassandra/dumps/cassandra-1477648403-pid42715.hprof
-Xss512k
-XX:+AlwaysPreTouch
-XX:-UseBiasedLocking
-XX:StringTableSize=1000003
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:CompileCommandFile=/etc/dse//cassandra/hotspot_compiler
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=16 changed defaults settings
-XX:ConcGCThreads=16 changed defaults settings
-XX:InitiatingHeapOccupancyPercent=40
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc-1477648403.log
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M

The pause times when i started the re-index

2016-10-27T22:25:36.381+0200: 16133.424: Total time for which application threads were stopped: 28.6449951 seconds, Stopping threads took: 0.0000877 seconds
2016-10-27T22:25:36.396+0200: 16133.440: Total time for which application threads were stopped: 0.0152024 seconds, Stopping threads took: 0.0001349 seconds
2016-10-27T22:25:36.409+0200: 16133.452: Total time for which application threads were stopped: 0.0123116 seconds, Stopping threads took: 0.0000487 seconds
2016-10-27T22:25:50.745+0200: 16147.788: Total time for which application threads were stopped: 14.3359300 seconds, Stopping threads took: 0.0000616 seconds
2016-10-27T22:25:50.761+0200: 16147.804: Total time for which application threads were stopped: 0.0159109 seconds, Stopping threads took: 0.0001860 seconds
2016-10-27T22:25:50.774+0200: 16147.818: Total time for which application threads were stopped: 0.0136620 seconds, Stopping threads took: 0.0000423 seconds
2016-10-27T22:26:04.853+0200: 16161.897: Total time for which application threads were stopped: 14.0787608 seconds, Stopping threads took: 0.0000590 seconds
2016-10-27T22:26:04.871+0200: 16161.914: Total time for which application threads were stopped: 0.0171435 seconds, Stopping threads took: 0.0001669 seconds
2016-10-27T22:26:18.949+0200: 16175.993: Total time for which application threads were stopped: 14.0786946 seconds, Stopping threads took: 0.0000607 seconds
2016-10-27T22:26:18.968+0200: 16176.011: Total time for which application threads were stopped: 0.0184226 seconds, Stopping threads took: 0.0002338 seconds
2016-10-27T22:26:18.979+0200: 16176.022: Total time for which application threads were stopped: 0.0108393 seconds, Stopping threads took: 0.0000463 seconds
2016-10-27T22:26:33.210+0200: 16190.254: Total time for which application threads were stopped: 14.2314474 seconds, Stopping threads took: 0.0000724 seconds
2016-10-27T22:26:33.228+0200: 16190.272: Total time for which application threads were stopped: 0.0172137 seconds, Stopping threads took: 0.0000845 seconds
2016-10-27T22:26:33.243+0200: 16190.286: Total time for which application threads were stopped: 0.0147185 seconds, Stopping threads took: 0.0000770 seconds
2016-10-27T22:26:47.315+0200: 16204.358: Total time for which application threads were stopped: 14.0715678 seconds, Stopping threads took: 0.0001087 seconds
2016-10-27T22:26:47.335+0200: 16204.379: Total time for which application threads were stopped: 0.0203852 seconds, Stopping threads took: 0.0002385 seconds
2016-10-27T22:26:47.347+0200: 16204.391: Total time for which application threads were stopped: 0.0120962 seconds, Stopping threads took: 0.0001291 seconds
2016-10-27T22:27:01.590+0200: 16218.633: Total time for which application threads were stopped: 14.2425348 seconds, Stopping threads took: 0.0000658 seconds
2016-10-27T22:27:01.608+0200: 16218.652: Total time for which application threads were stopped: 0.0175375 seconds, Stopping threads took: 0.0002150 seconds
2016-10-27T22:27:01.622+0200: 16218.665: Total time for which application threads were stopped: 0.0134276 seconds, Stopping threads took: 0.0000305 seconds
2016-10-27T22:27:15.820+0200: 16232.863: Total time for which application threads were stopped: 14.1976605 seconds, Stopping threads took: 0.0000580 seconds
2016-10-27T22:27:15.839+0200: 16232.883: Total time for which application threads were stopped: 0.0181183 seconds, Stopping threads took: 0.0001094 seconds
2016-10-27T22:27:15.854+0200: 16232.898: Total time for which application threads were stopped: 0.0146620 seconds, Stopping threads took: 0.0000391 seconds
2016-10-27T22:27:29.911+0200: 16246.954: Total time for which application threads were stopped: 14.0568599 seconds, Stopping threads took: 0.0000689 seconds
2016-10-27T22:27:29.930+0200: 16246.973: Total time for which application threads were stopped: 0.0184539 seconds, Stopping threads took: 0.0002259 seconds
2016-10-27T22:27:29.942+0200: 16246.985: Total time for which application threads were stopped: 0.0120611 seconds, Stopping threads took: 0.0000643 seconds
2016-10-27T22:27:44.239+0200: 16261.282: Total time for which application threads were stopped: 14.2971941 seconds, Stopping threads took: 0.0000682 seconds
2016-10-27T22:27:44.256+0200: 16261.299: Total time for which application threads were stopped: 0.0165530 seconds, Stopping threads took: 0.0001018 seconds
2016-10-27T22:27:44.268+0200: 16261.312: Total time for which application threads were stopped: 0.0126950 seconds, Stopping threads took: 0.0000236 seconds
2016-10-27T22:27:58.407+0200: 16275.451: Total time for which application threads were stopped: 14.1388198 seconds, Stopping threads took: 0.0000495 seconds
2016-10-27T22:27:58.426+0200: 16275.470: Total time for which application threads were stopped: 0.0189899 seconds, Stopping threads took: 0.0001773 seconds
2016-10-27T22:27:58.437+0200: 16275.480: Total time for which application threads were stopped: 0.0104863 seconds, Stopping threads took: 0.0000825 seconds
2016-10-27T22:28:12.621+0200: 16289.664: Total time for which application threads were stopped: 14.1838416 seconds, Stopping threads took: 0.0000498 seconds
2016-10-27T22:28:12.638+0200: 16289.681: Total time for which application threads were stopped: 0.0167055 seconds, Stopping threads took: 0.0002064 seconds
2016-10-27T22:28:12.652+0200: 16289.696: Total time for which application threads were stopped: 0.0140258 seconds, Stopping threads took: 0.0000369 seconds

Here is my dstat result

 read  writ|run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |usr sys idl wai hiq siq| recv  send
   0  0.50 |1.0   0 0.4|39.4G 47.9M 22.0G 1384M|   0     0 |   0  7782B| 599  3866 |  3   0  97   0   0   0|2102B 1587B
   0  0.40 |1.0   0 0.1|39.4G 47.9M 22.0G 1384M|   0     0 |   0  2458B|3448  3917 |  3   0  97   0   0   0|1533B 1212B
   0  1.50 |1.1   0 0.3|39.4G 47.9M 22.0G 1387M|   0     0 |   0  8192B| 575  3888 |  3   0  97   0   0   0|1448B  993B
   0  0.80 |4.2   0   0|39.4G 47.9M 22.0G 1386M|   0     0 |   0  5325B|5308  4954 | 18   0  82   0   0   0|1631B 1109B
   0  0.20 |1.0   0 0.2|39.4G 47.9M 22.0G 1386M|   0     0 |   0  2048B| 792  4099 |  3   0  97   0   0   0|2016B 1670B
   0  0.90 |1.0   0   0|39.4G 47.9M 22.0G 1388M|   0     0 |   0  5325B| 988  3838 |  3   0  97   0   0   0|1697B 1150B
   0  0.20 |1.0   0 0.4|39.4G 47.9M 22.0G 1389M|   0     0 |   0  1638B| 624  3851 |  3   0  97   0   0   0|1559B 1126B
   0  0.60 |1.0   0 0.1|39.4G 47.9M 22.0G 1390M|   0     0 |   0  5734B| 698  3801 |  3   0  97   0   0   0|1654B 1091B
   0  0.60 |7.4   0   0|39.4G 47.9M 22.0G 1386M|   0     0 |   0  4506B|5171  5278 | 18   0  82   0   0   0|1512B 1402B
   0  0.50 |1.0   0   0|39.4G 47.9M 22.0G 1389M|   0     0 |   0  2048B| 604  3813 |  3   0  97   0   0   0|1669B 1064B
   0  0.70 |1.1   0 202|39.4G 47.9M 22.0G 1375M|   0     0 |   0  9011B|2527  5829 |  3   1  96   0   0   0|2682B 6059B
   0  0.50 |1.2   0 126|39.4G 47.9M 22.0G 1379M|   0     0 |   0    10k| 870  5047 |  3   0  96   0   0   0|2179B 3525B
   0  0.40 |4.1   0 0.3|39.4G 48.0M 22.0G 1380M|   0     0 |   0  2458B|6802  5359 | 18   0  82   0   0   0|1723B 1943B
   0  3.00 |1.0   0 0.1|39.4G 48.0M 22.0G 1384M|   0     0 |   0    19k| 594  3874 |  3   0  97   0   0   0|2028B 1434B
   0  0.20 |1.0   0   0|39.4G 48.0M 22.0G 1386M|   0     0 |   0  2867B|1622  3889 |  3   0  97   0   0   0|2070B 1449B
   0  0.60 |1.0 0.1   0|39.4G 48.0M 22.0G 1387M|   0     0 |   0  4506B| 604  3837 |  3   0  97   0   0   0|2043B 1369B

Please clarify, does the process go down or does it stay up and become unresponsive? — phact, Oct 27 '16 at 10:12
@phact i see the cassandra process when i run "top" command , but opscenter says its down, when i run command "nodetool status" on other nodes in my cluster, the result shows me that the node is down — Yılmaz, Oct 27 '16 at 10:22
In that case the process us up but unresponsive and has been marked down by the other nodes. Can you post a few lines of `dstat -rvn 10` — phact, Oct 27 '16 at 10:54
@phact 0.50 |4.1 0 0.8|38.6G 43.1M 21.8G 2398M| 0 0 | 0 5325B|1470 3853 | 5 0 95 0 0 0|1203B 1355B 81.5 15.3 |8.4 0 15|38.7G 43.2M 21.8G 2309M| 0 0 |1418k 1088k| 13k 11k| 27 0 72 0 0 0| 57k 7894B 0 0.20 |1.0 0 6.9|38.7G 43.2M 21.8G 2308M| 0 0 | 0 1229B| 716 4165 | 3 0 97 0 0 0|6736B 5127B 0 2.50 |1.0 0 0|38.7G 43.2M 21.8G 2307M| 0 0 | 0 13k|1377 3891 | 3 0 97 0 0 0|1331B 6440B — Yılmaz, Oct 27 '16 at 14:41
@phact when i look at the gc pauses i see huge pauses. I reduced gc pauses setting these parameters, XX:MaxGCPauseMillis=100", JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=32", -XX:ConcGCThreads=32" but still gc pauses my node 45 seconds sequentially. — Yılmaz, Oct 27 '16 at 14:44
@phact I tried to reduce MAX_HEAP_SIZE and HEAP_NEWSIZE. but result was the same. — Yılmaz, Oct 27 '16 at 14:46
Can you put it in the main question with formatting? This is hard to read... — phact, Oct 27 '16 at 14:50
@phact it is done. I sorry for that i didnt know where to put. — Yılmaz, Oct 27 '16 at 15:25
@phact my observation until now that it does not matter how much heap space i set such as 14GB, 32GB, 48GB, as soon as i run reindex command, the heap got filled and long gc pause time occurs. The bigger heap, the more pause time. — Yılmaz, Oct 27 '16 at 16:14
That machine is pretty idle, I don't think the processing was GCing when you ran dstat — phact, Oct 27 '16 at 18:08
@phact I can see that in GC log files. I am gonna put pause logs for you — Yılmaz, Oct 27 '16 at 20:25
What are the specs on your machines. I can almost guarantee the two were not taken at the same time. Reduce the number of indexers and set the heap to 30gb if you have them using G1GC. — phact, Oct 27 '16 at 22:51
Can you re run dstat when you start the node and this time use rvnt to include timestamps to correlate with the gc logs? — phact, Oct 28 '16 at 00:29
@phact the gc logs that i sent from after running re-index command. if i dont run reindex command. gc pause times does not take much time. — Yılmaz, Oct 28 '16 at 07:55
@phact this setting was max_solr_concurrency_per_core: 4, i set it max_solr_concurrency_per_core: 1, then i was able to perform indexing without going down. Then i set it max_solr_concurrency_per_core: 2, started re-indexing tho node goes down. Now i set it max_solr_concurrency_per_core: 1, i am able to perform indexing but indexing throughput too low. — Yılmaz, Oct 28 '16 at 09:28
@phact I have 32 core machine 2 threads per core. I have also raid 1 SSD for commitlogs, raid1 HDD for cassandra data, raid1 HDD for solrdata, 64 GB RAM — Yılmaz, Oct 28 '16 at 14:19
Wow, then you should be able to do much higher concurrency, and that explains your low CPU utilization. What about RAM? And how large are your indexes? — phact, Oct 28 '16 at 14:21
@phact I have 3 SOLR nodes, opscenter says that each one has 110GB data. — Yılmaz, Oct 28 '16 at 14:49
Okay, this sounds like a poorly tuned JVM. What are your exact settings? Post your env.sh — phact, Oct 28 '16 at 14:58
@phact actually i was using default settings. i just changed two settings. you can see above. — Yılmaz, Oct 28 '16 at 15:29
@phact indeed, i was able to reindex for another core in another node using default settings. — Yılmaz, Oct 28 '16 at 15:30
Try increasing gcpause millies to 2000. Are the cells / rows much bigger than on the other table? Can you post your schema.xml and solrconfig.xml — phact, Oct 28 '16 at 15:55
@phact i did the result is the same. By the way, yes i have bigger row than the core i was able to get it re-indexed. — Yılmaz, Oct 28 '16 at 21:20
How big? Can you share cfstats and cfhistograms and the schema.xml — phact, Oct 28 '16 at 21:40
@phact i have 1 million rows and i put some example xml in my question. — Yılmaz, Oct 28 '16 at 23:42

Datastax 5.0.3 Solr node does not reindex just goes down

0 Answers0

Linked