
SOLVED

In our case the problem was that no facet limit had been set for the SuggestRequestHandler (requestHandler name="/suggest"); it is now set to 10. Also, several requests were being made for each single suggest request issued by the application. Why this led to a peak only on the full hour is still not quite clear...
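
For reference, such a limit can be set in the handler defaults in solrconfig.xml. The snippet below is only a minimal sketch; the other parameters of our actual /suggest handler are left out:

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <int name="facet.limit">10</int>
  </lst>
</requestHandler>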

Thank you all for the tips and help - much appreciated!

Every full hour (12:00, 13:00, 14:00, ..., 20:00, 21:00, 22:00, 23:00) our Solr/Java process has a peak: the CPU usage of the Java process running Solr roughly triples, and the response time, which is usually a few milliseconds, goes up to 9 seconds. This always lasts 2-3 minutes and only happens if there is traffic on our site (a PHP application calls the Java/Solr backend). Crond was completely disabled, but the problem still occurs on every full hour. And basically I think we have tried almost every GC and memory combination (or maybe not?)

Does anyone have an idea why this happens? Here are some details:

  • System: 32 GB RAM, 24 cores (mostly shared with php-fpm, but the problem is the same even in a test with Solr running isolated)
  • Solr version 3.6 (on Jetty - temporarily also Glassfish)
  • OS: RHEL 5.7
  • Multicore setup (4 indexes with 2 cores each)

Used Handlers (solrconfig.xml):

<requestHandler name="standard" class="solr.SearchHandler" default="true">
<requestHandler name="dismax" class="solr.SearchHandler" >
<requestHandler name="/suggest" class="solr.SearchHandler">
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
<requestHandler name="/analysis/document" class="solr.DocumentAnalysisRequestHandler" />
<requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="PingRequestHandler">
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" >
<requestHandler name="/replication" class="solr.ReplicationHandler" >

(also tested without replication and ping)

Used filters:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />

Index size: ~100 MB (actually even a bit less)

Current Java Options:

JAVA_OPTS="-Xmx4096m -Xms4096m -XX:+UseGCOverheadLimit -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:MaxPermSize=128m -XX:+DisableExplicitGC -Dsun.rmi.dgc.server.gcInterval=300000 -Dsun.rmi.dgc.client.gcInterval=300000 -XX:NewRatio=1 -Xloggc:/shop/logs/live/solr/gc.log -verbose:gc -XX:+PrintGCDateStamps"

The same options, but with 1024 MB, 2048 MB, 8192 MB and 12 GB heap sizes, didn't help at all.

Another try:

JAVA_OPTS="-server -Xmx2048m -XX:MaxPermSize=128m -XX:+UseParNewGC     -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=256 -XX:CMSInitiatingOccupancyFraction=60 -XX:+DisableExplicitGC"

Another try:

JAVA_OPTS="-Xmx2048m -Xms2048m -XX:+UseGCOverheadLimit -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:MaxPermSize=128m -XX:+DisableExplicitGC -Djava.util.logging.config.file=/opt/solr-jetty/etc/jetty-logging.properties"

Here is an excerpt of the gc.log (around one of these full-hour problems):

2013-03-03T19:59:04.157-0300: 8087.754: [GC 3433559K->1788819K(3914560K), 0.0358190 secs]
2013-03-03T19:59:12.031-0300: 8095.628: [GC 3437075K->1792088K(3914560K), 0.0365830 secs]
2013-03-03T19:59:22.419-0300: 8106.016: [GC 3440344K->1803266K(3914560K), 0.0422040 secs]
2013-03-03T19:59:29.044-0300: 8112.641: [GC 3451522K->1815743K(3914560K), 0.0439870 secs]
2013-03-03T19:59:37.002-0300: 8120.599: [GC 3463999K->1821601K(3914560K), 0.0378990 secs]
2013-03-03T19:59:45.468-0300: 8129.065: [GC 3469857K->1822911K(3914560K), 0.0386720 secs]
2013-03-03T19:59:53.750-0300: 8137.347: [GC 3471167K->1829299K(3914560K), 0.0405040 secs]
2013-03-03T20:00:01.829-0300: 8145.426: [GC 3477555K->1832046K(3914560K), 0.0383070 secs]
2013-03-03T20:00:06.327-0300: 8149.924: [GC 3480302K->1831567K(3914560K), 0.0450550 secs]
2013-03-03T20:00:11.123-0300: 8154.719: [GC 3479823K->1843283K(3914560K), 0.0401710 secs]
2013-03-03T20:00:14.360-0300: 8157.957: [GC 3491539K->1854079K(3914560K), 0.0368560 secs]
2013-03-03T20:00:17.419-0300: 8161.015: [GC 3502335K->1855130K(3914560K), 0.0375530 secs]
2013-03-03T20:00:20.006-0300: 8163.603: [GC 3503386K->1861867K(3914560K), 0.0413470 secs]
2013-03-03T20:00:22.726-0300: 8166.323: [GC 3510123K->1870292K(3914560K), 0.0360600 secs]
2013-03-03T20:00:25.420-0300: 8169.017: [GC 3518548K->1872701K(3914560K), 0.0326970 secs]
2013-03-03T20:00:27.138-0300: 8170.735: [GC 3520957K->1873446K(3914560K), 0.0381430 secs]
2013-03-03T20:00:28.748-0300: 8172.345: [GC 3521702K->1889189K(3914560K), 0.0379160 secs]
2013-03-03T20:00:30.404-0300: 8174.001: [GC 3537445K->1887193K(3914560K), 0.0407670 secs]
2013-03-03T20:00:32.713-0300: 8176.309: [GC 3535449K->1892863K(3914560K), 0.0366880 secs]
2013-03-03T20:00:34.791-0300: 8178.388: [GC 3541119K->1899095K(3914560K), 0.0398270 secs]
2013-03-03T20:00:36.533-0300: 8180.129: [GC 3547351K->1910071K(3914560K), 0.0373960 secs]
2013-03-03T20:00:39.037-0300: 8182.634: [GC 3558327K->1904198K(3914560K), 0.0393020 secs]
2013-03-03T20:00:41.548-0300: 8185.144: [GC 3552454K->1912352K(3914560K), 0.0444060 secs]
2013-03-03T20:00:43.771-0300: 8187.368: [GC 3560608K->1919304K(3914560K), 0.0427220 secs]
2013-03-03T20:00:47.411-0300: 8191.008: [GC 3566354K->1918102K(3914560K), 0.0418150 secs]
2013-03-03T20:00:50.925-0300: 8194.522: [GC 3564290K->1930888K(3914560K), 0.0414700 secs]
2013-03-03T20:00:52.991-0300: 8196.588: [GC 3579144K->1933251K(3914560K), 0.0349600 secs]
2013-03-03T20:00:53.027-0300: 8196.624: [GC 1939697K(3914560K), 0.0256300 secs]
2013-03-03T20:00:54.208-0300: 8197.804: [GC 2780505K(3914560K), 0.1424860 secs]
2013-03-03T20:00:55.684-0300: 8199.281: [GC 3029503K->1389766K(3914560K), 0.0370380 secs]
2013-03-03T20:00:58.289-0300: 8201.886: [GC 2213458K->570843K(3914560K), 0.0413220 secs]
2013-03-03T20:01:00.672-0300: 8204.268: [GC 1962741K->319619K(3914560K), 0.0410840 secs]
2013-03-03T20:01:02.906-0300: 8206.503: [GC 1966833K->319605K(3914560K), 0.0453730 secs]
2013-03-03T20:01:06.861-0300: 8210.458: [GC 1967861K->330864K(3914560K), 0.0425570 secs]
2013-03-03T20:01:10.067-0300: 8213.664: [GC 1979120K->336541K(3914560K), 0.0479380 secs]
2013-03-03T20:01:12.587-0300: 8216.184: [GC 1984797K->343203K(3914560K), 0.0376810 secs]

Also, over roughly one day of logs there are only 2 entries greater than 1 second: grep -oP ", [1-9]..*?secs]$" /shop/logs/live/solr/gc.log returns , 1.1727270 secs] and , 1.0390840 secs]
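
To quickly check whether the longer pauses line up with the full-hour windows, something like this should work (assuming GNU grep, same log path as above):

grep -P "T[0-9]{2}:00:" /shop/logs/live/solr/gc.log | grep -oP ", [0-9.]+ secs\]$"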

Does anyone have an idea, or has anyone already seen this phenomenon with Solr/the JVM?

Georg Buske
  • Try to disable the last three request handlers in your list and see what happens. Also, how do you trigger the document analysis? – Itay Moav -Malimovka Mar 04 '13 at 02:01
  • Have you excluded GC activity? I see you are logging GC activity via `-Xloggc:/shop/logs/live/solr/gc.log`. If so, please include it in your question. – ericson Mar 04 '13 at 05:04
  • Is it possibly something else that runs every hour on the computer? Or a bot visiting every hour? Or your ISP throttling every hour? – Patashu Mar 04 '13 at 08:24
  • Maybe it's somehow related to Solr's autocommit/optimize feature? – Konstantin V. Salikhov Mar 04 '13 at 08:26
  • @GeorgBuske Are you using Tomcat? Which version? – fglez Mar 04 '13 at 09:49
  • @ItayMoav-Malimovka thanks, I will try later today and let you know – Georg Buske Mar 04 '13 at 10:40
  • @fglez no, actually Jetty (but we are considering giving Glassfish a try) – Georg Buske Mar 04 '13 at 10:57
  • @Patashu: Yes, I was thinking that - as a last resort I disabled crond on all related machines during this peak time - same problem :-/ We also checked the access logfiles (and strangely it only happens when there is traffic on the site - so such a bot would need different IP ranges and a good strategy, but it is still possible) – Georg Buske Mar 04 '13 at 10:58
  • Have the Solr server log incoming requests, just to be sure that no other host contacts you on every full hour. – phisch Mar 04 '13 at 11:00
  • @ericson: done (gc log excerpt from one of the servers now in the question) – Georg Buske Mar 04 '13 at 11:01
  • @KonstantinV.Salikhov: What do you mean exactly? How can I trace a problem regarding this? (We don't have a scheduled optimization every hour - but could it be specified somewhere?) – Georg Buske Mar 04 '13 at 11:04
  • @GeorgBuske check this link: http://stackoverflow.com/questions/2787591/solr-autocommit-and-autooptimize – Konstantin V. Salikhov Mar 04 '13 at 11:06
  • You can use the tools mentioned in [this post](http://stackoverflow.com/questions/930915/which-java-thread-is-hogging-the-cpu/932937#932937) to identify the problematic thread. – ericson Mar 04 '13 at 11:18
  • @ItayMoav-Malimovka even without the PingRequestHandler (disabled haproxy for this, too), the DumpRequestHandler and the ReplicationHandler (disabled master replication), the same issue :/ – Georg Buske Mar 04 '13 at 15:49
  • 1
    I have to say I didn't understand what is the solution to this problem. @GeorgBuske can you explain in more detail? – Cosimo Jul 24 '13 at 12:51

2 Answers


Don't believe your GC logs unless you include -XX:+PrintGCApplicationStoppedTime in the options. Suspect them even then. There are pauses and parts of pauses that can be very long and go unreported unless you include this flag. E.g. I've seen pauses caused by the occasional long-running counted loop taking 15 seconds to reach a safepoint, where GC reported only the 0.08-second part of the pause where it actually did some work. There are also plenty of pauses whose causes are not considered part of "GC" and can thereby go unreported by GC logging flags.
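
For example, adding the flag next to the GC logging options already used in the question (the "..." stands for the rest of the existing JAVA_OPTS):

JAVA_OPTS="... -Xloggc:/shop/logs/live/solr/gc.log -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"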

You can try adding jHiccup as an agent to report on observed pauses/glitches/stalls/hiccups rather than relying on the honesty of the JVM logs. If it shows multi-second glitches, then you'll know your JVM is pausing. If it shows smooth JVM operation, then you know to look at the other parts of your configuration.
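
One way to attach it is as a Java agent (the path below is just illustrative; see also the _JAVA_OPTIONS variant from the README mentioned in the comments below):

JAVA_OPTS="$JAVA_OPTS -javaagent:/path/to/jHiccup.jar"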

Gil Tene
  • I would like to give you an upvote as this sounds like a very nice way to debug further (which would have been one of the next steps)... :-) Anyway, one question about jHiccup: as we temporarily switched to Glassfish - how is it possible to put jHiccup in front of Glassfish - just by patching the bin command? – Georg Buske Mar 06 '13 at 02:01
  • There are notes in the README about various ways to insert jHiccup. I find that one of the easiest is to sneak it in with _JAVA_OPTIONS, as in: export _JAVA_OPTIONS='-javaagent:/path/to/jHiccup/bin/jHiccup.jar' – Gil Tene Mar 12 '13 at 03:39

If the index size is just 100 MB and the issue is related to GC, I would start by:

  1. reducing -Xmx to less than 1024m; start at about 256m and see if it's enough (see the example below)
  2. not using any -XX flags at the beginning
  3. using the latest JDK
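
A minimal starting point along those lines might look like this (the 256m heap is just an initial guess, and GC logging is kept so you can still see what happens):

JAVA_OPTS="-Xms256m -Xmx256m -verbose:gc -Xloggc:/shop/logs/live/solr/gc.log"
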
Persimmonium