I have multiple threads running a natural language processing application.
Occasionally one of my threads dies with an OutOfMemoryError.
I have diagnosed the problem with a reasonable degree of confidence: as per the explanation here, garbage collection is taking up almost all of the CPU time.
As per my logs:
[INFO] 2016-07-30 13:43:53,442:
Total Memory: 7165968384 (6834.0 MiB)
Max Memory: 7635730432 (7282.0 MiB)
Free Memory: 2296592136 (2190.2009353637695 MiB)
FileioThread currentFile currentFileProcessTime: 4392
Text Analysis Thread currentFile currentFileProcessTime: 244443
PersistenceThread currentFile currentFileProcessTime: 1588
[INFO] 2016-07-30 13:43:53,442: ikoda.jobanalysis.JobAnalysisThread.run(JobAnalysisThread.java:400) JobAnalysisThread run Going to sleep for 30000 millis.
[INFO] 2016-07-30 13:45:40,717: Total Memory: 6028787712 (5749.5 MiB)
Max Memory: 7635730432 (7282.0 MiB)
Free Memory: 5045040128 (4811.3251953125 MiB)
fileioThread currentFile currentFileProcessTime: 73502
Ta Thread currentFile currentFileProcessTime: 351718
persistenceThread currentFile currentFileProcessTime: 38
Key Observations: There seems to be sufficient memory in the JVM. The Text Analysis Thread has spent over five minutes analyzing a single file (suggesting that either the file is very long or it contains some very long sentences). Most likely, garbage collection is running overtime cleaning up the multiple tokens created per word in each sentence of each paragraph of a text file.
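For reference, the memory figures in the logs above are the standard java.lang.Runtime statistics; a minimal sketch of how such a snapshot can be logged (my actual logging helper is not shown, so the class and method names here are illustrative):

    // Illustrative only: logs a JVM heap snapshot in the same units as the log excerpts above.
    public final class MemorySnapshot
    {
        private static final double MIB = 1024.0 * 1024.0;

        public static void logMemory()
        {
            Runtime rt = Runtime.getRuntime();
            long total = rt.totalMemory(); // heap currently reserved by the JVM
            long max   = rt.maxMemory();   // upper bound the heap may grow to (-Xmx)
            long free  = rt.freeMemory();  // unused portion of the currently reserved heap

            System.out.println("Total Memory: " + total + " (" + (total / MIB) + " MiB)");
            System.out.println("Max Memory: "   + max   + " (" + (max / MIB)   + " MiB)");
            System.out.println("Free Memory: "  + free  + " (" + (free / MIB)  + " MiB)");
        }
    }

Note that the real headroom is free + (max - total), since the heap can still grow up to Max Memory.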
Current Behaviour A random thread blows up and dies with an OutOfMemoryError (not necessarily the Text Analysis Thread). A monitor thread verifies that the dead thread really is dead, pauses for 30 seconds, then creates a new instance.
Everything then continues hunky-dory.
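The monitor logic is roughly along the following lines (a simplified sketch; the class names and the worker Runnable are placeholders, not my actual code):

    // Simplified sketch of the monitor thread: names are placeholders.
    class MonitorThread extends Thread
    {
        private Thread analysisThread;

        @Override
        public void run()
        {
            while (!isInterrupted())
            {
                try
                {
                    if (analysisThread == null || !analysisThread.isAlive())
                    {
                        // The worker died (e.g. an OutOfMemoryError escaped its run method);
                        // wait 30 seconds, then spin up a replacement instance.
                        Thread.sleep(30000);
                        analysisThread = new Thread(new TextAnalysisRunnable()); // placeholder worker
                        analysisThread.start();
                    }
                    Thread.sleep(30000);
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }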
Third Party Known Issue An important note here is that the text analysis thread uses third-party Stanford NLP packages. These packages do create a lot of objects (n× objects per word), so GC can be extreme at times, but it does recover.
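For context, the analysis thread drives the Stanford pipeline roughly as below (a minimal sketch of typical CoreNLP usage; the annotator list and class name are examples, not my exact configuration):

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    // Illustrative CoreNLP usage; the annotator list is an example, not my exact configuration.
    public class TextAnalyzer
    {
        private final StanfordCoreNLP pipeline;

        public TextAnalyzer()
        {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            pipeline = new StanfordCoreNLP(props);
        }

        public void analyze(String text)
        {
            // Every call creates token and sentence objects for each word in the text,
            // which is where the per-word object churn (and the GC pressure) comes from.
            Annotation document = new Annotation(text);
            pipeline.annotate(document);
            // ... the thread then reads sentences/tokens out of the annotation ...
        }
    }

A very long file passed to annotate() in one block therefore produces a correspondingly large burst of short-lived objects.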
Question Creating replacement thread instances appears benign. Also, since the program gathers big data, any data lost during thread re-instantiation is minor and harmless. Putting the program on a faster server will never fully solve the problem. However, pretending OutOfMemoryErrors didn't happen goes against my instincts and training. Can someone suggest some strategies for dealing with this?