0

I have a java application that's mysteriously dying without any exceptions in the logs. I'm running it in the background via a bash script that wraps a nohup like below:

nohup java -Xms6g -Xmx6g -jar myapp.jar 2>> stderr.txt >> /dev/null & echo $! > /tmp/myapp-pid

The java application is quite memory intensive and so has been configured with 6GB of heap space (running on a 64 bit JVM). It runs fine for about 8 hours and then silently dies. No exceptions in the logs, nothing.

From the main method the app enters an infinite while loop, polls AWS SQS for messages and processes them. This is all wrapped in a try-catch and I am logging in the catch. The application seems to exit after it completes a while loop as it logs the last line. e.g. The application will always end with 'Successfully processed'.

while(true) {
    try {
        // Logic to poll SQS and process the message
    } catch (MyCustomException e) {
        // Write to SQS dead letter queue (was throwing at this point)
        // Delete message from original SQS
    } catch (Throwable e) {
        LOG.error(...);
    } finally { 
        LOG.info("Processing time was...");
    }
}

I'm not sure where to begin as I would've thought it would log something. Can anyone provide some pointers or maybe some JVM settings to configure so that I can start investigating?

I am wondering if things outside the code may be causing the error. Like perhaps a JVM crash?

Update It seems like this was indeed a programming error. I didn't think it was causing the issue so I hadn't added it to the code path above (just added it now) but I did have another catch clause catching a custom Exception that I had created. Within that catch I was attempting to move the SQS message to the dead letter queue but did not have permission to it and thus was throwing inside the catch which I wasn't handling.

Thanks for all those that helped in suggesting what may have gone wrong!

Community
  • 1
  • 1
n00b
  • 5,843
  • 11
  • 52
  • 82
  • 1
    Is your logger configured to log level ERROR - not just INFO ? – tbsalling Nov 03 '15 at 12:35
  • 1
    Two questions: Do you have a log just where the "while" starts? Any other pattern other than the 8 hours you mention? – skw Nov 03 '15 at 12:37
  • 1
    From the structure of your loop, I'm guessing you have at least one `break;` command somewhere within the loop...In that case, use an editor to search-and-replace your `break;` command(s) with something like `System.out.println("Breakng out of loop!");break;`, to confirm if you're exiting the loop through one of those breaks or not; and if you are, proceed to investigate which break is being mistriggered and why. --- Or add breakpoints in each `break;` command, and use a profiler to debug them. – CosmicGiant Nov 03 '15 at 12:39
  • @tbsalling - there are ERRORs being logged, but wouldn't INFO level include ERRORs also? – n00b Nov 03 '15 at 12:55
  • @skw - there is no log at the start of the while but it logs once it polls SQS which it's not doing. 8 hours is a rough guide, it's not always 8 hours but after a long period it'll die – n00b Nov 03 '15 at 12:56
  • @TheLima - there aren't any breaks or returns in the while loop. – n00b Nov 03 '15 at 12:56
  • So, huh, you ***want*** to get an error? Because otherwise, an infinite loop without exiting conditions will do that... – CosmicGiant Nov 03 '15 at 13:00
  • why do you discard the output of stdout to dev null ? Write it to a file. And you should perhaps create a headdump on out of memory. – Marged Nov 03 '15 at 13:05
  • @Marged - perhaps I'm mistaken but I thought I was discarding the output of nohup to dev null. The app uses logback to output to a log directory which it is doing. Without the >> /dev/null I'd get duplicate logs in a 'nohup.out' file and also the logs outputted by logback. – n00b Nov 03 '15 at 13:08
  • 1
    I have seen many situations when logfile were not updated because of buffering done by the logging framework. So as long as you are hunting down an error prefer having too many logs instead of incomplete ;-) You can tune this later on when your bug is fixed. And have a look at -XX:ErrorFile and -XX:HeapDumpPath – Marged Nov 03 '15 at 13:12

2 Answers2

1

Without having more code it is hard to say what actually happens. But per definition of finally it is executed always, which means also in case of failure. Maybe you are just missing the exception which is written before it. Try to move the finally call inside the 'try'-block.

while(true) {
    try {
        // Logic to poll SQS and process the message
        LOG.info("Successfully processed");
    } catch (Throwable e) {
        //As mentioned in the comments try for debugging to log on info level here as well.
        // Maybe error level is disabled (although this should be 
        //very unlikely since error normally is written too when info is written.
        LOG.info(...);
    } finally { 
        //Clean up.
    }
}

This are two ideas which may help you further investigate your issue.

Denis Lukenich
  • 3,084
  • 1
  • 20
  • 38
  • Sorry the logging was misleading. I don't actually log a success message in the finally. I log the processing time. I'm getting both INFO and ERROR level logs so that seems fine. – n00b Nov 03 '15 at 13:05
1

Don't your system run out of memory? Try running the application from wrapping script, logging the exit code - echo $! >&2 .

Also running dmesg could tell you if oom killer chose ypur application as a victim.

Zbynek Vyskovsky - kvr000
  • 18,186
  • 3
  • 35
  • 43