One of the jobs in my Grails app stops running for no apparent reason. No exception is thrown. We are using Grails 2.2.3 and the Quartz2 plugin. Interestingly, all the other jobs keep running; only this one job freezes, time and again. The job calls a third-party REST API, which sometimes responds very slowly and in a few instances does not respond at all. All the jobs have concurrent = false. Can someone point me in the right direction? I have been struggling with this issue for two days. A few of the things I have tried:

  1. Changed/simplified the implementation of the task that the job processes. The job still makes the REST API call. The response times are sometimes very long (up to 20 minutes), and on rarer occasions we get a ConnectionTimeOut exception.
  2. Enabled Quartz logging. The job freezes and the log shows no error message.
  3. Installed the Grails Quartz monitor plugin. We made it inline and tweaked it to run with the Quartz2 plugin. It just shows the usual quartz/list.

I have not been able to resolve the issue yet and am running out of ideas. Has anyone come across such a situation and have some tips to share? Thanks.

NOTE: For now we have removed the slow call to the third-party REST API to see whether the jobs run fine for extended periods. My guess is that the server sometimes kills processes that take too long or time out regularly.

issprof
  • 1
    Some extra details would be helpful. With strategy 1, did you try no longer making your REST calls? If so, does the job still stop? Also, do you have concurrent = false set? If you're trying to do work that isn't completing, the job will not fire again in that case. Also, when you are using the monitor, is the job running and not getting triggered correctly or is it stopped outright? Does the Monitor plugin work with Quartz2? From the docs it only mentions the regular Quartz plugin. – derdc Feb 25 '14 at 19:02
  • I have edited the question to address derdc's queries. – issprof Feb 26 '14 at 07:07
  • It seems similar to http://stackoverflow.com/questions/618265/quartz-scheduler-suddenly-stop-running-and-no-exception-error – Puneet Behl Sep 29 '16 at 11:14

1 Answer

We have been able to solve this riddle. The problem was that API calls to one of the third-party servers got no response for up to 40-50 minutes, after which the server would time out and close the connection. We used multi-threading within each run of the job, and because of a buggy implementation we did not get true 'concurrent = false' behavior; in effect we had thousands of open connections to this third-party server, most of them receiving no response at all for 40-50 minutes. Our guess, and it is only a guess, is that after a while this causes this particular job/scheduler to freeze.
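To illustrate the multi-threading point, here is a minimal sketch (not our actual job code) of one way to keep the worker threads of a single run bounded and to wait for them before the run returns. The class name `BoundedJobRun`, the method `runAll`, and the pool size and deadline values are hypothetical and chosen only for illustration; the sketch relies on the standard `ExecutorService.invokeAll` call with a timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs the work of one job execution on a bounded pool.
class BoundedJobRun {

    /**
     * Runs all tasks on a fixed-size pool and blocks until every task has
     * finished or the deadline has passed. Returns how many tasks finished
     * before the deadline (normally or with an exception).
     */
    static int runAll(List<Callable<String>> tasks, int poolSize, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // invokeAll returns only when all tasks are done or the timeout
            // expires; tasks still pending at that point are cancelled.
            List<Future<String>> results =
                    pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
            int finished = 0;
            for (Future<String> f : results) {
                if (!f.isCancelled()) {
                    finished++;
                }
            }
            return finished;
        } finally {
            // Do not leave worker threads behind after the run ends.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<String>> tasks = new ArrayList<>();
        tasks.add(() -> "fast-1");
        tasks.add(() -> "fast-2");
        // Stand-in for a remote call that takes far too long.
        tasks.add(() -> { Thread.sleep(5000); return "slow"; });

        int finished = runAll(tasks, 2, 500);
        System.out.println(finished + " of " + tasks.size() + " tasks finished in time");
    }
}
```

Because the run blocks until the pool is done, a scheduler setting such as concurrent = false sees the true duration of the run. Note that cancelling a task interrupts its thread, which generally does not unblock a thread stuck in a blocking socket read, so a deadline like this complements time-outs on the connections rather than replacing them.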

We found two solutions to the problem:

  1. Set a shorter connection time-out and read time-out on our outgoing API requests. The connection time-out limits how long establishing the connection may take; the read time-out limits how long to wait for data once the connection is open. Here is the code we wrote:

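    URL url = new URL(urlString)
    HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection()
    // Give up if the connection cannot be established within 5 minutes
    httpURLConnection.setConnectTimeout(5 * 1000 * 60)
    // Give up if a read waits more than 8 minutes for data
    httpURLConnection.setReadTimeout(8 * 1000 * 60)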

  2. The second solution we tested successfully was to trigger the API calls from the Linux crontab utility by calling a URL (action) in our app. The cron entry hits a particular URL in our app, which in turn makes the API call to the third party, so in this case we do not depend on the Quartz scheduler/plugin at all. The only downside of this approach is that the REST API calls are triggered from outside our app's code-base: if we build a WAR of our app and deploy it on another machine, we have to configure the Linux crontab there as well.

We finally implemented the first solution (connection/read time-outs) because it keeps the fix within the code-base itself, which is not possible with the crontab utility.
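For anyone who wants to see the read time-out in action without a slow third-party server, here is a small self-contained sketch. It is an illustration rather than our production code: the class `TimeoutSketch`, the method `fetchStatus`, and the time-out values are made up for the example. It opens a local socket that accepts connections but never replies, so the connect succeeds and the read time-out is what ends the call.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

// Hypothetical demo class, not part of the original application.
class TimeoutSketch {

    /**
     * Requests the URL with explicit time-outs. Returns the HTTP status
     * code, or -1 if the connect or read time-out elapsed first.
     */
    static int fetchStatus(String urlString, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(urlString).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit for establishing the connection
        conn.setReadTimeout(readTimeoutMs);       // limit for each wait on response data
        try {
            return conn.getResponseCode();
        } catch (SocketTimeoutException e) {
            // Thrown for both kinds of time-out; the caller can retry or skip.
            return -1;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // A listening socket that never answers: the TCP connection is
        // established, but no HTTP response ever arrives.
        try (ServerSocket silent = new ServerSocket(0)) {
            String url = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            long start = System.currentTimeMillis();
            int status = fetchStatus(url, 1000, 500);
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("status=" + status + " after about " + elapsed + " ms");
        }
    }
}
```

With time-outs set, a call that would otherwise hang for as long as the remote side keeps the connection open returns control to the job after a bounded wait, which is the behavior we needed.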

Hope this helps someone or gives them pointers on where to look.

issprof