I have a fairly big application that recently underwent a major overhaul.

The newer version uses a lot of JSONP calls, and I am seeing 500 server errors. Nothing is written to the logs that would reveal the cause. It happens for JS files, PNG files and even Jersey (servlet) requests.

Searching SO and the groups suggested that these errors are common during deployment. But they occur even hours after deployment.

BTW, the application has become slightly bigger, and in a few rare cases it even causes deadline exceptions while starting instances. Sometimes an instance starts and serves within 6-10 secs. Sometimes it takes more than 75 secs, thereby causing that request to time out. I see the same behavior for warmup requests too. Nothing custom is loaded during app warmup.

cloudpre
  • More detail would help here. How about a snippet from your logs? Or at least tell us how you are determining there are 500 errors. Are you seeing it on the GAE dashboard? Do you have [AppStats](https://developers.google.com/appengine/docs/python/tools/appstats) turned on? That might help see what's slow. – Aaron D Mar 12 '13 at 22:37
  • 500 errors come up in the browser. The issue is that they are random and do not appear in the logs. I do not have appstats turned on as it is a production app. Jersey scans for providers and it's slow. But that's a different question. – cloudpre Mar 13 '13 at 01:51
  • How is this related to your other question? http://stackoverflow.com/questions/15297961/tweak-loading-times-of-jersey-over-appengine – Aaron D Mar 18 '13 at 18:30
  • Making it slower makes the app fail to load within the stipulated 60 secs which starts throwing 500 errors. – cloudpre Mar 20 '13 at 04:18

3 Answers

I feel like you should be seeing the errors in your logs. Are you exceeding quotas or having deadline errors? Perhaps you have an error in your error handler like your file cannot be found, or the path to the error handler overlaps with another static file route?

To troubleshoot, I would implement custom error pages so you could determine the actual error code. I'm assuming Python since you never specified what language you are using. Add the following to your app.yaml and create static html pages that will give the recipient some idea of what's going on and then report back with your findings:

error_handlers:
  - file: default_error.html
  - error_code: over_quota
    file: over_quota.html
  - error_code: dos_api_denial
    file: dos_api_denial.html
  - error_code: timeout
    file: timeout.html

If you already have custom error handlers, can you provide some of your app.yaml so we can help you?

Aaron D
  • That's the issue. It does not show up in the logs. We are not exceeding quotas (the app is set to $100/day and we use only 75-80% of it). We get deadline errors once in a while because we are using Jersey. It's not a 404 error because I know that the file exists. And it happens randomly, sometimes even for static files. I will put some error handlers in place so that we know. – cloudpre Mar 13 '13 at 09:05

Some 500s are not logged in your application logs. They are failures at the front end of GAE. If, for some reason, you have a spike in requests and new instances of your application cannot be started fast enough to serve those requests, your client may see 500s even though those 500s do not appear in your application's logs. The GAE team is working to provide visibility into those front-end logs.
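
One rough way to check whether the 500s are front-end failures is to tally the status codes a client actually receives and compare the counts with the application logs for the same period. The following is only a minimal sketch using the Python standard library; `fetch_status` and `tally_statuses` are hypothetical helper names, and the URL in the usage comment is a placeholder, not a real endpoint.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10):
    """Return the HTTP status code the client sees for a GET of url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still what the
        # client received, so record it instead of failing.
        return err.code


def tally_statuses(url, attempts, fetch=fetch_status):
    """Request url `attempts` times and count the status codes seen.

    `fetch` is injectable so the tally can be exercised without a network.
    """
    return Counter(fetch(url) for _ in range(attempts))


# Example usage (placeholder URL, substitute a static file from your app):
# print(tally_statuses("https://your-app-id.appspot.com/static/app.js", 50))
```

If the tally shows 500s that have no matching entries in the application logs, that would be consistent with the failure happening before the request reaches an instance. Connection-level failures (timeouts, DNS errors) are not caught here and will raise.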

Carter Maslan
  • I thought the front end caches the static files. Why can't the request be redirected to the already serving instances and a warmup sent to the new instance? – cloudpre Mar 15 '13 at 05:34
  • Yes, GAE does try to send requests to your already-running instances. And it starts new instances when it sees more requests than can be handled by your running instances. The unlogged 500 errors can happen when the spike in requests exceeds the rate at which new instances can be spun up. So if you're seeing this problem frequently, you may want to add reserve instances that are always available to absorb the spikes. – Carter Maslan Mar 16 '13 at 06:10
  • carter - I have more than 6 F4 (~4xF1) instances reserved for the app. The issue is that it fails randomly. – cloudpre Mar 20 '13 at 04:17
  • @cloudpre - the number and size of your reserved instances alone doesn't help in diagnosing, since we do not know your actual requests/sec load and response times. Is it impossible that your spikes exceed the capacity of those 6 reserved instances? – Carter Maslan Mar 31 '13 at 02:11
  • While it may exceed the capacity, why does it give 500 errors randomly? Why would one instance take 10 secs to load while others take more than 60 secs? – cloudpre Apr 03 '13 at 10:40
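
The capacity question raised in the comments above can be roughed out with Little's law: requests in flight ≈ arrival rate × average response time. This is only a back-of-the-envelope sketch; every number below is hypothetical, and the concurrent-requests-per-instance figure is an assumption that depends on the runtime and configuration.

```python
import math


def instances_needed(requests_per_sec, avg_latency_sec, concurrent_per_instance):
    """Estimate the instances required to absorb a given load.

    Little's law: requests in flight = arrival rate * average latency.
    """
    in_flight = requests_per_sec * avg_latency_sec
    return math.ceil(in_flight / concurrent_per_instance)


# Hypothetical figures, not measurements from the app in question.
steady = instances_needed(40, 0.5, 8)    # 20 in flight / 8 per instance -> 3
spike = instances_needed(200, 0.5, 8)    # 100 in flight / 8 per instance -> 13
print(steady, spike)
```

With these made-up numbers, six reserved instances cover the steady load but not the spike, and the shortfall has to be filled by instances that may take anywhere from a few seconds to over a minute to start.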

I just saw this myself. I was researching some logs of visitors who only loaded half of the graphics files on a page. I tried clicking on the same link on a blog that they used to get to our site. In my case, I saw a 500 error in the Chrome developer console for a JS file. Yet when I looked at the GAE logs, they said the file was served correctly with a 200 status. That JS file loads other images, which were not loaded. In my case, it was an HTTPS request.

It is really important for us to know our customer experience (obviously). I wanted to let you know that this problem is still occurring. Just having it show up in the logs would be great, even with a warm-up error or something similar attached to it, so we know it is an unavoidable artefact of a complex server system (totally understandable). I just need to know if I should be adding instances or doing something else. This error did not wait for 60 seconds, maybe 5 to 10 seconds. It is as if the round trip for the SSL handshake failed in the middle, but the logs showed it as a success.

So can I increase any timeout for the handshake or is that done on the browser side?