
I have two services. One is hosted on Google App Engine and the other on Cloud Run.

I use urlfetch (Python 2), imported from google.appengine.api, in GAE to call APIs provided by the Cloud Run service.
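
For context, the call looks roughly like this (a minimal sketch with a hypothetical URL, not the exact production code):

from google.appengine.api import urlfetch

try:
    # Hypothetical Cloud Run endpoint behind the Load Balancer.
    result = urlfetch.fetch('https://example-gateway.run.app/api/resource',
                            deadline=5)  # urlfetch's default deadline is 5 seconds
except urlfetch.Error:
    # The DeadlineExceededError shown in the logs surfaces here when the
    # deadline is hit before a response arrives.
    raise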

Occasionally a few (fewer than 10 per week) DeadlineExceededError errors show up like this:

Deadline exceeded while waiting for HTTP response from URL

But over the past few days this error has suddenly been occurring frequently (~40 per day). Not sure if it is due to Christmas peak traffic or something else.

I've checked the Cloud Run Load Balancer logs, and it turned out the requests never reached the Load Balancer.

Has anyone encountered a similar issue before? Is anything wrong with GAE urlfetch?

I found a similar conversation, but the suggestion there was just to handle the error...

Wondering what I can do to mitigate the issue. Many thanks.


Update 1

Checked again and found that some requests from App Engine did show up in the Cloud Run Load Balancer logs, but the timing is weird:

e.g.

Logs from GAE project

10:36:24.706 send request
10:36:29.648 deadline exceeded

Logs from Cloud Run project

10:36:35.742 reached load balancer
10:36:49.289 finished processing

Not sure why it took so long for the request to reach the Load Balancer...


Update 2

I am using GAE Standard located in US with the following settings:

runtime: python27
api_version: 1
threadsafe: true

automatic_scaling:
  max_pending_latency: 5s

inbound_services:
- warmup
- channel_presence

builtins:
- appstats: on
- remote_api: on
- deferred: on

...

The Cloud Run-hosted API gateway I am trying to call is located in Asia. In front of it there is a Google Cloud Load Balancer of type HTTP(S) (classic).


Update 3

I wrote a simple script to periodically call the Cloud Run endpoint directly using axios (with its timeout set to 5s). After a while, some requests timed out. I checked the logs in my Cloud Run project and found two different phenomena:

For request A, much like what I mentioned in Update 1, logs were found for both the Load Balancer and the Cloud Run revision.

The time of the CR revision log minus the time of the LB log was > 5s, so I think this is an expected timeout (the request simply took longer than the 5s client timeout).

But for request B, no logs were found at all.

So I guess the problem is with neither urlfetch nor GAE?
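
(The original probe used axios; for reference, here is a rough Python equivalent of the same idea, with a hypothetical endpoint and an arbitrary probe interval.)

import time
import requests

URL = 'https://example-gateway.run.app/health'  # hypothetical endpoint

while True:
    try:
        resp = requests.get(URL, timeout=5)  # same 5 s client timeout as the axios script
        print(resp.status_code, resp.elapsed.total_seconds())
    except requests.exceptions.Timeout:
        # These are the timeouts described above (requests A and B).
        print('timed out')
    time.sleep(10)  # arbitrary probe interval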

anniex
  • Hi @anniex Were you able to solve your issue with the answer I provided below? – Priyashree Bhadra Jan 04 '22 at 06:13
  • @PriyashreeBhadra no luck. Instead of setting the default timeout, I submitted deadline=10 per request. The issue is still there unfortunately. I suspect the cause isn't the deadline because I cannot even see the request in Cloud Run Load Balancer logs. – anniex Jan 05 '22 at 07:14
  • I see the update in your question. Now you are able to see the requests landing in the Load Balancer but you are facing latencies, right? I am just guessing some pointers for you, please check them one by one. 1) App Engine standard environment scaling settings might cause latency if set too aggressively. Don't use App Engine standard environment basic scaling for latency-sensitive applications. Scale up before migrating traffic. Newer instances may not have warmed up local caches and hence may serve more slowly than older instances. Set a minimum number of instances. – Priyashree Bhadra Jan 05 '22 at 13:05
  • The load balancer will stop routing requests to instances that fail health checks. This might increase load on other instances, potentially resulting in a cascading failure. The App Engine flexible environment Nginx logs show instances that fail health checks. Analyze your logs to determine why the instance went unhealthy. Note that there will be a short delay before the load balancer stops routing traffic to an unhealthy instance. This delay might cause an error spike if the [load balancer cannot retry requests](https://cloud.google.com/load-balancing/docs/https#timeouts_and_retries) – Priyashree Bhadra Jan 05 '22 at 13:08
  • Check your HTTP keepalive timeout, whose value is fixed at 10 minutes (600 seconds). This value is not configurable by modifying your backend service. You must configure the web server software used by your backends so that its keepalive timeout is longer than 600 seconds to prevent connections from being closed prematurely by the backend. Also check your backend service timeout. Implement retry logic. P.S- these are some common pointers I thought of telling you. Please elaborate on your architecture set up so that I can do a deep analysis. Thanks and have a great day ahead! – Priyashree Bhadra Jan 05 '22 at 13:16
  • @PriyashreeBhadra thanks for the insight. For the HTTP keepalive timeout, do you mean I should check the web server software used in my Cloud Run service which hosts the API gateway? If so, those microservices use a Node.js server whose keepAliveTimeout is set to 620000. – anniex Jan 06 '22 at 03:21
  • I suspect the issue here is mixing sync and async code. NDB's async tasklets are no longer recommended. They were created when App Engine only supported a single thread, providing a relatively simple way to write asynchronous code under this limitation. See the docs here: https://cloud.google.com/appengine/docs/standard/python/ndb/async My recommendation would be to simply use threads instead of tasklets. Each thread can then use synchronous operations and should work as expected. – Priyashree Bhadra Jan 09 '22 at 08:08
  • And if you want to continue with the current solution, I suggest you look through the code for any synchronous operations taking place in the tasklets, and replace these with asynchronous operations. – Priyashree Bhadra Jan 09 '22 at 08:13
  • @PriyashreeBhadra can you provide a simple example of using threads instead of tasklets? – anniex Jan 10 '22 at 02:58
  • @PriyashreeBhadra I encountered timeout errors with the simple axios call to Cloud Run as well - does that mean the issue might not be related to ndb urlfetch? FYI the journey should be `client (ndb urlfetch / axios) -> cloudflare -> google LB -> krakend (cloud run) -> service (cloud run)` – anniex Jan 10 '22 at 04:42
  • A guide about migrating from [AppEngine's NDB to Cloud NDB](https://codelabs.developers.google.com/codelabs/cloud-gae-python-migrate-2-cloudndb#0), an example of using [threads in urlfetches with urllib2](https://stackoverflow.com/a/16182076/15803365), another alternative with [threaded worker queues](https://www.titanwolf.org/Network/q/afe2a0ad-8b51-492a-adfc-0970d18b8a8c/y) and finally the link I suggested in comment 7 leads to code found in https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/appengine/standard/ndb/async. You can add threads to solve the issue. – Priyashree Bhadra Jan 11 '22 at 05:20
  • Also, if you are getting timeout errors with simple axios calls, I recommend you check the [request timeout setting](https://cloud.google.com/run/docs/configuring/request-timeout) once again. The timeout is set by default to 5 minutes and can be extended up to 60 minutes. Change it to 30 minutes, maybe, and see if it runs without timing out, then slowly decrease the value from 30 to find the point where the timeout actually happens and the app hangs because the response is never returned. Check your language framework to see whether it has its own request timeout setting that you must also update. – Priyashree Bhadra Jan 11 '22 at 05:26
  • Please provide with a Minimal, Reproducible Example. Also, as its an intermittent issue, suggest to file a [Google Cloud support case](https://console.cloud.google.com/support) – Gourav B Jan 27 '22 at 13:59
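
For reference, a minimal sketch of the "threads instead of tasklets" suggestion discussed in the comments above (hypothetical URLs; each request-scoped thread makes a plain synchronous urlfetch call, with no NDB tasklet machinery involved):

import threading
from google.appengine.api import urlfetch

def fetch_one(url, results, index):
    # Plain synchronous fetch inside the thread, with a per-request deadline.
    results[index] = urlfetch.fetch(url, deadline=10)

urls = ['https://example-gateway.run.app/a',   # hypothetical endpoints
        'https://example-gateway.run.app/b']
results = [None] * len(urls)
threads = [threading.Thread(target=fetch_one, args=(u, results, i))
           for i, u in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # threads must finish before the handler returns on GAE standard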

1 Answer


Deadline exceeded while waiting for HTTP response from URL is actually a DeadlineExceededError. The URL was not fetched because the deadline was exceeded. This can occur with either the client-supplied deadline (which you would need to change), or the system default if the client does not supply a deadline parameter.

When you make an HTTP request, App Engine maps this request to URLFetch. URLFetch has its own deadline, which is configurable. See the URLFetch documentation.

You can set a deadline for each URLFetch request. By default, the deadline for a fetch is 5 seconds. You can change this default by:

Including the following appengine.api.urlfetch.defaultDeadline setting in your appengine-web.xml configuration file (this applies to the Java runtime; specify the timeout in seconds):

<system-properties>
    <property name="appengine.api.urlfetch.defaultDeadline" value="10"/>
</system-properties>

You can also adjust the default deadline by using the urlfetch.set_default_fetch_deadline() function. This function stores the new default deadline on a thread-local variable, so it must be set for each request, for example, in a custom middleware.

from google.appengine.api import urlfetch 
urlfetch.set_default_fetch_deadline(45)
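
For example, a minimal WSGI middleware sketch (hypothetical class name) that applies the default deadline on every request:

from google.appengine.api import urlfetch

class UrlfetchDeadlineMiddleware(object):
    """Sets the thread-local urlfetch default deadline for each request."""

    def __init__(self, app, deadline=45):
        self.app = app
        self.deadline = deadline

    def __call__(self, environ, start_response):
        urlfetch.set_default_fetch_deadline(self.deadline)
        return self.app(environ, start_response)

# e.g. wrap your WSGI app:
# app = UrlfetchDeadlineMiddleware(app, deadline=45)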

If your Cloud Run service is processing long requests, you can increase the request timeout. If your service doesn't return a response within the time specified, the request ends and the service returns an HTTP 504 error. Update the timeoutSeconds attribute in the service's YAML file:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: SERVICE
spec:
  template:
    spec:
      containers:
      - image: IMAGE
      timeoutSeconds: VALUE

OR

You can update the request timeout for a given revision at any time by using the following command:

gcloud run services update [SERVICE] --timeout=[TIMEOUT]

If requests are terminating earlier with error code 503, you might need to update the request timeout setting for your language framework: Node.js developers might need to update the server.timeout property via server.setTimeout (use server.setTimeout(0) to achieve an unlimited timeout), depending on the version you are using. Python developers need to update Gunicorn's default timeout.
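
For the Gunicorn case, a minimal sketch of a gunicorn.conf.py (Gunicorn's default worker timeout is 30 seconds; 0 disables it):

# gunicorn.conf.py (sketch)
# Disable Gunicorn's own worker timeout so that long requests are bounded by
# the Cloud Run request timeout instead.
timeout = 0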

Priyashree Bhadra