I have an App Engine flex app that takes requests for some background computations and puts them in a task queue for processing. Requests are sent at a fairly constant rate from another process. After a fresh deploy, requests are processed quite quickly (milliseconds), but latency then climbs to seconds, then minutes, before everything becomes completely clogged. In Cloud Tasks I can see tasks running even when there are no tasks in the queue. These seem to use up instance resources and stay stuck for hours, well beyond any timeout. Once my instances get clogged up with these tasks, my other process can't make requests without timing out, even with a very high timeout. With auto-scaling, I thought App Engine was supposed to spin up more instances to handle incoming requests (source).
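The enqueueing side is roughly the sketch below (a simplified illustration using the google-cloud-tasks client; the project, location, queue name, and /process route are placeholders, not the real values):

import json

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholder project/location/queue names for illustration.
parent = client.queue_path("my-project", "us-central1", "pipeline-queue")

def enqueue(payload):
    # Each request from the other process becomes one App Engine task
    # targeting the pipeline service.
    task = {
        "app_engine_http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "relative_uri": "/process",
            "app_engine_routing": {"service": "pipeline"},
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode(),
        }
    }
    return client.create_task(request={"parent": parent, "task": task})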

The task handlers are not terribly complicated: they do some operations on a Google Spanner database and read from and write to GCS (IO-intensive).
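A stripped-down handler looks roughly like this (assuming a Flask app behind the gunicorn entrypoint below; the route name is a placeholder and the Spanner/GCS work is elided):

from flask import Flask, request

app = Flask(__name__)

@app.route("/process", methods=["POST"])
def process_task():
    payload = request.get_json(force=True, silent=True) or {}
    # ... Spanner reads/writes and GCS reads/writes happen here ...
    # Any 2xx response acknowledges the task to Cloud Tasks; anything else
    # (or a timeout) causes a retry per the queue's retry_config.
    return "", 200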

App configuration:

runtime: python
env: flex
service: pipeline
entrypoint: gunicorn -b :$PORT main:app --timeout 300
threadsafe: true

runtime_config:
  python_version: 3

Queue configuration:

app_engine_http_queue {
}
rate_limits {
  max_dispatches_per_second: 500.0
  max_burst_size: 100
  max_concurrent_dispatches: 1000
}
retry_config {
  max_attempts: 100
  min_backoff {
    nanos: 100000000
  }
  max_backoff {
    seconds: 3600
  }
  max_doublings: 16
}
state: RUNNING
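For reference, the retry_config above should work out to roughly the schedule sketched below (per the documented Cloud Tasks behavior, the delay starts at min_backoff, doubles up to max_doublings times, and is capped at max_backoff; exact timing may differ):

# Back-of-the-envelope retry schedule for the retry_config above.
MIN_BACKOFF_S = 0.1     # min_backoff: 100000000 ns
MAX_BACKOFF_S = 3600.0  # max_backoff: 3600 s

delay = MIN_BACKOFF_S
for attempt in range(1, 9):
    print(f"retry {attempt}: ~{delay:g}s after the previous failure")
    delay = min(delay * 2, MAX_BACKOFF_S)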

[Screenshots: "Orphaned tasks", "Increasing latencies"]

    You may be returning a non-2xx status code, which would fail the task and re-queue. – petomalina Apr 16 '19 at 16:27
  • In that case, Cloud Tasks would show the retries and the task would stay in the queue. The tasks in the queue don't show retries, and there are still more tasks running than tasks in the queue. The only non-200 status codes I get are 502 and 504 when things start getting clogged up. – Danielle Hanks Apr 16 '19 at 16:38
  • Can you see in stackdriver logs whether any requests are completing successfully? What is the time it takes each request to complete, and is it staying constant? – David Apr 16 '19 at 21:32
  • Initially, all requests complete successfully within ms. Within a minute or two, latency increases to seconds, then minutes. When enough tasks are "stuck", no more complete successfully. – Danielle Hanks Apr 16 '19 at 21:52
  • Side note: the doc source you mentioned is for the standard environment, the flex env one is [here](https://cloud.google.com/appengine/docs/flexible/python/how-instances-are-managed). See also https://stackoverflow.com/a/45842773/4495081 – Dan Cornilescu Apr 17 '19 at 03:17
  • I'm also seeing this issue - we have an orphaned task that is "running" in our queue, even though "tasks in queue" is 0. We think it might have to do with instance scaling, but we're having trouble clearing the queue properly to know that for sure. Did you ever figure out more details or a fix? – tpw Aug 14 '19 at 00:45

0 Answers