
I have an endpoint that receives a batch of 10k-20k records. It returns a job id and launches deferred tasks to process the records in parallel. Sometimes one of the newly spun-up instances grabs a few of the tasks but never actually processes them, as if the instance died instantly.
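The fan-out itself is simple; here is a minimal sketch of the splitting step, where the sub-batch size and the `process_sub_batch` task function are hypothetical stand-ins (the real handler would enqueue one deferred task per chunk):

```python
def chunk(records, size):
    """Split the incoming batch into fixed-size sub-batches, one per deferred task."""
    return [records[i:i + size] for i in range(0, len(records), size)]

# In the endpoint handler (sketch): generate a job id, then fan out, e.g.
# for sub_batch in chunk(records, 50):
#     deferred.defer(process_sub_batch, job_id, sub_batch)
```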

Eventually those tasks hit their 10 minute timeout and are launched again.

If I find one of these tasks and filter by the id of instance that was running it, this is what I see in Google Logs Viewer:

[screenshot: Logs Viewer entries for the instance]

Most of the log entries just have this message "Process terminated because the request deadline was exceeded during a loading request." The message's timestamp is 10 minutes after the timestamp of the request.

One has this stack trace:

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 351, in __getattr__
    self._update_configs()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 283, in _update_configs
    self._lock.acquire()
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/threading.py", line 170, in acquire
    self.__count = self.__count + 1
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.

Another has this one:

(/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py:252)
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 351, in __getattr__
    self._update_configs()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 287, in _update_configs
    self._registry.initialize()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 160, in initialize
    import_func(self._modname)
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/appengine_config.py", line 12, in <module>
    vendor.add('lib')
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/vendor/__init__.py", line 40, in add
    elif os.path.isdir(path):
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/genericpath.py", line 52, in isdir
    return stat.S_ISDIR(st.st_mode)
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.

Another has this one:

(/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py:252)
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 351, in __getattr__
    self._update_configs()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 287, in _update_configs
    self._registry.initialize()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 160, in initialize
    import_func(self._modname)
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/appengine_config.py", line 14, in <module>
    from lib import requests
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/__init__.py", line 52, in <module>
    from .packages.urllib3.contrib import pyopenssl
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/__init__.py", line 27, in <module>
    from . import urllib3
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/urllib3/__init__.py", line 8, in <module>
    from .connectionpool import (
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/urllib3/connectionpool.py", line 29, in <module>
    from .connection import (
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/urllib3/connection.py", line 39, in <module>
    from .util.ssl_ import (
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/urllib3/util/__init__.py", line 3, in <module>
    from .connection import is_connection_dropped
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/urllib3/util/connection.py", line 145, in <module>
    HAS_IPV6 = _has_ipv6('::1')
  File "/base/data/home/apps/s~myappid/dev.403063962077465992/lib/requests/packages/urllib3/util/connection.py", line 135, in _has_ipv6
    sock.bind((host, 0))
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/socket.py", line 227, in meth
    return getattr(self._sock,name)(*args)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 663, in bind
    self._CreateSocket(bind_address=address)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 609, in _CreateSocket
    'remote_socket', 'CreateSocket', request, reply)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 95, in MakeSyncCall
    return stubmap.MakeSyncCall(service, call, request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 329, in MakeSyncCall
    rpc.CheckSuccess()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
    elif self.exception:
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 136, in exception
    @property
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

The main issue is that I need to finish processing the batch within 5-10 minutes.

Each record in the batch should only take a minute to process, so one solution would be raising the 10 minute timeout, but Google support said that isn't possible.

I tried implementing warmup requests to address the loading requests, but that seemed to have no impact.
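For reference, warmup requests on the Python 2.7 standard runtime take two pieces: opting in via `inbound_services` and routing `/_ah/warmup` to a handler. A minimal app.yaml sketch (the `warmup.app` module name is hypothetical; it just needs to import your application code so the instance is pre-loaded):

```yaml
# app.yaml fragment: opt in to warmup requests
inbound_services:
- warmup

handlers:
# App Engine sends GET /_ah/warmup to a new instance before routing real traffic to it
- url: /_ah/warmup
  script: warmup.app  # hypothetical WSGI app whose module imports pull in your code
```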

  • `Each record in the batch should only take a minute to process` - how did you estimate this? What's your deferred task doing? Your `divide-sub-batches` handler also seems to take too long... – Dan Cornilescu Aug 01 '17 at 02:07
  • Thats estimated based on the successful tasks. Everything else finishes on time except for these stragglers. Each deferred task performs up to 3 ndb queries, an outbound python-requests call to braintree, and 1 put. And yea when this happens, all requests hit that timeout (which is 600s for push queue tasks) – Alex Aug 01 '17 at 02:35
  • All 3 tracebacks seem to suggest your app was still initializing, didn't even get to the actual handler code, so something's really odd. Can't guess without the code, tho. – Dan Cornilescu Aug 01 '17 at 13:05
  • I think the problem was GAE's beta socket & ssl library. The python requests seems to be doing more and more with each release to handle app engine better. Upgrading python requests fixed the issue. I posted an answer with more details. – Alex Aug 01 '17 at 19:18

2 Answers


My previous answer worked for a glorious 24 hrs before the issue appeared again. It felt like the issue became less frequent after upgrading python requests from 2.12.3 to 2.18.2, but it's hard to say.

Regardless, the solution that ultimately worked was editing the source code of urllib3 (it has been 5 days now and the issue has not recurred).

In the file urllib3/util/connection.py, hardcode _has_ipv6(host) to always return False (it always returns False on App Engine anyway):

def _has_ipv6(host):
    """ Returns True if the system can bind an IPv6 address. """
    return False  # patch: skip the probe entirely; everything below is now unreachable
    sock = None
    has_ipv6 = False

    if socket.has_ipv6:
        # has_ipv6 returns true if cPython was compiled with IPv6 support.
        # It does not tell us if the system has IPv6 support enabled. To
        # determine that we must bind to an IPv6 address.
        # https://github.com/shazow/urllib3/pull/611
        # https://bugs.python.org/issue658327
        try:
            sock = socket.socket(socket.AF_INET6)
            sock.bind((host, 0))
            has_ipv6 = True
        except Exception:
            pass

    if sock:
        sock.close()
    return has_ipv6


HAS_IPV6 = _has_ipv6('::1')

I believe the root cause is the call to sock.bind((host, 0)), which hangs. This matches Google's docs (https://cloud.google.com/appengine/docs/standard/python/sockets/):

You cannot bind to specific IP addresses or ports.

I was able to create an isolated GAE sample project that showcases the issue. Install the third-party libraries into a folder called lib with `pip install requests requests_toolbelt -t lib`, then create these files:

app.yaml:

application: #your project id here
version: dev
runtime: python27
api_version: 1
threadsafe: true
inbound_services:
- warmup

automatic_scaling:
  min_idle_instances: 0
  max_concurrent_requests: 8 # default value

env_variables:
  GAE_USE_SOCKETS_HTTPLIB: 'true'

builtins:
- appstats: on #/_ah/stats/
- remote_api: on #/_ah/remote_api/
- deferred: on

handlers:
- url: /.*
  script: main.app

libraries:
- name: jinja2
  version: "2.6"
- name: webapp2
  version: "2.5.2"
- name: markupsafe
  version: "0.15"
- name: ssl
  version: "2.7.11"
- name: pycrypto
  version: "2.6"
- name: lxml
  version: latest

main.py:

import webapp2
import requests
import time

from google.appengine.api.taskqueue import taskqueue
from google.appengine.api import app_identity
from google.appengine.ext import deferred, ndb


class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.write('''<form method="POST"><input type="submit" value="Launch"></form>''')

    def post(self):
        queue = taskqueue.Queue()
        futures = [queue.add_async(taskqueue.Task(url="/task")) for _ in xrange(0, 2000)]
        ndb.Future.wait_all(futures)
        print 'launched'
        self.get()


class TaskHandler(webapp2.RequestHandler):
    def post(self):
        try:
            r = requests.post("https://"+app_identity.get_application_id()+".appspot.com/post")
            print r.text
        except Exception as e:
            print str(e)


class RequestFromTaskHandler(webapp2.RequestHandler):
    def post(self):
        time.sleep(2)
        self.response.write('responded')


app = webapp2.WSGIApplication([
    ('/', MainHandler),
    ('/_ah/warmup', MainHandler),
    ('/task', TaskHandler),
    ('/post', RequestFromTaskHandler),
], debug=True)

appengine_config.py:

import os

from google.appengine.ext import vendor

vendor.add('lib')

import requests
from requests_toolbelt.adapters import appengine as requests_toolbelt_appengine

# Use the App Engine Requests adapter. This makes sure that Requests uses
# URLFetch.
requests_toolbelt_appengine.monkeypatch()

queue.yaml:

queue:
- name: default
  rate: 100/s
  bucket_size: 500
  max_concurrent_requests: 1000

Essentially a POST to MainHandler will launch 2000 deferred tasks that try to create an outbound request (to itself).

Go to https://console.cloud.google.com/appengine/taskqueues to monitor the batch. You should see most of the tasks finish quickly, except for a few that will continue to show as running for 10 minutes.

[screenshot: task queue console showing the stuck tasks]

After they finish they will then appear in logs viewer.

For some reason, the issue only happens on the first batch of 2000 unless I disable and then re-enable the project.

Once you have been able to replicate the issue, make the change to urllib3 and it shouldn't happen anymore.

This was with urllib3 version 1.22.

  • This is an excellent writeup of the problem. It looks like urllib3 does not export that function, and you do indeed need to edit the file. – speedplane Apr 15 '21 at 20:13
  • I'll add however, that I don't think this is really the issue. I SSHed into my live server, and did `sock.bind(('::1', 0))` and I got an immediate exception, not a timeout. – speedplane Apr 15 '21 at 20:26

It seems that you are running with automatic scaling, which has a ten minute time limit for task queue requests. You could try switching to a different scaling type, such as manual scaling. See: https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
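If you do switch, the change is the scaling stanza in app.yaml; a sketch (the instance class and count are assumptions you would tune for your load):

```yaml
# app.yaml fragment: replace automatic_scaling with manual scaling,
# which raises the task deadline from 10 minutes to up to 24 hours
instance_class: B4        # manual scaling requires a basic (B*) instance class
manual_scaling:
  instances: 5            # hypothetical fixed instance count
```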

  • I am using Automatic scaling. I did notice that, but I was worried that abandoning google's dynamic scaling would be an entire project itself. – Alex Aug 01 '17 at 19:19