
I am uploading a file that is >10MB from App Engine to Google Cloud Storage via the code below.

gcs.bucket(bucket_name).blob(blob_name=file_path).upload_from_string(data, content_type=content_type)

I am using the GCS Python Client Library and not the built-in App Engine library because I am composing multiple >10MB files into a single file in Cloud Storage when the process is complete.
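For context, a minimal sketch of that upload-and-compose flow with the google-cloud-storage client is below; the bucket name, blob names, and helper functions are placeholders rather than my actual code.

# Sketch of the upload-then-compose flow described above (placeholder names).
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-report-bucket')  # placeholder bucket name

def upload_part(blob_name, csv_string):
    # Upload one >10MB CSV-formatted string as its own object.
    blob = bucket.blob(blob_name)
    blob.upload_from_string(csv_string, content_type='text/csv')
    return blob

def compose_parts(final_name, part_blobs):
    # Server-side compose of the per-batch objects into one final CSV.
    # GCS compose accepts up to 32 source objects per call.
    final_blob = bucket.blob(final_name)
    final_blob.compose(part_blobs)
    return final_blob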

The code is running in a task and has 10 minutes to get the data and upload it as a CSV to GCS. The data is retrieved and converted into a CSV-formatted string in less than 3 minutes. The code then tries to upload the data to GCS; Stackdriver Logging stops receiving logs, and after waiting ~10 minutes I receive a flood of logs in Stackdriver up to the point of failure, with the failure being:

DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
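To help narrow down which phase is actually eating the 10 minutes (the data fetch, the CSV conversion, or the GCS upload), a rough timing wrapper like the one below can be dropped around each step; fetch_records, build_csv, and upload_csv are hypothetical stand-ins for the real functions in my task handler.

# Rough sketch: log how long each phase takes so Stackdriver shows where the
# deadline is being spent. The functions in the usage comments are hypothetical.
import logging
import time

def timed(label, func, *args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    logging.info('%s took %.1f seconds', label, time.time() - start)
    return result

# records = timed('fetch', fetch_records, report_id)
# csv_data = timed('csv conversion', build_csv, records)
# timed('gcs upload', upload_csv, csv_data)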

This issue is frustrating for two reasons.

  1. This error is intermittent.
  2. Once one file succeeds, they all succeed in seconds.

    1. During initial development the issue never occurred. It has only recently started to appear and is becoming more frequent.

    2. The first >10MB file always takes minutes to fail or succeed: failures come after the full 10 minutes, while successes may take anywhere from 1 to 9 minutes. Once a file succeeds, all subsequent uploads of >10MB files take ~5-10 seconds.

My theory is that App Engine relies on some service to upload files to Google Cloud Storage, and that this service automatically goes to sleep after a certain period of no usage. While the service is asleep it takes a very long time to wake back up; once it is awake, it uploads to GCS very quickly without any issues.

Has anyone else run into this or have ideas on how to solve it?

UPDATE

Full error:

(/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py:279)
Traceback (most recent call last):
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
    result = handler(dict(self._environ), self._StartResponse)
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/lib/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/lib/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/lib/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/lib/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/worker.py", line 277, in cache_records
    cache_module.save_records(records=records, report_fields=report.report_fields, report_id=report.report_id, header=header_flag)
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/storage/user/user.py", line 110, in save_records
    user_entry = User.__generate_user_csv(user=user, report_fields=report_fields)
  File "/base/data/home/apps/s~pg-gx-n-app-200716/worker:20181030t154529.413639922318911836/storage/user/user.py", line 55, in __generate_user_csv
    for index, attrib in enumerate(report_fields):
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.

JMKrimm

1 Answer


That 'failing after 10 minutes' sounds very similar to an issue I experienced a while back, where processes on a new instance would sometimes just hang until they hit their timeout before dying:

app engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout

Can you provide the full traceback? Also try filtering by instance ID in the logs to see whether anything else crashed at the same time.

Some generic quick-fixes to try would be:

  1. Implementing warmup requests (a minimal handler sketch follows this list): https://cloud.google.com/appengine/docs/standard/python/configuring-warmup-requests
  2. Bumping up your instance class size: https://cloud.google.com/appengine/docs/standard/#instance_classes
  3. Isolating this task on a separate microservice so that it doesn't have to compete for resources with the rest of your request handlers: https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine
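A minimal warmup handler sketch for #1, assuming a Flask app to match your traceback, would look something like this (you'd also need warmup enabled in app.yaml):

# Sketch of a warmup handler (quick-fix #1), assuming Flask as in the traceback.
# Requires enabling warmup requests in app.yaml:
#   inbound_services:
#   - warmup
from flask import Flask

app = Flask(__name__)

@app.route('/_ah/warmup')
def warmup():
    # Do expensive one-time initialization here (clients, caches, etc.) so the
    # first real request on a new instance doesn't pay the startup cost.
    return '', 200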
Alex
  • 1. The issue persists even if the instance has been up for a while. 2. I am using the highest instance class, F4_1G. 3. The issue still occurs when the task is the only task running. – JMKrimm Nov 01 '18 at 11:20
  • Hmm, yea I'm not sure. When you say `The code is running in a task and has 10 minutes to get the data` where is it getting the data from? Taskqueue & app engine in general is geared towards handling numerous small tasks. I use Google Dataflow to export Datastore records to GCS as 100MB CSV files. You should use that if you can, and if not then possibly look into GAE Flex where you'll have more control & longer timeouts. – Alex Nov 02 '18 at 00:04
  • Thanks Alex. I am pulling the data from the G Suite directory API. I am doing batches of 20,000 users, converting the JSON return into a CSV formatted string and then saving the CSV string in GCS. I have determined the timeout is because of the conversion from JSON to CSV. For some reason the processing power fluctuates widely. I ran a test and it took 8 minutes to convert to CSV and then immediately ran it again and it took 3 minutes. My solution has been to reduce the batch from 20,000 to 10,000. Hopefully this will ensure the wide fluctuation will stay within the 10 minutes. – JMKrimm Nov 05 '18 at 11:55
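Following up on that last comment, a rough sketch of batching the JSON records into smaller CSV chunks might look like the following; report_fields, the record layout, and the helper names are placeholders, not the actual worker code.

# Rough sketch of converting batches of JSON user records to CSV strings, so
# each conversion stays well inside the deadline. report_fields and the record
# layout are placeholders; StringIO is used because this runs on python27.
import csv
from StringIO import StringIO

def records_to_csv(records, report_fields, header=False):
    buf = StringIO()
    writer = csv.writer(buf)
    if header:
        writer.writerow(report_fields)
    for record in records:
        writer.writerow([record.get(field, '') for field in report_fields])
    return buf.getvalue()

def in_batches(records, batch_size=10000):
    # Yield successive slices of the record list, e.g. 10,000 users at a time.
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]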