
UPDATE: I've narrowed the problem down a bit, so I'm removing unnecessary code and examples.

UPDATE 2: After leaving the cron job running at 12-hour intervals for quite some time (each run ending in success, but with nothing written to BQ), we were surprised to discover that, after about a week, one of the cron jobs did successfully write to BigQuery, along with Stackdriver logs stating that "this request caused a new process to be started for your application (...)", as copied below. The following jobs again stopped writing. Now I'm wondering whether this is somehow connected to cached app state (with some expiration period) or to a credential expiration date, either of which might prevent further writes to BigQuery after the first one without producing errors.

Problem description:

I am trying to set up a cron job in App Engine (standard environment) to query from and write data back to BigQuery (the datasets are in the same project as the deployed app). The cron jobs execute successfully, but they only write to BigQuery on the first execution after deployment; afterwards they still execute with success but do not write.

The main difference I've found is in the Stackdriver logs: for executions which correctly write there are extra debug and info messages, while for subsequent ones there are no such messages:

2018-04-19 04:44:03.933 CEST
Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None) (/base/data/home/apps/e~<redacted>/lib/urllib3/util/retry.py:200)
2018-04-19 04:44:04.154 CEST
Making request: POST https://accounts.google.com/o/oauth2/token (/base/data/home/apps/e~<redacted>/lib/google/auth/transport/requests.py:117)
2018-04-19 04:44:04.160 CEST
Starting new HTTPS connection (1): accounts.google.com (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:824)
2018-04-19 04:44:04.329 CEST
https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:396)
2018-04-19 04:44:04.339 CEST
Starting new HTTPS connection (1): www.googleapis.com (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:824)
2018-04-19 04:44:04.802 CEST
https://www.googleapis.com:443 "POST /bigquery/v2/projects/<redacted>/jobs HTTP/1.1" 200 None (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:396)
2018-04-19 04:44:04.813 CEST
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

I've tried:

  • Adding the BigQuery DataOwner and User permissions to the default App Engine service account, but there was no effect.

  • There are mentions that the google.cloud library is not fully supported on standard App Engine, so I tried using OAuth2/httplib2/googleapiclient credentials to authenticate instead, but this is the first time I've tried it and I don't understand how to put the pieces together (my best attempt at a sketch is shown after this list), and without the google.cloud library I don't know how to even write a proper query for BQ.

  • Other credential setup approaches as suggested below, but it seems that connecting to BQ is not the issue; they all connect and write (once), it's repeating the write within an already deployed App Engine app that fails.
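
For reference, my best guess at how those OAuth2/httplib2/googleapiclient pieces fit together (a minimal, untested sketch; the project ID is redacted like everywhere else in this post):

from oauth2client.contrib.appengine import AppAssertionCredentials
from googleapiclient import discovery
import httplib2

# Authenticate as the default App Engine service account.
credentials = AppAssertionCredentials(
    'https://www.googleapis.com/auth/bigquery')
http = credentials.authorize(httplib2.Http())

# Build a BigQuery v2 client and run a synchronous query.
bq = discovery.build('bigquery', 'v2', http=http)
result = bq.jobs().query(
    projectId='<redacted>',
    body={'query': "SELECT CURRENT_DATETIME() AS Datetime, 'CRON' AS Source",
          'useLegacySql': False}).execute()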

Below is the full implementation:

app.yaml:

runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /bigquerycron
  script: bigquerycron.app
  login: admin

libraries:
- name: ssl
  version: latest

env_variables:
  GAE_USE_SOCKETS_HTTPLIB: 'true'

bigquerycron.py:

from __future__ import absolute_import
from google.cloud import bigquery
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')          

def writeDataTest(dataset_id = '<redacted>',table_id='<redacted>'):
    client = bigquery.Client.from_service_account_json("credentials.json")
    job_config = bigquery.QueryJobConfig()
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = 'WRITE_APPEND'

    query_job = client.query(
    """SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source""", job_config=job_config)

writeDataTest()

app = webapp2.WSGIApplication([
      ('/bigquerycron', MainPage),
], debug=True)

cron.yaml:

cron:
- url: /bigquerycron
  schedule: every 30 minutes
– Vaeqal
  • Are you running this on a compute engine or are you logging into the cloud shell every time you want to perform the operation? – Ben P Apr 16 '18 at 10:50
  • This is running as an app deployed with App Engine, for testing I run these scripts (executing .py file manually) through cloud shell and with that I have no issues, but the problem is when it's scheduled with automated cron jobs. – Vaeqal Apr 16 '18 at 11:16
  • Have you tried scheduling with Crontab on the instance instead of Cron jobs, just to eliminate this as a potential issue? Also, you could explicitly add .json client credentials to your Python code to use a service account instead? – Ben P Apr 16 '18 at 14:06
  • I want to use cron jobs because that's just the first use case we want to implement with it. As for credentials I never worked with them before (any form of authentication really), could you modify the first bigquerycron.py from my post with an example so I could test it? – Vaeqal Apr 17 '18 at 07:20
  • I use `client = bigquery.Client.from_service_account_json("my_file.json")` in my automated Python script, where this file is in the local directory on the instance. Try using this in your code and adding in your .json credentials file. – Ben P Apr 17 '18 at 08:51
  • Tried it and there is no change, but I have no idea how to debug it, whether the file is in the wrong place or permissions are wrong. I placed the file in the same folder where the .py file is and service account has Editor permissions, same as me. I even tried to add BigQuery DataOwner and BigQuery user roles to it, but again I can't even compare the results. Code executes successfully, but fails somewhere on BQ side and leaves no logs as to why. – Vaeqal Apr 17 '18 at 14:14
  • are you running in standard appengine or flexible? – Willian Fuks Apr 19 '18 at 23:54
  • this is standard appengine and I needed to add *env_variables* in app.yaml to make it work at all – Vaeqal Apr 20 '18 at 07:40

3 Answers


Credentials were not the problem in this specific case; the issue is simply the placement of the function call, due to a misunderstanding of how App Engine works. The BigQuery function call should be moved inside the MainPage class definition. The fixed bigquerycron.py looks like this (only a single line of code is moved):

from __future__ import absolute_import
from google.cloud import bigquery
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')          
        writeDataTest()  # moved here so it runs on every request

def writeDataTest(dataset_id = '<redacted>',table_id='<redacted>'):
    client = bigquery.Client.from_service_account_json("credentials.json")
    job_config = bigquery.QueryJobConfig()
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = 'WRITE_APPEND'
    query_job = client.query(
    """SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source""", job_config=job_config)

app = webapp2.WSGIApplication([
      ('/bigquerycron', MainPage),
], debug=True)

The version in the OP indeed only writes to BigQuery once, when the App Engine app is loaded for the first time; all subsequent requests just execute the MainPage handler, which in this case does nothing, as the actual BigQuery call sits outside of it at module level and therefore only runs on instance startup.
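
To make the mechanism explicit (a minimal sketch, not part of the fix itself): module-level statements run once, when App Engine starts a new instance, while handler methods run on every request:

import logging
import webapp2

# Runs once per instance startup (the "new process ... started for
# your application" entry in the Stackdriver logs).
logging.info('module loaded')

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Runs on every request, including every cron invocation.
        logging.info('request handled')

app = webapp2.WSGIApplication([('/bigquerycron', MainPage)])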

Additionally, it would be beneficial to rewrite the app without the google-cloud-python library, which is not supported in GAE Standard (https://github.com/GoogleCloudPlatform/google-cloud-python/issues/1893). This is especially unfortunate since even the official BigQuery documentation for Python (https://cloud.google.com/bigquery/docs/) makes use of this library. There is a variety of workarounds to keep using it, however, including some mentioned in the linked GitHub issue and also here: Using gcloud-python in GAE; a similar workaround was used in this example.

But as mentioned, it would be best to use the dedicated Google API Client Library for Python: https://developers.google.com/api-client-library/python/
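
A minimal sketch of the same write-append query with that library (untested here; dataset and table IDs are placeholders, redacted like the rest of the post), using the BigQuery v2 jobs.insert method:

from oauth2client.client import GoogleCredentials
from googleapiclient import discovery

credentials = GoogleCredentials.get_application_default()
bq = discovery.build('bigquery', 'v2', credentials=credentials)

# Equivalent of the google.cloud version above: run a query and
# append its result to a destination table.
job_body = {
    'configuration': {
        'query': {
            'query': "SELECT CURRENT_DATETIME() AS Datetime, 'CRON' AS Source",
            'useLegacySql': False,
            'destinationTable': {
                'projectId': '<redacted>',
                'datasetId': '<redacted>',
                'tableId': '<redacted>',
            },
            'writeDisposition': 'WRITE_APPEND',
        }
    }
}
bq.jobs().insert(projectId='<redacted>', body=job_body).execute()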

– Vaeqal

I suspect that if you remove the "login: admin" part of the app.yaml, it will work.

If that's the issue, please ensure you have the proper X-Appengine header check set up instead, so the handler stays protected; a sketch follows below.

Here's some documentation for task queues and for cron jobs.
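
A minimal sketch of such a check in the handler (this relies on the documented behaviour that App Engine sets X-Appengine-Cron: true on cron requests and strips X-Appengine-* headers from external traffic):

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Reject anything that did not come from the cron service,
        # so login: admin is no longer needed to protect the URL.
        if self.request.headers.get('X-Appengine-Cron') != 'true':
            self.abort(403)
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')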

– Ying Li
  • All it did was open the url to be accessed by anyone, both admin login and cron jobs could already access and trigger the page. Code execution on app engine side isn't the problem but rather what happens when BigQuery is accessed. – Vaeqal Apr 18 '18 at 07:45

Although I am not sure of the reason, I think authorizing the App Engine service account is not enough for accessing BigQuery.

In order to authorize your app to access BigQuery, you can use either of the following two methods:

  1. Within the app.yaml file, configure an environment variable pointing to a service account key file with the correct authorization for BigQuery:

    env_variables:
      GOOGLE_APPLICATION_CREDENTIALS: '[YOURKEYFILE].json'

  2. Your code fetches an authorized service account key from a bucket and then loads it with the help of the Cloud Storage client library. Since your runtime is Python, the code you could use is the following:

    # ...
    from google.cloud import bigquery
    from google.cloud import storage
    # ...

    def download_key():
        """Downloads the key from the bucket."""
        storage_client = storage.Client()
        bucket = storage_client.get_bucket('YOURBUCKET')
        blob = bucket.blob('Keynameinthebucket.json')
        blob.download_to_filename('Keynameinyourapp.json')

    # ...

    # within the code:
    download_key()
    client = bigquery.Client.from_service_account_json('Keynameinyourapp.json')
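
    Note that the local filesystem in the App Engine standard sandbox is read-only, so download_to_filename may fail there. A variation of the same idea (a sketch; bucket and blob names are placeholders) keeps the key in memory instead:

    import json
    from google.cloud import bigquery, storage
    from google.oauth2 import service_account

    def make_bq_client():
        # Read the key straight from the bucket instead of writing
        # it to the (read-only) local filesystem.
        storage_client = storage.Client()
        bucket = storage_client.get_bucket('YOURBUCKET')
        key_info = json.loads(
            bucket.blob('Keynameinthebucket.json').download_as_string())
        credentials = service_account.Credentials.from_service_account_info(
            key_info)
        return bigquery.Client(project=key_info['project_id'],
                               credentials=credentials)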

– Ggrimaldo
  • Method 1 does not change the way the app behaves. With method 2 I get an *Read-only file system: 'Keynameinyourapp.json'* error, but I assume even after fixing that it would not impact the issue. I can write to BigQuery but only once per deployment with some exceptions (I made an edit to the OP) – Vaeqal May 30 '18 at 14:02