
I'm trying to automate JSON data upload to BigQuery using two Cloud Functions, which both deploy successfully, and a Cloud Scheduler job, which runs successfully. After the Cloud Scheduler job runs, the data gets uploaded to my Cloud Storage bucket, but it never gets loaded into BigQuery.

Below are my code and JSON data:

# function 1, triggered by HTTP
import json

import requests
from google.cloud import storage


def function(request):
    url = "https://api...."
    headers = {"Content-Type": "application/json",
               "Authorization": "..."}

    # Fetch the data from the API and pretty-print it
    response = requests.get(url, headers=headers)
    json_data = response.json()
    pretty_json = json.dumps(json_data, indent=4, sort_keys=True)

    # Upload the JSON string to Cloud Storage
    storage_client = storage.Client()
    bucket = storage_client.bucket("bucket_name")
    blob = bucket.blob("blob_name")

    blob.upload_from_string(pretty_json)
# function 2, triggered by Cloud Storage -> event type finalize/create
from google.cloud import bigquery


def function_2(data, context):
    client = bigquery.Client()

    table_id = "booming-post-322920:dataset_name.table_name"

    # Explicit schema for the load job
    job_config = bigquery.LoadJobConfig()
    job_config.schema = [
        bigquery.SchemaField("order_items", "INTEGER"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
        # ..... (remaining SchemaField entries elided)
        bigquery.SchemaField("updated_at", "TIMESTAMP")
    ]

    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

    uri = 'gs://bucket_name/blob_name'

    # Load the file from Cloud Storage into the BigQuery table
    load_job = client.load_table_from_uri(
        uri,
        table_id,
        location="US",
        job_config=job_config
    )

    load_job.result()  # wait for the load job to complete

This is what my JSON data `pretty_json` looks like:

{
    "records": [
        {
            "active": null,
            "approved": null,
            "buyer": [
                1
            ],
            "cancel_reason": null,
            "cancelled": null,
            "chef": [
                1
            ],
            "completed": null,
            "created_at": "2021-07-15T17:44:31.064Z",
            ...

Please advise.

  • Did you mean to say that the JSON is not getting loaded into BigQuery? Your problem description mentions issues around storage only. Assuming it's BigQuery that does not get loaded with the data, how are you triggering function_2? Do you run both functions at the same time? Wondering if there could be a race condition where the JSON is still being uploaded to the bucket and function_2 has already been triggered. – CaioT Aug 17 '21 at 12:21
  • No, my trigger for function_2 is when the Cloud Storage bucket gets updated. I'm suspecting maybe a problem with the JSON format and BigQuery. –  Aug 17 '21 at 12:33
  • @CaioT Yes, the JSON is not getting loaded into BigQuery –  Aug 17 '21 at 12:36
  • And from where are you getting the variable value 'blob_name' to build the URI uri = 'gs://bucket_name/blob_name'? – CaioT Aug 17 '21 at 12:45
  • @CaioT from here https://stackoverflow.com/questions/25373467/how-do-i-identify-the-google-cloud-storage-uri-from-my-google-developers-console –  Aug 17 '21 at 14:58

1 Answer


I think the main problem is the format of your JSON file: you are specifying the newline-delimited JSON format (bigquery.SourceFormat.NEWLINE_DELIMITED_JSON) required by BigQuery, but your JSON doesn't conform to that format.

Please, consider the following modifications to your first function:

import json

import requests
from google.cloud import storage


def function(request):
    url = "https://api...."
    headers = {"Content-Type": "application/json",
               "Authorization": "..."}

    response = requests.get(url, headers=headers)
    json_data = response.json()

    # Serialize every record on its own line (newline-delimited JSON)
    records = [json.dumps(record) for record in json_data["records"]]
    records_data = "\n".join(records)

    storage_client = storage.Client()
    bucket = storage_client.bucket("bucket_name")
    blob = bucket.blob("blob_name")

    blob.upload_from_string(records_data)

Your uploaded file will now look like the following:

{"active": null, "approved": null, "buyer": [1], "cancel_reason": null, "cancelled": null, "chef": [1], "completed": null, "created_at": "2021-07-15T17:44:31.064Z", "delivery": false, "delivery_address": null, "delivery_fee": null, "delivery_instructions": null, "discount": 0, "id": 1, "name": "Oak's Order", "notes": null, "order_delivery_time": null, "order_id": null, "order_ready_time": null, "order_submitted_time": null, "paid": null, "pickup_address": "", "promo_applied": null, "promo_code": null, "rated": null, "ratings": null, "review": null, "seller": [1], "status": "In Process", "tax": null, "tip": 0, "total": null, "type": "Pick Up", "updated_at": "2021-07-15T17:44:31.064Z"}
{"active": null, "approved": null, "buyer": [2], "cancel_reason": null, "cancelled": null, "chef": [1], "completed": null, "created_at": "2021-07-15T17:52:53.729Z", "delivery": false, "delivery_address": null, "delivery_fee": null, "delivery_instructions": null, "discount": 0, "id": 2, "name": "Shuu's Order", "notes": null, "order_delivery_time": null, "order_id": null, "order_ready_time": null, "order_submitted_time": null, "paid": null, "pickup_address": "", "promo_applied": null, "promo_code": null, "rated": null, "ratings": null, "review": null, "seller": [1], "status": "In Process", "tax": null, "tip": 0, "total": null, "type": "Pick Up", "updated_at": "2021-07-15T17:52:53.729Z"}

In addition, as @CaioT also pointed out in the comments, you need to change your second function's signature to accept two arguments, event and context, according to the GCS storage trigger event definition.
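For reference, a minimal sketch of what that change could look like; as a bonus, it builds the URI from the event itself instead of hard-coding it (the BigQuery load stays exactly as in your code):

# Sketch: a background function for a GCS finalize/create trigger receives
# (event, context); the event dict carries the bucket and object name.
def function_2(event, context):
    bucket_name = event["bucket"]  # bucket that received the file
    blob_name = event["name"]      # name of the object that was finalized
    uri = f"gs://{bucket_name}/{blob_name}"
    # ... run the load job against this uri exactly as in your code ...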

Also, please review the definition of the order_items field in your BigQuery schema definition: according to your JSON, that field does not exist.
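If you are unsure which fields your records actually contain, one option (just a sketch, and only appropriate if a fixed schema is not a requirement) is to let BigQuery infer the schema instead of declaring it by hand:

# Sketch: let BigQuery autodetect the schema from the newline-delimited JSON,
# avoiding mismatches such as declaring an order_items field that never appears.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)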

Pay attention to the limitations imposed by BigQuery when importing JSON data as well, especially when dealing with timestamps; for example, a TIMESTAMP column expects values like "2021-07-15T17:44:31.064Z", which your data already provides.

Finally, be sure your function has the necessary permissions to interact with BigQuery.

By default, at runtime your function will assume your App Engine service account, although you can provide a specific service account as well. Be sure that in either case the service account has the necessary permissions over BigQuery and your BigQuery table: basically, your service account must be a bigquery.user and a WRITER (or, equivalently, bigquery.dataEditor) on your dataset. Please, see the examples provided in the GCP documentation.
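Because these permissions are only evaluated at runtime, it can also help to surface load-job failures explicitly in the function logs (a sketch reusing the names from your function_2):

# Sketch: log load-job failures (including IAM errors) so they show up
# in Cloud Logging instead of failing silently.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
try:
    load_job.result()  # blocks until the job finishes; raises on failure
except Exception as exc:
    print(f"BigQuery load failed: {exc}")
    print(f"Job errors: {load_job.errors}")
    raise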

jccampanero
  • I updated the JSON format. Still no data uploaded to BigQuery. My function_2 trigger is Cloud Storage -> Finalize/Create –  Aug 17 '21 at 15:31
  • I am happy to hear that the problem is partially solved. I would try debugging your function. You can see the function execution logs in several ways (the browser UI, the `gcloud` command line). Please, see for instance [this related SO question](https://stackoverflow.com/questions/44556308/where-does-console-log-info-showup-for-google-cloud-functions). – jccampanero Aug 17 '21 at 16:33
  • Thanks. Now I see that I have this error in function_2 `TypeError: function_2() takes 1 positional argument but 2 were given` though it deployed successfully. –  Aug 17 '21 at 16:39
  • Your function_2 should take two arguments (if it's a GCS Pub/Sub notification): function_2(event, context) – CaioT Aug 17 '21 at 16:50
  • Got it, fixed that. Still no luck. function_1 triggered by http. function_2 triggered by cloud storage so function_2(data, context) –  Aug 17 '21 at 16:57
  • Thank you for pointing that out @CaioT, that is right. I suppose you mean a [GCS storage trigger](https://cloud.google.com/functions/docs/calling/storage). Regarding the current issue, it looks like you haven't got the necessary permissions. [By default](https://cloud.google.com/functions/docs/concepts/iam), your function will assume your App Engine service account although you can [provide a specific service account as well](https://cloud.google.com/functions/docs/securing/function-identity#individual). Be sure that in any case the service account has the necessary ... – jccampanero Aug 17 '21 at 17:09
  • ... [permissions](https://cloud.google.com/bigquery/docs/access-control) over bigquery and your bigquery table. – jccampanero Aug 17 '21 at 17:09
  • Currently, my functions and cloud scheduler are running with App Engine default service account. Wouldn't I get some errors when deploying if it had no permission? –  Aug 17 '21 at 17:13
  • Probably the problem is related to the BigQuery permissions. That service account must have the permissions to access BigQuery and your BigQuery information. Please, see the updated answer. In addition, no, the function can be deployed without problem because the permissions are only evaluated at runtime. – jccampanero Aug 17 '21 at 17:23
  • Thank you, I have done all the above suggestions. Does Cloud Scheduler also need BigQuery permissions? –  Aug 17 '21 at 17:55
  • You are welcome, I am happy to help. I do not think so, your permission problem happens in the second function and this function is triggered by a GCS storage event, not by Cloud Scheduler. – jccampanero Aug 17 '21 at 18:13
  • Got it working! Yes the problem was with my incoming JSON data from app side. –  Aug 19 '21 at 10:26
  • That is great @amnesic. I am very happy to hear that it is working properly!! – jccampanero Aug 19 '21 at 13:44