
I am currently exporting my data (from a destination table in BigQuery) to a bucket in GCS, programmatically, using the BigQuery API.

There is a constraint when exporting data from BigQuery to GCS - a single exported file cannot be larger than 1 GB.

  • Since the data in my destination table is more than 1 GB, I split the export into multiple file parts.
  • The number of parts the export is split into will obviously depend on the size of the data in my destination table.

Here is the code snippet for the function exportDataToGCS() where this is happening:

http = authorize()
bigquery_service = build('bigquery', 'v2', http=http)

query_request = bigquery_service.jobs()

DESTINATION_PATH = constants.GCS_BUCKET_PATH + canonicalDate + '/'
query_data = {
    'projectId': 'ga-cnqr',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': constants.PROJECT_ID,
                'datasetId': constants.DEST_TABLES_DATASET_ID,
                'tableId': canonicalDate,
            },
            'destinationUris': [DESTINATION_PATH + canonicalDate + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()

After this function is executed, in my GCS bucket, my files show up in the following manner:

[Image: File parts in GCS bucket after exporting from BigQuery]

However, I am curious to know whether there are any scenarios where the export was supposed to be split into 10 parts, but only 3 parts made it to the bucket because the above function failed.

That is, could there be a partial export?

Could something like a network drop, or the process running the function being killed, lead to this? Is this a blocking call, or asynchronous?

Thanks in advance.

Update 1: Status parameter in query response

This is how I am checking for the DONE status.

while True:
    status = query_request.get(projectId=constants.PROJECT_NUMBER, jobId=query_response['jobReference']['jobId']).execute()
    if 'DONE' == status['status']['state']:
        logging.info("Finished exporting for the date : " + stringCanonicalDate)
        return
activelearner
  • when you say "the above function failed" do you have an error message? or is your only indication of the failure the fact you're missing 3 parts? – Patrice Feb 27 '15 at 20:18
  • The query response that you get back has a parameter called 'Status'. When this is equal to "DONE", it means that the query completed. – activelearner Feb 27 '15 at 20:35
  • Check Update in my question. The first code snippet and the second code snippet are all within the try block. – activelearner Feb 27 '15 at 20:35
  • fair enough :). Status = 'DONE' doesn't mean that the errors field is empty though. You might want to look in there if there is something :). Could be that the main job is done, but one file returned an error – Patrice Feb 27 '15 at 20:39
  • I don't think I know which errors field you are implying? Could you please explain that? Thanks! – activelearner Feb 27 '15 at 21:54
  • It's a bigquery job. You have different fields (you looked at one 'status'). You just need to look at the error one :) I think the first field is 'error' and the one that's repeated, containing your errors, is 'errors' – Patrice Feb 27 '15 at 22:01

1 Answer


Partial exports are possible if the job fails for some reason mid-way through execution.

If the job is in the DONE state, and there are no errors in the job, then all the data has been exported.

I recommend waiting a bit between polls for job completion -- you can hit rate limit errors if you poll too quickly, and given that these async jobs are not fast, millisecond accuracy isn't needed.

Working with your sample code, here is how you could test for the presence of errors:

while True:
    status = query_request.get(projectId=constants.PROJECT_NUMBER, jobId=query_response['jobReference']['jobId']).execute()
    if 'DONE' == status['status']['state']:
        if 'errorResult' in status['status']:
            logging.error("Error exporting for the date : " + stringCanonicalDate)
            logging.error(status['status']['errorResult'])
            return False
        logging.info("Finished exporting for the date : " + stringCanonicalDate)
        return True
    time.sleep(1)
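
As an extra sanity check once the job is DONE with no errors, an extract job's statistics report how many files were written per destination URI pattern (statistics.extract.destinationUriFileCounts), which you can compare against the objects you actually see in the bucket. A minimal sketch, placed just before returning True in the loop above; the exact field path is from memory, so double-check it against your jobs.get response:

# Sketch: cross-check BigQuery's reported file count once the job is DONE.
# `status` is the jobs.get() response used in the polling loop above.
extract_stats = status.get('statistics', {}).get('extract', {})
file_counts = extract_stats.get('destinationUriFileCounts', [])

# One count per entry in destinationUris; a single wildcard URI yields one entry.
if file_counts:
    logging.info("BigQuery reports %s file part(s) written for %s",
                 file_counts[0], stringCanonicalDate)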

To be super robust, you could also catch HTTP errors that occasionally occur while in a polling wait loop. It looks like you are using the python apiclient, which raises apiclient.errors.HttpError on such failures.
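
For example, here is a minimal sketch of tolerating those transient failures inside the polling loop (the retry and backoff choices are illustrative, not prescriptive):

import time

from apiclient.errors import HttpError

while True:
    try:
        status = query_request.get(projectId=constants.PROJECT_NUMBER,
                                   jobId=query_response['jobReference']['jobId']).execute()
    except HttpError as err:
        # Transient polling failure; log it and retry on the next iteration.
        logging.warning("Polling failed, will retry: %s", err)
        time.sleep(5)
        continue

    if status['status']['state'] == 'DONE':
        if 'errorResult' in status['status']:
            logging.error(status['status']['errorResult'])
            return False
        return True

    time.sleep(1)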

Michael Sheldon