
I have recently been looking for a way to load massive amounts of data into a BigQuery table using a UDF. So I tried the recommended method like this:

#standardSQL
INSERT INTO `project.dataset.Quincy` (id, col)
WITH array_to_loop_through AS (
  SELECT id 
  FROM UNNEST(GENERATE_ARRAY(1, 1000, 1)) id
)
SELECT id, CONCAT('Rank: ', CAST(id AS STRING))
FROM array_to_loop_through

With the GENERATE_ARRAY upper bound raised to 1,000,000, this took 8 seconds to add 1 million rows to the table. So I applied the same pattern to my UDF:

CREATE TEMPORARY FUNCTION myFunc()
  RETURNS ARRAY<STRING>
  LANGUAGE js AS
"""
  // Build an array of 50 random 12-character strings.
  var a = [];
  for (var i = 0; i < 50; i++) {
    a.push(randomString(12));
  }
  return a;
"""
OPTIONS (
  library="gs://kaneki110299/tester.js"
);



#standardSQL
INSERT INTO `Lambert.fortune` (password)
WITH array_to_loop_through AS (
  SELECT id
  FROM UNNEST(myFunc()) id
)
SELECT id
FROM array_to_loop_through
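
For reference, the randomString helper in tester.js comes from the Stack Overflow answer linked in the comments below; a minimal sketch of what the file presumably looks like (an assumption, not the actual file):

// Assumed content of gs://kaneki110299/tester.js (a sketch based on the
// answer linked in the comments, not the real file).
function randomString(length) {
    var chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
    var result = '';
    for (var i = 0; i < length; i++) {
        result += chars.charAt(Math.floor(Math.random() * chars.length));
    }
    return result;
}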

When I run this query on BigQuery, it runs for 5 minutes and then hits a UDF timeout with just 50 values. The same error happened when I put the loop inside tester.js. So I tried a different way:

CREATE TEMPORARY FUNCTION myFunc()
  RETURNS STRING
  LANGUAGE js AS
"""
  return randomString(12);
"""
OPTIONS (
  library="gs://kaneki110299/tester.js"
);



#standardSQL
INSERT INTO `Lambert.fortune` (password)
VALUES (myFunc()), (myFunc()), (myFunc())... -- 1,000 times

Unlike the previous query, this one takes only 30 seconds to add 1,000 values from my UDF result to the table. It looks like loops don't run well, or quickly, inside a BigQuery UDF.

Is it possible to use parallelism, or does BigQuery support any other way to optimize CPU performance when running user-defined functions, so that massive amounts of data can be inserted into a dataset? I am trying to add 1 billion values to the table, so the last method I used doesn't seem practical.
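
For reference, this is roughly the combination of a built-in generator and my UDF that I have in mind (a sketch only, untested; the seed parameter is a dummy argument added so the engine evaluates the function once per generated row, and it is FLOAT64 because JavaScript UDFs do not accept INT64 inputs):

#standardSQL
CREATE TEMPORARY FUNCTION myFunc(seed FLOAT64)
  RETURNS STRING
  LANGUAGE js AS
"""
  // seed is unused; it only forces per-row evaluation.
  return randomString(12);
"""
OPTIONS (
  library="gs://kaneki110299/tester.js"
);

INSERT INTO `Lambert.fortune` (password)
SELECT myFunc(id)
FROM UNNEST(GENERATE_ARRAY(1, 1000000, 1)) id;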

  • BigQuery runs everything in parallel; this is why it's such an amazing big data solution. You should check your execution report, which is located in the Details tab of the web UI, to see why the execution takes time. If you can provide a screenshot, it will be easier to troubleshoot your problem. – Tamir Klein Apr 09 '19 at 05:55
  • I don't know what screenshot to put here. You just have to copy the query, run it in the query editor, and it returns the result. The randomString function I used was from https://stackoverflow.com/questions/1349404/generate-random-string-characters-in-javascript. I was hoping I could combine a BigQuery built-in function with my UDF to generate a big array and then insert it into the table, like 'GENERATE_ARRAY(1, 1000, 1)'. – Toan Nguyen Apr 09 '19 at 07:31
  • See this video https://www.youtube.com/watch?v=UueWySREWvk – Tamir Klein Apr 09 '19 at 07:59
  • @TamirKlein This is a bit beside the question, but can I use require() from Node.js to call an external library in my js file? Does BigQuery allow that? – Toan Nguyen Apr 09 '19 at 10:43

1 Answer


Using the googleapis npm package, you can write a JS program that runs multiple inserts in parallel.

This is a full working mocha test showing how to use the API for a one-time call. You can wrap the inner call with your own loop, as shown in the sketch after the test, to run the inserts in parallel.

if (!global._babelPolyfill) {
    var a = require("babel-polyfill")
}

import {google} from 'googleapis'

let bigQuery = google.bigquery("v2")

describe('Run query with API', () => {

    it('Run a query', async () => {
        await test('panada')
    })

    async function test(p1) {
        try {
            let query = `INSERT INTO \`project.dataset.Quincy\` (id, col)
                        WITH array_to_loop_through AS (
                          SELECT id 
                          FROM UNNEST(GENERATE_ARRAY(1, 10, 1)) id
                        )
                        SELECT id, CONCAT('Rank: ', CAST(id AS STRING))
                        FROM array_to_loop_through`

            let auth = getBasicAuthObj()
            auth.setCredentials({
                access_token: "myAccessToken",
                refresh_token: "myRefreshToken"
            })

            let request = {
                "projectId": "myProject",
                auth,
                "resource": {
                    "projectId": "myProject",
                    "configuration": {
                        "query": {
                            query,
                            "useLegacySql": false
                        },
                        "dryRun": false
                    }
                }
            }

            console.log(`query is: ${query}`)

            let result = await callBQ(request)

            console.log(`Status is: ${result.data.status.state}`)
        } catch (err) {
            console.log("err", err)
        }
    }

    /**
     * Call BigQuery jobs.insert
     * @param request
     * @returns {Promise<*>}
     */
    async function callBQ(request) {
        try {
            // jobs.insert takes the request params (including auth) once
            let result = await bigQuery.jobs.insert(request)
            console.log(`All good.....`)

            return result
        } catch (e) {
            console.log(`Failed to run query: ${e}`)
        }
    }

    /**
     * Create oAuth object
     * @returns {OAuth2Client}
     */
    function getBasicAuthObj() {
        let clientId = 'myclientId'
        let clientSecret = 'mySecret'
        let redirectUrl = 'URL'

        return new google.auth.OAuth2(
            clientId,
            clientSecret,
            redirectUrl
        )
    }
})
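
For example, a minimal sketch of that parallel wrapper (the id ranges and the buildRequest helper are hypothetical; buildRequest would return a request object like the one above, with the GENERATE_ARRAY bounds substituted into the query):

// Sketch only: fan out several insert jobs and await them together.
// buildRequest(lo, hi) is a hypothetical helper that returns a request
// object like the one above, with GENERATE_ARRAY(lo, hi, 1) in the query.
async function insertInParallel() {
    let ranges = []
    for (let i = 0; i < 10; i++) {
        ranges.push([i * 100000 + 1, (i + 1) * 100000])
    }

    // Each range becomes its own jobs.insert call; the jobs run
    // concurrently, subject to the quota limits noted below.
    return Promise.all(
        ranges.map(([lo, hi]) => callBQ(buildRequest(lo, hi)))
    )
}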

Note: BigQuery has limits on concurrently running queries and DML inserts; see the BigQuery quota documentation for more details.

Tamir Klein
  • Hi @Toan, was this helpful to you in any way? – Tamir Klein Apr 10 '19 at 09:53
  • Do I have to run this code in an outside program? I just put it in my js file and errors showed up; also, BigQuery doesn't allow me to use require() in my file to link to an external library. And is this code faster compared to mine? – Toan Nguyen Apr 11 '19 at 03:47
  • Yes, the idea is to run this outside of BigQuery and add your own layer of parallelism using JS async and Promise.all capabilities. So, for example, you can have 10 promises running in parallel, all creating records in your target table for different sets of id ranges. – Tamir Klein Apr 11 '19 at 07:25
  • @ToanNguyen can you please also vote up this answer? It will also help you grow your reputation; see this link for how to do it: https://stackoverflow.com/help/someone-answers – Tamir Klein Apr 13 '19 at 05:50