I have recently been looking for a way to insert massive amounts of data into a BigQuery table using a UDF. So I first tried the recommended method, like this:
#standardSQL
INSERT INTO `project.dataset.Quincy` (id, col)
WITH array_to_loop_through AS (
  SELECT id
  FROM UNNEST(GENERATE_ARRAY(1, 1000, 1)) id
)
SELECT id, CONCAT('Rank: ', CAST(id AS STRING))
FROM array_to_loop_through
This took 8 seconds to add 1 million values to the table. So I applied the same approach with my UDF:
CREATE TEMPORARY FUNCTION myFunc()
RETURNS ARRAY<STRING>
LANGUAGE js AS
"""
  var a = [];
  for (var i = 0; i < 50; i++) {
    a.push(randomString(12));
  }
  return a;
"""
OPTIONS (
  library="gs://kaneki110299/tester.js"
);
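For context, randomString in tester.js is just a small helper along these lines (this is a rough sketch of what the library contains, not the exact file):

// Approximate contents of gs://kaneki110299/tester.js (sketch only):
// returns a random alphanumeric string of the given length.
function randomString(length) {
  var chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  var result = '';
  for (var i = 0; i < length; i++) {
    result += chars.charAt(Math.floor(Math.random() * chars.length));
  }
  return result;
}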
#standardSQL
INSERT INTO `Lambert.fortune` (password)
WITH array_to_loop_through AS (
  SELECT *
  FROM UNNEST(myFunc()) id
)
SELECT CONCAT(CAST(id AS STRING))
FROM array_to_loop_through
When I run this query on BigQuery, it runs for about 5 minutes and then hits a UDF timeout, even though it only generates 50 values. The same error happened when I moved the loop into tester.js.
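For reference, that in-library variant looked roughly like this (a sketch from memory; generateStrings is just a placeholder name, and it calls the randomString helper shown above):

// Rough sketch of the variant with the loop inside tester.js:
function generateStrings(count, length) {
  var a = [];
  for (var i = 0; i < count; i++) {
    a.push(randomString(length));
  }
  return a;
}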
So I tried a different way:
CREATE TEMPORARY FUNCTION myFunc()
RETURNS STRING
LANGUAGE js AS
"""
  return randomString(12);
"""
OPTIONS (
  library="gs://kaneki110299/tester.js"
);
#standardSQL
INSERT INTO `Lambert.fortune` (password)
VALUES (myFunc()), (myFunc()), (myFunc()) ... -- 1000 times
Unlike the previous query, this one takes only 30 seconds to add 1,000 values produced by my UDF to the table. It looks like the loop inside the UDF doesn't run well, or at least not fast, on BigQuery.
Does BigQuery support parallelism, or any other way to optimize CPU performance, when running a user-defined function to insert massive amounts of data into a dataset? I'm trying to add 1 billion values to the table, so the last method I used doesn't seem practical.