A Requirements:
Let’s assume the simplified case/scenario below:
1 We have one "big" table:
TableAll
Row a b c
1 1 11 12
2 1 13 14
3 1 15 16
4 1 17 18
5 2 21 22
6 2 23 24
7 2 25 26
8 2 27 28
9 3 31 32
10 3 33 34
11 3 35 36
12 3 37 38
2 We need to split the data into separate "smaller" tables, partitioned by field "a"
TableA1
Row b c
1 11 12
2 13 14
3 15 16
4 17 18
TableA2
Row b c
1 21 22
2 23 24
3 25 26
4 27 28
TableA3
Row b c
1 31 32
2 33 34
3 35 36
4 37 38
3 Problem to address
The most straightforward way is to issue three separate statements, writing the output to TableA1, TableA2, and TableA3 respectively:
SELECT b, c FROM TableAll WHERE a = 1;
SELECT b, c FROM TableAll WHERE a = 2;
SELECT b, c FROM TableAll WHERE a = 3;
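In practice, each statement is run with its destination table set to the respective shard. As a minimal sketch, assuming the Standard SQL dialect and a placeholder dataset mydataset (equivalently, just set a destination table on each of the three queries):
-- hypothetical Standard SQL sketch: materialize one shard per statement
CREATE OR REPLACE TABLE mydataset.TableA1 AS
SELECT b, c FROM mydataset.TableAll WHERE a = 1;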
Pros: Fast and Furious!
Cons: We need as many scans of the whole table (each at full cost) as there are distinct values of "a" (in this particular case just three, but in real life it can be, let’s say, up to N = 1K distinct values).
So the final cost is $5 * N * SizeInTB(TableAll)
Our Target Goal
We want to minimize the cost as much as possible, ideally down to the fixed price of $5 * SizeInTB(TableAll)
B Possible Solution (Idea and simple implementation):
Logical Step 1 – transform the data to be presented as below (pack the columns into JSON)
Row a json
1 1 {"b":"11", "c":"12"}
2 1 {"b":"13", "c":"14"}
3 1 {"b":"15", "c":"16"}
4 1 {"b":"17", "c":"18"}
5 2 {"b":"21", "c":"22"}
6 2 {"b":"23", "c":"24"}
7 2 {"b":"25", "c":"26"}
8 2 {"b":"27", "c":"28"}
9 3 {"b":"31", "c":"32"}
10 3 {"b":"33", "c":"34"}
11 3 {"b":"35", "c":"36"}
12 3 {"b":"37", "c":"38"}
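By itself, this step is just the inner query that the Step 3 implementation below uses (legacy BigQuery SQL; the CONCAT simply builds the JSON string by hand):
-- pack columns b and c into a JSON string, keeping the partitioning field a
SELECT a, CONCAT("{\"b\":\"", STRING(b), "\", \"c\":\"", STRING(c), "\"}") AS json
FROM TableAll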
Logical Step 2 – pivot the table so that the values of field "a" become field names (prefixed with "a" to comply with the column-naming convention)
Row a1 a2 a3
1 {"b":"11", "c":"12"} null null
2 {"b":"13", "c":"14"} null null
3 {"b":"15", "c":"16"} null null
4 {"b":"17", "c":"18"} null null
5 null {"b":"21", "c":"22"} null
6 null {"b":"23", "c":"24"} null
7 null {"b":"25", "c":"26"} null
8 null {"b":"27", "c":"28"} null
9 null null {"b":"31", "c":"32"}
10 null null {"b":"33", "c":"34"}
11 null null {"b":"35", "c":"36"}
12 null null {"b":"37", "c":"38"}
Note: the size of the above data is of the same order as the size of the original table (without column a).
It is still bigger than the original data, because the data is now in verbose JSON format versus native data types plus column names.
This can be optimized by eliminating spaces and unneeded quotes, normalizing/shortening the original column names to just one character, etc.
I think this difference becomes negligible as N goes up (I haven't had a chance to evaluate this though); see the compacted sketch below.
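For illustration, a compacted variant of the JSON construction (a sketch of the optimizations above: spaces removed and numeric values left unquoted; JSON_EXTRACT_SCALAR in Step 4 returns the scalar either way):
-- hypothetical compact encoding: no spaces, no quotes around numeric values
SELECT a, CONCAT("{\"b\":", STRING(b), ",\"c\":", STRING(c), "}") AS json
FROM TableAll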
Step 3 – Persist the resulting pivot into table TableAllPivot
Implementation Example:
SELECT
IF(a=1, json, NULL) as a1,
IF(a=2, json, NULL) as a2,
IF(a=3, json, NULL) as a3
FROM (
  SELECT a, CONCAT("{\"b\":\"", STRING(b), "\", \"c\":\"", STRING(c), "\"}") AS json
FROM TableAll
)
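As a side note, in the Standard SQL dialect the same transform can be sketched more compactly with TO_JSON_STRING (an alternative added here for illustration, not part of the original legacy-SQL implementation):
-- Standard SQL sketch; TO_JSON_STRING(STRUCT(b, c)) yields e.g. {"b":11,"c":12}
SELECT
  IF(a = 1, TO_JSON_STRING(STRUCT(b, c)), NULL) AS a1,
  IF(a = 2, TO_JSON_STRING(STRUCT(b, c)), NULL) AS a2,
  IF(a = 3, TO_JSON_STRING(STRUCT(b, c)), NULL) AS a3
FROM TableAll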
Cost of Step 3: $5 * SizeInTB(TableAll)
Based on the comments in Step 2 above, assume: SizeInTB(TableAllPivot) = 2 * SizeInTB(TableAll)
Step 4 – Produce the shards by querying only one column per shard
To preserve the schema/data types, the respective shard tables can be created in advance, for example as below.
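Assuming Standard SQL DDL is available, b and c are integers, and mydataset is a placeholder dataset (a hypothetical sketch; the tables can equally be created via the UI or the bq tool):
-- fix each shard's schema/data types up front
CREATE TABLE mydataset.TableA1 (b INT64, c INT64);
CREATE TABLE mydataset.TableA2 (b INT64, c INT64);
CREATE TABLE mydataset.TableA3 (b INT64, c INT64);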
Data Extraction:
-- For TableA1:
SELECT
JSON_EXTRACT_SCALAR(a1, '$.b') AS b,
JSON_EXTRACT_SCALAR(a1, '$.c') AS c
FROM TableAllPivot
WHERE a1 IS NOT NULL
-- For TableA2:
SELECT
JSON_EXTRACT_SCALAR(a2, '$.b') AS b,
JSON_EXTRACT_SCALAR(a2, '$.c') AS c
FROM TableAllPivot
WHERE a2 IS NOT NULL
-- For TableA3:
SELECT
JSON_EXTRACT_SCALAR(a3, '$.b') AS b,
JSON_EXTRACT_SCALAR(a3, '$.c') AS c
FROM TableAllPivot
WHERE a3 IS NOT NULL
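Note that JSON_EXTRACT_SCALAR returns strings, so to land the data in the pre-created shard tables with their native types, a cast is needed; e.g., assuming b and c are integers (legacy SQL):
-- For TableA1: cast the extracted strings back to the shard's native types
SELECT
  INTEGER(JSON_EXTRACT_SCALAR(a1, '$.b')) AS b,
  INTEGER(JSON_EXTRACT_SCALAR(a1, '$.c')) AS c
FROM TableAllPivot
WHERE a1 IS NOT NULL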
Cost of Step 4: $5 * SizeInTB(TableAllPivot)
Total Cost: Step 3 Cost + Step 4 Cost
= $5 * SizeInTB(TableAll) + $5 * SizeInTB(TableAllPivot)
~ $5 * 3 * SizeInTB(TableAll)
Summary:
Proposed approach fixed price = $5 * 3 * SizeInTB(TableAll)
vs.
Initial linear price = $5 * N * SizeInTB(TableAll)
Please note: the 3 in the $5 * 3 * SizeInTB(TableAll) formula is not determined by the number of shards in my simplified example; rather, it is an estimated constant that mostly reflects the price of transforming the data to JSON. The number of shards doesn't matter here: the same formula holds for 100 shards, for 1K shards, and so on. The only limitation of this solution is 10K shards, as that is the hard limit on the number of columns in one table.
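To make the difference concrete, plug sample numbers into the two formulas above, say a 1 TB source table and N = 1,000 distinct values of "a":
Initial approach: $5 * 1000 * 1 = $5,000
Proposed approach: $5 * 3 * 1 = $15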
C Some helper code and references:
1 Produce the Pivoting Query (its result is used in Step 3 of the section above)
This can be useful when the number of distinct values of "a" in the initial table grows beyond, let's say, 10-20 and typing the query manually becomes tedious; in that case you can use the script/query below, which for the three-value example generates the Step 3 query.
SELECT 'SELECT ' +
GROUP_CONCAT_UNQUOTED(
'IF(a=' + STRING(a) + ', json, NULL) as a' + STRING(a)
)
+ ' FROM (
SELECT a,
CONCAT("{\\\"b\\\":\\\"\",STRING(b),"\\\","," \\\"c\\\":\\\"\", STRING(c),"\\\"}") AS json
FROM TableAll
)'
FROM (
SELECT a FROM TableAll GROUP BY a
)
2 If you want to explore this option further, see also the references below to related and potentially useful code:
Pivot Repeated fields in BigQuery
How to scale Pivoting in BigQuery?
How to extract all the keys in a JSON object with BigQuery