A Requirements:
Let’s assume the simplified case/scenario below:
1 We have one "big" table:
TableAll
Row a b c
1 1 11 12
2 1 13 14
3 1 15 16
4 1 17 18
5 2 21 22
6 2 23 24
7 2 25 26
8 2 27 28
9 3 31 32
10 3 33 34
11 3 35 36
12 3 37 38
2 We need to split the data into separate "smaller" tables, partitioned by field "a"
TableA1
Row b c
1 11 12
2 13 14
3 15 16
4 17 18
TableA2
Row b c
1 21 22
2 23 24
3 25 26
4 27 28
TableA3
Row b c
1 31 32
2 33 34
3 35 36
4 37 38
3 Problem to address
The most straightforward way is to issue three separate statements, writing the output to TableA1, TableA2, and TableA3 respectively:
SELECT b, c FROM TableAll WHERE a = 1;
SELECT b, c FROM TableAll WHERE a = 2;
SELECT b, c FROM TableAll WHERE a = 3;
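In practice, each statement is run with its destination table set to the respective shard. As a minimal sketch, assuming the Standard SQL dialect and a placeholder dataset mydataset (equivalently, just set a destination table on each of the three queries):
-- hypothetical Standard SQL sketch: materialize one shard per statement
CREATE OR REPLACE TABLE mydataset.TableA1 AS
SELECT b, c FROM mydataset.TableAll WHERE a = 1;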
Pros: Fast and Furious!
Cons: We need as many scans of the whole table (each at full cost) as there are distinct values of "a" (in this particular case just three, but in real life it can be, let’s say, up to N = 1K distinct values).
So the final cost is $5 * N * SizeInTB(TableAll)
Our Target Goal
We want to minimize the cost as much as possible, ideally down to the fixed price of $5 * SizeInTB(TableAll)
B Possible Solution (Idea and simple implementation):
Logical Step 1 – transform the data to be presented as below (pack the columns into JSON)
Row a json
1 1 {"b":"11", "c":"12"}
2 1 {"b":"13", "c":"14"}
3 1 {"b":"15", "c":"16"}
4 1 {"b":"17", "c":"18"}
5 2 {"b":"21", "c":"22"}
6 2 {"b":"23", "c":"24"}
7 2 {"b":"25", "c":"26"}
8 2 {"b":"27", "c":"28"}
9 3 {"b":"31", "c":"32"}
10 3 {"b":"33", "c":"34"}
11 3 {"b":"35", "c":"36"}
12 3 {"b":"37", "c":"38"}
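By itself, this step is just the inner query that the Step 3 implementation below uses (legacy BigQuery SQL; the CONCAT simply builds the JSON string by hand):
-- pack columns b and c into a JSON string, keeping the partitioning field a
SELECT a, CONCAT("{\"b\":\"", STRING(b), "\", \"c\":\"", STRING(c), "\"}") AS json
FROM TableAll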
Logical Step 2 – pivot the table so that the values of field "a" become field names (prefixed with "a" to comply with the column-naming convention)
Row a1 a2 a3
1 {"b":"11", "c":"12"} null null
2 {"b":"13", "c":"14"} null null
3 {"b":"15", "c":"16"} null null
4 {"b":"17", "c":"18"} null null
5 null {"b":"21", "c":"22"} null
6 null {"b":"23", "c":"24"} null
7 null {"b":"25", "c":"26"} null
8 null {"b":"27", "c":"28"} null
9 null null {"b":"31", "c":"32"}
10 null null {"b":"33", "c":"34"}
11 null null {"b":"35", "c":"36"}
12 null null {"b":"37", "c":"38"}
Note: the size of the above data is of the same order as the size of the original table (without column a).
It is still bigger than the original data, because the data is now in verbose JSON format versus native data types plus column names.
This can be optimized by eliminating spaces and unneeded quotes, normalizing/shortening the original column names to just one character, etc.
I think this difference becomes negligible as N goes up (I haven't had a chance to evaluate this though); see the compacted sketch below.
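For illustration, a compacted variant of the JSON construction (a sketch of the optimizations above: spaces removed and numeric values left unquoted; JSON_EXTRACT_SCALAR in Step 4 returns the scalar either way):
-- hypothetical compact encoding: no spaces, no quotes around numeric values
SELECT a, CONCAT("{\"b\":", STRING(b), ",\"c\":", STRING(c), "}") AS json
FROM TableAll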
Step 3 – Persist the resulting pivot into table TableAllPivot
Implementation Example:
SELECT
IF(a=1, json, NULL) as a1,
IF(a=2, json, NULL) as a2,
IF(a=3, json, NULL) as a3
FROM (
  SELECT a, CONCAT("{\"b\":\"", STRING(b), "\", \"c\":\"", STRING(c), "\"}") AS json
FROM TableAll
)
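As a side note, in the Standard SQL dialect the same transform can be sketched more compactly with TO_JSON_STRING (an alternative added here for illustration, not part of the original legacy-SQL implementation):
-- Standard SQL sketch; TO_JSON_STRING(STRUCT(b, c)) yields e.g. {"b":11,"c":12}
SELECT
  IF(a = 1, TO_JSON_STRING(STRUCT(b, c)), NULL) AS a1,
  IF(a = 2, TO_JSON_STRING(STRUCT(b, c)), NULL) AS a2,
  IF(a = 3, TO_JSON_STRING(STRUCT(b, c)), NULL) AS a3
FROM TableAll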
Cost of Step 3: $5 * SizeInTB(TableAll)
Based on the comments in Step 2 above, assume: SizeInTB(TableAllPivot) = 2 * SizeInTB(TableAll)
Step 4 – Produce the shards by querying only one column per shard
To preserve the schema/data types, the respective shard tables can be created in advance, for example as below.
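Assuming Standard SQL DDL is available, b and c are integers, and mydataset is a placeholder dataset (a hypothetical sketch; the tables can equally be created via the UI or the bq tool):
-- fix each shard's schema/data types up front
CREATE TABLE mydataset.TableA1 (b INT64, c INT64);
CREATE TABLE mydataset.TableA2 (b INT64, c INT64);
CREATE TABLE mydataset.TableA3 (b INT64, c INT64);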
Data Extraction:
-- For TableA1:
SELECT
JSON_EXTRACT_SCALAR(a1, '$.b') AS b,
JSON_EXTRACT_SCALAR(a1, '$.c') AS c
FROM TableAllPivot
WHERE a1 IS NOT NULL
-- For TableA2:
SELECT
JSON_EXTRACT_SCALAR(a2, '$.b') AS b,
JSON_EXTRACT_SCALAR(a2, '$.c') AS c
FROM TableAllPivot
WHERE a2 IS NOT NULL
-- For TableA3:
SELECT
JSON_EXTRACT_SCALAR(a3, '$.b') AS b,
JSON_EXTRACT_SCALAR(a3, '$.c') AS c
FROM TableAllPivot
WHERE a3 IS NOT NULL
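Note that JSON_EXTRACT_SCALAR returns strings, so to land the data in the pre-created shard tables with their native types, a cast is needed; e.g., assuming b and c are integers (legacy SQL):
-- For TableA1: cast the extracted strings back to the shard's native types
SELECT
  INTEGER(JSON_EXTRACT_SCALAR(a1, '$.b')) AS b,
  INTEGER(JSON_EXTRACT_SCALAR(a1, '$.c')) AS c
FROM TableAllPivot
WHERE a1 IS NOT NULL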
Cost of Step 4: $5 * SizeInTB(TableAllPivot)
Total Cost: Step 3 Cost + Step 4 Cost
= $5 * SizeInTB(TableAll) + $5 * SizeInTB(TableAllPivot)
~ $5 * 3 * SizeInTB(TableAll)
Summary:
Proposed approach fixed price = $5 * 3 * SizeInTB(TableAll)
vs.
Initial linear price = $5 * N * SizeInTB(TableAll)
Please note: the 3 in the $5 * 3 * SizeInTB(TableAll) formula is not determined by the number of shards in my simplified example; rather, it is an estimated constant that mostly reflects the price of transforming the data to JSON. The number of shards doesn't matter here: the same formula holds for 100 shards, for 1K shards, and so on. The only limitation of this solution is 10K shards, as that is the hard limit on the number of columns in one table.
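To make the difference concrete, plug sample numbers into the two formulas above, say a 1 TB source table and N = 1,000 distinct values of "a":
Initial approach: $5 * 1000 * 1 = $5,000
Proposed approach: $5 * 3 * 1 = $15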
C Some helper code and references:
1 Produce the Pivoting Query (its result is used in Step 3 of the section above)
This can be useful when the number of distinct values of "a" in the initial table grows beyond, let's say, 10-20 and typing the query manually becomes tedious; in that case you can use the script/query below, which for the three-value example generates the Step 3 query.
SELECT 'SELECT ' +
GROUP_CONCAT_UNQUOTED(
'IF(a=' + STRING(a) + ', json, NULL) as a' + STRING(a)
)
+ ' FROM (
SELECT a,
CONCAT("{\\\"b\\\":\\\"\",STRING(b),"\\\","," \\\"c\\\":\\\"\", STRING(c),"\\\"}") AS json
FROM TableAll
)'
FROM (
SELECT a FROM TableAll GROUP BY a
)
2 If you want to explore this option further, see also the references below to related and potentially useful code:
Pivot Repeated fields in BigQuery
How to scale Pivoting in BigQuery?
How to extract all the keys in a JSON object with BigQuery