
I'm building a Spark job to migrate 6 years of e-commerce orders/events into our new BigQuery warehouse, to complement the new streaming capability. The migrated data is in the same model as the new data, and views will expose parts of it as needed.

We have 4 datasets:

data (raw data, hierarchical JSON)
views (like dims over the raw data, generally flat but not always)
pre_aggs (e.g. currency-converted raw totals with some key rollup dims included, flat)
reports (as you would think)

Unlike some of the new live data, which is streamed in and unbounded, the migration of this historical data is batch and bounded (I don't have to concern myself with late-arriving events, watermarking or duplicates, for example). I can also partition the data manually by event time (orderPlacedTimestamp) and persist it into the correct date-partitioned table (suffix?). The full data is about 2 GB and 2 million rows compressed in BQ, so not massive, but fairly complex structures with unnesting needed in the view layer. I have the option to write the raw data and aggregates as materialised tables from Spark, so I really want to do this in line with best practice and optimise for performance (query speed is more important and worth paying a bit extra for).
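
To make that concrete, the pre_aggs materialisation over the raw layer would be roughly along these lines (just a sketch; totalAmount and currencyCode are placeholder field names standing in for my real model, and I've left out the actual currency conversion):

#standardSQL
-- Sketch of a pre_aggs rollup over the raw order data (placeholder names)
SELECT
  DATE(`order`.orderPlacedTimestamp) AS order_date,
  `order`.currencyCode AS currency,
  SUM(`order`.totalAmount) AS gross_total,
  COUNT(*) AS order_count
FROM `project.data.orders_raw`
GROUP BY order_date, currency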

I found this really good blog post on a SQL/query lambda architecture, which is some inspiration, and I will attempt to do something similar.

https://www.linkedin.com/pulse/building-robust-real-time-etl-google-bigquery-lambda-novozhilov

I'm still wondering, though, how best to store/partition this data and then construct time-based queries to match. Week-on-week and month-on-month reports are likely to be the most common.

My options seem to be:

  1. Everything in one table - seems nice and simple with no ongoing table management over time, but it means a full scan for every query, when often I just want to go back a year or two at most from a point in time.

  2. One table per time segment, e.g. yearly or monthly:

order_history_2017 or order_history_201701

We have a calendar lookup dim; each row has keys that the suffix above could map to, e.g. 201701 for Jan 2017.

Monthly would mean 72 tables, which seems quite a lot; maybe yearly is better?

For argument's sake, say it's monthly tables: what is the best way in BQ (standard SQL) to query the right set of suffixed tables covering a contiguous timeline, constructing the table names on the fly (dynamically, perhaps)?

e.g. say I want to query all orders (an order has an orderPlacedTimestamp) between 2017-01-10 and 2017-02-10 - this would mean scanning (and unioning?) only the order_history_201701 and order_history_201702 tables in this case, then doing a BETWEEN like below:

SELECT *
FROM order_history_201701
WHERE `order`.orderPlacedTimestamp BETWEEN DATE("2017-01-10") AND DATE("2017-02-10")
UNION ALL
SELECT *
FROM order_history_201702
WHERE `order`.orderPlacedTimestamp BETWEEN DATE("2017-01-10") AND DATE("2017-02-10")
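
Standard SQL wildcard tables look like they might avoid hand-writing the UNION; something like this perhaps (untested sketch, assuming the monthly suffix naming above and that orderPlacedTimestamp is a TIMESTAMP):

#standardSQL
-- Only the monthly shards whose suffix falls in the range get scanned
SELECT *
FROM `project.dataset.order_history_*`
WHERE _TABLE_SUFFIX BETWEEN '201701' AND '201702'
  AND `order`.orderPlacedTimestamp BETWEEN TIMESTAMP("2017-01-10") AND TIMESTAMP("2017-02-10")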

I might then get scenarios where this historic data needs unioning with the 'live' (streaming) data too, wrapped up in a view like in the article on the lambda design (rough sketch after this list).

  3. An option I haven't thought of.
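
For the lambda-style union with the live streamed data mentioned above, I'm picturing a view whose body is roughly like this (untested; orders_streaming is just a placeholder name, and it assumes the historic and live schemas line up):

#standardSQL
-- Historic monthly shards unioned with the live streaming table behind one view
SELECT * FROM `project.dataset.order_history_*`
UNION ALL
SELECT * FROM `project.dataset.orders_streaming`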

So many options with BQ! :)

Anyway, that is my current thinking; any words of wisdom on table design and optimised query construction would be hugely appreciated.

Thanks heaps all!

Kurt Maile
  • Could you not simply use BigQuery's built-in partitioning feature (you see one table, but don't pay for full table scans)? - https://cloud.google.com/bigquery/docs/partitioned-tables – Graham Polley Jul 16 '17 at 03:14
  • Hi Graham, thanks for the comment. Unless I've missed some key features (I'll go back to the docos to confirm now, maybe I have!), partitioning is based on INGEST/LOAD time, which isn't suitable for the historic 'real date' partitioning I need. Unless you can set a custom field in your data to be used as the partition time (e.g. orderPlacedTime - true event time), I can't see the value in this approach for this use case? But as mentioned, I might have this wrong and will go back and confirm my understanding either way. Once again thanks for replying! :) – Kurt Maile Jul 16 '17 at 06:27
  • You can specify the partition yourself using `$YYYYMMDD` notation (_partition decorator_), therefore putting the data into the desired partition. We use this technique for streaming real-time and loading historical data into the correct partition based on the data's own time field(s). https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition – Graham Polley Jul 16 '17 at 06:31
  • As far as I know you can backdate the partition decorator to a max of 1-2 years, but not 6 years. This is something I read in one of the previous SO posts, so it might be unreliable info. – Pentium10 Jul 16 '17 at 07:40
  • That's true. But there are a few techniques for working around this limitation (my preference is to use Dataflow). https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables – Graham Polley Jul 16 '17 at 08:41
  • Hey thanks Graham - your comment on decorators is super helpful; I thought this was only for legacy, which was the case when I started looking at this. Thanks! I'll have to look at how to deal with not being able to go back 6 years though - Pentium, thanks for the link. Dataflow looks interesting for sure, but for now we will continue to use Spark. – Kurt Maile Jul 16 '17 at 14:00
  • @felipe-hoffa thoughts on the best way to go back more than 1-2 years (not using Dataflow though :) ? – Kurt Maile Jul 16 '17 at 14:01
  • @mikhail-berlyant thoughts? That Stack Overflow link is a little hard to follow and a little old (some great chat in it, though). Has there been an update on this at all, so that those limitations are no more and going back multiple years is possible with Cloud Dataflow? – Kurt Maile Jul 16 '17 at 14:10

1 Answer


My recommendation is to seriously consider the native BigQuery functionality of partitioned tables.

While streaming to partitioned tables has some limitations (you can only stream into partitions within the last 30 days in the past and 5 days in the future, relative to the current date), there are no such limitations for load or query jobs.

Quite a while ago I considered using this option as a workaround for partitioning by a column rather than by date, by mapping attribute values to dates between '0001-01-01' and '9999-12-31' (3,652,058 days, and thus that many distinct attribute values to partition by). The POC was conceptually successful, but I still didn't like it, especially given the strong promise (at least that's how it felt at the time) from the Google team to introduce partitioning by a date or int column of the table. I decided to wait for that.

Meanwhile, to refresh and double-check that load and query jobs into a partitioned table still have no partition-specific limitations (as one SO post somehow suggested), I did a quick test, and you can see the result below.

Step 1 - Create a partitioned table - project.dataset.partitioned_table. I just used the UI for this.

Step 2 - Insert query results into different partitions of the table

#standardSQL
SELECT 13 AS a

with project.dataset.partitioned_table$YYYYMMDD as the destination table (you can also use DML's INSERT for this)

I ran this a few times for a range of dates between AD (0001-01-01) and the end of the calendar (9999-12-31).
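
For completeness, the DML flavour of the same step could look roughly like below (just a sketch; it reuses the single column a from above and addresses the target partition via the _PARTITIONTIME pseudo column):

#standardSQL
-- Write a row directly into the 2001-01-01 partition of the
-- ingestion-time partitioned table
INSERT INTO `project.dataset.partitioned_table` (_PARTITIONTIME, a)
VALUES (TIMESTAMP('2001-01-01'), 8)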

Step 3 - Check result

#standardSQL
SELECT DATE(_partitiontime) AS partition_, a
FROM `project.dataset.partitioned_table`
ORDER BY a

The result was (note that the format of partition_ here is YYYY-MM-DD):

partition_      a
----------      --
2017-07-16      1
2017-07-16      2
2017-07-16      3
2017-07-16      4
2017-07-15      5
2017-07-14      6
2010-01-01      7
2001-01-01      8
1001-01-01      9
0001-01-01      10
4001-01-01      11
7001-01-01      12
9999-12-31      13
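
Coming back to your case: once the historical rows sit in their correct daily partitions, a date-range query only scans the matching partitions, roughly like below (a sketch, assuming a day-partitioned order_history table):

#standardSQL
-- Only the daily partitions between 2017-01-10 and 2017-02-10 are scanned
SELECT *
FROM `project.dataset.order_history`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-10') AND TIMESTAMP('2017-02-10')
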
Mikhail Berlyant
  • What technique/pattern do you use for getting around the max 2500 partitions limit Mikhail (i.e. ~6 years)? - https://cloud.google.com/bigquery/docs/partitioned-tables#partitioned_table_limits – Graham Polley Jul 16 '17 at 23:33
  • ... and not only the `2500 max partitions` limit, but also the **Daily limit:** `2,000 partition updates per table, per day` and the **Rate limit:** `50 partition updates every 10 seconds`! It depends, but I think in most practical cases like in this post there should be one day-partitioned table per year (assuming the raw data). – Mikhail Berlyant Jul 17 '17 at 02:29
  • We should strongly consider asking for this limit to be increased. – Pentium10 Jul 17 '17 at 06:56
  • Great discussion guys - so just to clarify then, Mikhail: are you saying that to get around the partition limit you would have one table per year (e.g. customers_2017, customers_2016, etc.), each of which is itself a date (day) partitioned table, thus ensuring the limits aren't breached when going back in time quite a way, as they would be if you only had one table, e.g. customers? – Kurt Maile Jul 17 '17 at 08:56
  • Yes. One table per year is an option for now to keep the number of partitions within the limits. – Mikhail Berlyant Jul 17 '17 at 12:07
  • sure. if not yet - please consider voting up :o) – Mikhail Berlyant Jul 19 '17 at 18:55