
Background

We're probably going to use BigQuery to store our immutable business events so that we can replay them later to other services. One approach would be to essentially store each event as a blob (with some metadata). To replay them easily it would of course be nice to maintain a global order of our events and persist each event to the same table in BigQuery. We have something like 10 events per second, which is nowhere near BigQuery's limit of 100,000 messages per second.
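
Concretely, I'm imagining something like this in standard SQL DDL (column names are just illustrative):

    -- One row per event: ordering metadata plus the opaque blob payload.
    CREATE TABLE mydataset.events (
      event_id   STRING    NOT NULL,  -- globally unique event id
      event_type STRING    NOT NULL,  -- e.g. the topic or domain event name
      event_time TIMESTAMP NOT NULL,  -- used to maintain the global order
      payload    BYTES                -- the event blob itself
    );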

Question

  1. Would it be ok to simply persist all events in the same table?
  2. Would it perhaps be better to shard messages in different tables (perhaps based on event type, topic or date)?
  3. If (2), is it possible to join/scan through multiple tables sorted by time so that it's possible to replay events in the same order?
Johan

2 Answers


I prefer to create a table per event type and store the timestamp in each event table. You can join the tables using relationships (primary and foreign keys). Since events are stored on a time basis, you can replay them as well.
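
For example, a replay across several per-type tables could look something like this (table and column names are hypothetical, and the UNION ALL merge is just one way to interleave the tables by time):

    -- Merge hypothetical per-type tables and replay them in time order.
    SELECT event_id, event_time, payload
    FROM (
      SELECT event_id, event_time, payload FROM mydataset.events_order_created
      UNION ALL
      SELECT event_id, event_time, payload FROM mydataset.events_order_cancelled
    ) AS all_events
    ORDER BY event_time;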

Points you must remember:

  1. Immutable business events give you concurrency: once an event has been accepted and committed, it becomes unalterable and can be copied everywhere.
  2. The only way to “undo” an event is to add a compensating event on top, like a negative transaction in accounting (sketched below).
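
A rough sketch of such a compensating event as an append (table and column names are hypothetical):

    -- Append a compensating event instead of mutating the original row.
    INSERT INTO mydataset.events_payment (event_id, event_time, amount, compensates)
    VALUES ('evt-124', CURRENT_TIMESTAMP(), -100.00, 'evt-123');  -- reverses evt-123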

Hope it's useful to you.

Anburaj_N
  • Could you elaborate a bit on what you mean by "relationship" in this context? I don't see why I would need a relationship between my events? – Johan Jan 03 '16 at 14:53
  • You can use it without a relationship, Johan, but if you use one you will have less duplicate data, which will give you better performance. – Anburaj_N Jan 03 '16 at 15:19
  • Do I understand you correctly that you mean that I should store the event time (as id?) in the "event table" but the actual event in another table based on its type (i.e. "sharded" by event type)? And when I want to replay the events I start from a date/timestamp in the "event table" and join it with the actual event from the corresponding "event type" table? – Johan Jan 03 '16 at 16:22

If your primary usage scenario is to store events and then replay them, there is no reason to split different event types into different tables, especially since each event is an opaque blob. Keeping them all in the same table has the small benefit that you can still do analysis by event type and other metadata. Sharding by day makes sense, especially if you will mostly be looking at the most recent data; this will help you keep your BigQuery query costs down.

But I am worried about your requirement of replaying events in order. There is no clustered index in BigQuery, so every time you need to replay your events you will have to use "ORDER BY timestamp" in your query, and that can scale only to a relatively small amount of data (tens of megabytes). So if you want to replay a lot of events, this design won't work for you.
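
To make the day-sharding concrete, a replay in the legacy SQL of the time might look like this (the events_YYYYMMDD table naming and the timestamp column are assumptions; the ORDER BY is exactly the part that limits how much data this can cover):

    -- Scan three day-sharded tables (events_20160101 .. events_20160103)
    -- and replay their rows in time order.
    SELECT payload, timestamp
    FROM TABLE_DATE_RANGE([mydataset.events_],
                          TIMESTAMP('2016-01-01'),
                          TIMESTAMP('2016-01-03'))
    ORDER BY timestamp;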

Mosha Pasumansky
  • Is it possible to use some sort of "table scanning" and avoid "order by timestamp"? I.e. stream all events in the table by insertion order? My hope is that this would work if I insert them in the same order that they are received. But that implies (I assume) that I need to keep all events in the same table? – Johan Jan 03 '16 at 18:00
  • No, it is not possible to do a table scan using insertion time as ordering. You will really have to have a timestamp field if you need ordering. – Mosha Pasumansky Jan 03 '16 at 18:02
  • Thanks for the clarification. Our primary use case would be to store the events for the sake of replay. And for this use case ordering (by insertion or timestamp) is important. Would you say that BigQuery is perhaps not a good fit for this? – Johan Jan 03 '16 at 20:51
  • Yes, based on your requirements - I would say BigQuery is not a good fit. – Mosha Pasumansky Jan 04 '16 at 03:53
  • I know this is off topic but is there another managed datastore (preferably on the google cloud) that would be better suited for this? – Johan Jan 04 '16 at 06:59
  • Here's an idea: you could use BigQuery to collect one day's worth of data (with timestamps), then sort and export the data to GCS each day. Replay could be done from the data in Google Cloud Storage. BigQuery solves the fan-in-and-append problem for you, and GCS allows streaming the blob reads so you get replay. You can tune the "1 day's worth of data" as small as needed to allow the BigQuery sort operation to work (see the sketch after this thread). – Michael Sheldon Jan 04 '16 at 18:27
  • @MichaelSheldon Good idea, I was actually thinking of something similar as an alternative. Another option I'm thinking of is to pipe our events to dataflow and from there aggregate them per daily basis (if possible) and then store them on GCS. – Johan Jan 05 '16 at 05:30
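
A sketch of Michael Sheldon's daily sort-and-export pipeline, written with today's EXPORT DATA statement (which postdates this thread; in 2016 the equivalent was a query into a destination table followed by a separate extract job; bucket, table and column names are all illustrative):

    -- Sort one day's worth of events and export them to GCS for replay.
    -- Note: row order across multiple exported files is not guaranteed, so this
    -- assumes the daily batch is small enough to land in a single file.
    EXPORT DATA OPTIONS(
      uri = 'gs://my-replay-bucket/events/2016-01-04/*.json',
      format = 'JSON',
      overwrite = true
    ) AS
    SELECT event_id, event_type, event_time, payload
    FROM mydataset.events
    WHERE DATE(event_time) = '2016-01-04'
    ORDER BY event_time;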