
I'm working on building a Redshift database by listening to events from different sources and pumping that data into a Redshift cluster.

The idea is to use Kinesis Firehose to pump data into Redshift using the COPY command. But I have a dilemma here: I first want to query some information from Redshift using a select query such as the one below:

select A, B, C from redshift__table where D='x' and E = 'y';

After getting the required information from Redshift, I will combine it with my event notification data and issue a request to Kinesis. Kinesis will then do its job and issue the required COPY command.
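
For reference, the COPY command that Firehose would issue on my behalf looks roughly like this (the table, bucket, and IAM role names here are made up):

copy events_flat
from 's3://my-firehose-bucket/events/'
iam_role 'arn:aws:iam::123456789012:role/hypothetical-firehose-role'
format as json 'auto';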

Now my question is: is it a good idea to query Redshift repeatedly, say every second, since that is the expected interval at which I will receive event notifications?

Now let me describe an alternate scenario:

If I normalize my table and separate out some fields into a separate table, then I will have to perform fewer Redshift queries with the normalized design (maybe once every 30 seconds).

But the downside of this approach is that once the data is in Redshift, I will have to carry out table joins while performing real-time analytics on it.
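
To make the normalized design concrete, here is a rough sketch; the table and column names beyond A-E are placeholders, and the key choices are only illustrative:

create table lookup_info (
    lookup_id bigint,
    A varchar(64),
    B varchar(64),
    C varchar(64)
)
diststyle all          -- small lookup table, replicated to every node
sortkey (lookup_id);

create table events (
    event_time timestamp,
    lookup_id bigint,  -- reference into lookup_info
    D varchar(64),
    E varchar(64)
)
distkey (lookup_id)    -- illustrative; pick based on actual join/filter patterns
sortkey (event_time);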

So I wish to know, at a high level, which approach would be better:

  1. Have a single flat table, but query it before issuing a request to Kinesis on each event notification. There won't be any table joins while performing analytics.

  2. Have two tables and query Redshift less often, but perform a table join while displaying results using BI/analytical tools.

Which of these two do you think is the better option? Let us assume that I will use appropriate sort keys/distribution keys in either case.

paratrooper
  • What *exactly* do you mean by ["normalization"](https://stackoverflow.com/a/40640962/3404097), both in general & here? Likely "to 1NF" eliminating "non-atomic" values? Alas: "best" (1) means nothing until the asker defines it and (2) depends on so many high-level through low-level details (usage, implementation, resources, cost-benefit) it must be tested empirically. "Appropriate key distribution" is the right direction. Seek to pin down "best" & 2 designs to guesstimate & test on your workflow & setup. – philipxy Jun 18 '17 at 08:57
  • "is it a good idea to repeatedly query redshift like say every second" - Redshift is not an OLTP style database so is optimised for fewer very large queries, not many very small queries. You might find that the response time in this scenario is not fast enough for one query per second. You would also need to consider *concurrency* (number of concurrent queries permitted) - there is a maximum concurrency of 50 across the cluster so your process would be using up one of the available slots almost permanently. http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html – Nathan Griffiths Jun 19 '17 at 01:58

1 Answer


I'd definitely go with your second option, which involves joining tables at query time. That's what Amazon Redshift is good at (especially if you have your SORTKEY and DISTKEY set correctly).

Let the streaming data come into Redshift in the most efficient manner possible, then join when running queries. You'll issue far fewer queries that way.
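
As a rough sketch, using the hypothetical two-table layout from your question, the analytics query would look something like:

select e.event_time, l.A, l.B, l.C
from events e
join lookup_info l
  on l.lookup_id = e.lookup_id
where e.D = 'x' and e.E = 'y';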

Alternatively, you could run a regular job (e.g. hourly) to batch-process the data into a wide table. It depends on how quickly you need to query the data after loading.
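
A minimal sketch of such a job, again with hypothetical table names, could be an insert ... select that materializes the join into a wide table:

insert into events_wide (event_time, A, B, C, D, E)
select e.event_time, l.A, l.B, l.C, e.D, e.E
from events e
join lookup_info l on l.lookup_id = e.lookup_id
where e.event_time >= dateadd(hour, -1, getdate());
-- naive time window for illustration; a real job would track a high-water mark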

John Rotenstein
  • I'm also inclined to the second option, so I guess that is the way to go and I'm accepting your answer. I just wanted a head start; I will probably have to test it empirically too. Thanks! – paratrooper Jun 19 '17 at 05:20