Redshift Sort Keys For Joins

Question

I've read various posts and am still unclear. With a star schema, I would think that if I drive a query off a dimension table, say d_article, I end up with a set of SKs (sk_article) that are used to query/probe the main fact table. So, it makes sense to set sort keys on the fields commonly used in the Where clause on that dim table.

Next...and here's what I can't find an example or answer...should I include sk_article in a sort key in the fact table? More specifically, should I create an interleaved sort key with all the various SKs since we don't always use the same ones to join to the fact table?

I have seen no reference to including sort keys for use in Joins, only.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html

Amazon Redshift Foreign Keys - Sort or Interleaved Keys

score 1 · Answer 1 · edited Oct 10 '22 at 18:37

1

Redshift Sort Key

Sort keys are just for sorting purpose, not for joining purpose. There can be multiple columns defined as Sort Keys. Data stored in the table can be sorted using these columns. The query optimizer uses this sort ordered table while determining optimal query plans.

Also, as Tony commented,

Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins.

Every type of those keys has specific purpose. This may be good read for you.

For joining, fact and dimension tables, you should be using distribution key.

Redshift Distribution Keys (DIST Keys)

It determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node. Here is good read for you.

I hope this answers your question.

A good video session is here, that may be really helpful in understanding SORT VS DIST Key.

edited Oct 10 '22 at 18:37

Faiz

5,331
10
45
57

answered Oct 30 '18 at 09:35

Red Boy

5,429
3
28
41

"Sort keys are just for sorting purpose, not for joining purpose." >>Yes, but we are still probing the fact table for a subset of records. Why wouldn't the SKs retrieved from the dim tables not be used to filter the records in the fact table? I mean, if I took the list of those SKs and put them into Where clause on the fact table without the join, having a sort key on that SK column would speed up retrieval, correct? – Eric Oct 30 '18 at 12:12
To that point yes it will help, because the column is sorted, hence would speed up, but primary purpose is for sorting. While distribution keys really trick for joins, you have to distribute your data properly for dim tables, to minimize the data move from one node to another by Redshift while executing your join queries. – Red Boy Oct 30 '18 at 12:20
Yes, I understand the importance of using distribution keys. But to the point of sort keys, it sounds like putting an interleaved sort key on the SK fields in my fact table should speed things up (in theory)? – Eric Oct 30 '18 at 12:31
2

Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins. – Tony Gibbs Oct 31 '18 at 19:19
@TonyGibbs Thank you for point. Yes, I agreed with you point, editing my answer and adding this additional information. – Red Boy Nov 01 '18 at 07:26
OK. So, no point putting the SKs in the fact article table into an interleaved sort key since it won't be used on joins. So, the strategy I'll use it to put interleaved sort keys on the dimension tables after analyzing which fields are filtered most often (via Explain plans). Filter criteria seems to vary, otherwise I'd just use a single/compound sort key. – Eric Nov 01 '18 at 08:35
I found that joins to small table (~10 rows) on the sortkey have a much lower blocks_read in stl_query_metrics (or similar metrics in stl_scan). It's apparently using the zone map on the sortkey column even though the plan shows a hash join only (no filtering on the join column in the scan). – KWillets Mar 31 '20 at 18:14

Redshift Sort Keys For Joins

1 Answers1