
I have a query which is taking more than 15 mins to execute in Redshift. This query is being triggered using AWS Lambda, which has a timeout of 15 mins. So, I wanted to check if there is a way to optimize the query to make it return results faster.

Here is my SQL query:

insert into
  test.qa_locked
select
  '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481',
  'ABC-013505',
  'ABC-013505-2-2020',
  user_id,
  cast(TIMEOFDAY() as timestamp)
from
  (
    select
      user_id
    from
      (
        select
          contact_id
        from
          test.qa_locked
      )
    where
      contact_cnt <= 1
  )

Here is the plan:

XN Subquery Scan "*SELECT*" (cost=1000028198481.69..1000028198481.75 rows=1 width=218)
     ->  XN Subquery Scan derived_table1 (cost=1000028198481.69..1000028198481.73 rows=1 width=210)
         ->  XN Window (cost=1000028198481.69..1000028198481.71 rows=1 width=56)
             ->  XN Sort (cost=1000028198481.69..1000028198481.70 rows=1 width=56)
                 ->  XN Network (cost=1645148.05..28198481.68 rows=1 width=56)
                     ->  XN Hash NOT IN Join DS_DIST_OUTER (cost=1645148.05..28198481.68 rows=1 width=56)
                         ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.76..28091814.71 rows=1 width=56)
                             ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.09..7491814.01 rows=1 width=56)
                                 ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.68..6805146.91 rows=1 width=56)
                                     ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.16..6438479.71 rows=1 width=56)
                                         ->  XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.65..6071812.51 rows=1 width=56)
                                             ->  XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.29..6071812.13 rows=1 width=56)
                                                 ->  XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.96..6071811.77 rows=1 width=56)
                                                     ->  XN Hash NOT IN Join DS_DIST_NONE (cost=1645144.50..5598477.96 rows=1 width=56)
                                                         ->  XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.47..5598477.91 rows=1 width=84)
                                                             ->  XN Hash NOT IN Join DS_DIST_OUTER (cost=1645142.59..5078476.00 rows=1 width=84)
                                                                 ->  XN Hash NOT IN Join DS_BCAST_INNER (cost=1645142.57..4065142.63 rows=1 width=600)
                                                                     ->  XN Hash Left Join DS_DIST_BOTH (cost=1201145.21..3221145.24 rows=1 width=1116)
                                                                         ->  XN Seq Scan on contacts xa (cost=1201145.21..1201145.21 rows=1 width=640)
                                                                         ->  XN Hash (cost=0.00..0.00 rows=1 width=556)
                                                                             ->  XN Seq Scan on accounts ya (cost=0.00..0.00 rows=1 width=556)
                                                                     ->  XN Hash (cost=443997.35..443997.35 rows=1 width=32)
                                                                         ->  XN Subquery Scan "IN_subquery" (cost=23989.76..443997.35 rows=1 width=32)
                                                                             ->  XN Unique (cost=23989.76..443997.34 rows=1 width=516)
                                                                                 ->  XN Nested Loop DS_BCAST_INNER (cost=23989.76..443997.34 rows=1 width=516)
                                                                                     ->  XN Seq Scan on accounts con (cost=0.00..0.00 rows=1 width=516)
                                                                                     ->  XN Hash NOT IN Join DS_DIST_OUTER (cost=23989.76..83997.32 rows=1 width=26)
                                                                                         ->  XN Seq Scan on campaign_exclusion_list cam (cost=0.00..7.53 rows=1 width=26)
                                                                                         ->  XN Hash (cost=23989.75..23989.75 rows=1 width=32)
                                                                                             ->  XN Subquery Scan "IN_subquery" (cost=0.00..23989.75 rows=1 width=32)
                                                                                                 ->  XN Unique (cost=0.00..23989.74 rows=1 width=18)
                                                                                                     ->  XN Seq Scan on campaign_inclusion_list (cost=0.00..23989.74 rows=1 width=18)
                                                                 ->  XN Hash (cost=0.01..0.01 rows=1 width=516)
                                                                     ->  XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=516)
                                                                         ->  XN Unique (cost=0.00..0.00 rows=1 width=516)
                                                                             ->  XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=516)
                                                             ->  XN Hash (cost=1.88..1.88 rows=1 width=210)
                                                                 ->  XN Seq Scan on bh_email_open_clicks (cost=0.00..1.88 rows=1 width=210)
                                                         ->  XN Hash (cost=0.01..0.01 rows=1 width=210)
                                                             ->  XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=210)
                                                                 ->  XN Unique (cost=0.00..0.00 rows=1 width=28)
                                                                     ->  XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=28)
                                                     ->  XN Hash (cost=0.45..0.45 rows=1 width=210)
                                                         ->  XN Seq Scan on bh_leads (cost=0.00..0.45 rows=1 width=210)
                                                 ->  XN Hash (cost=0.32..0.32 rows=1 width=402)
                                                     ->  XN Subquery Scan "IN_subquery" (cost=0.30..0.32 rows=1 width=402)
                                                         ->  XN HashAggregate (cost=0.30..0.31 rows=1 width=402)
                                                             ->  XN Seq Scan on campaign_extraction_history (cost=0.00..0.30 rows=1 width=402)
                                             ->  XN Hash (cost=0.35..0.35 rows=1 width=402)
                                                 ->  XN Subquery Scan "IN_subquery" (cost=0.33..0.35 rows=1 width=402)
                                                     ->  XN HashAggregate (cost=0.33..0.34 rows=1 width=402)
                                                         ->  XN Seq Scan on campaign_extraction_history (cost=0.00..0.33 rows=1 width=402)
                                         ->  XN Hash (cost=0.50..0.50 rows=1 width=210)
                                             ->  XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
                                     ->  XN Hash (cost=0.50..0.50 rows=1 width=210)
                                         ->  XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
                                 ->  XN Hash (cost=0.40..0.40 rows=1 width=402)
                                     ->  XN Seq Scan on campaign_extraction_history (cost=0.00..0.40 rows=1 width=402)
                             ->  XN Hash (cost=0.30..0.30 rows=30 width=402)
                                 ->  XN Seq Scan on ce_locked_records_tb (cost=0.00..0.30 rows=30 width=402)
                         ->  XN Hash (cost=0.27..0.27 rows=1 width=210)
                             ->  XN Subquery Scan "IN_subquery" (cost=0.26..0.27 rows=1 width=210)
                                 ->  XN HashAggregate (cost=0.26..0.26 rows=1 width=210)
                                     ->  XN Seq Scan on bh_leads (cost=0.00..0.25 rows=1 width=210)

Please suggest if there are any ways to optimize this query.

dang
  • Was this query written by hand, or is it generated by some BI tool? The first thing to notice is that it is _very_ complex, with all those sub-selects. It would be great to get rid of them. The other thing is that it has 26 `not in` operators, which are notoriously bad for efficiency. A `not in` requires the selection of a whole lot of data, then checking that the desired value isn't in each returned row. That makes things very slow in _any_ database. They also seem to be responsible for most of the `cost` calculations. There are 31 sub-queries (`SELECT`) in that query, too. – John Rotenstein May 09 '20 at 02:03
  • @JohnRotenstein the query is generated using a stored procedure. All the code is written by hand. What would be an alternative way of `not in` which would be efficient? – dang May 09 '20 at 04:22
  • A `NOT IN` can often be replaced by a `LEFT OUTER JOIN`. Then, confirm that a joined field is NULL. There is quite a bit of discussion on the Internet about this, eg: [SQL performance on LEFT OUTER JOIN vs NOT EXISTS](https://stackoverflow.com/a/6778034/174777) and [Consider using NOT EXISTS instead of NOT IN with a subquery - Redgate Software](https://www.red-gate.com/hub/product-learning/sql-prompt/consider-using-not-exists-instead-not-subquery) and [NOT IN vs. NOT EXISTS vs. OUTER APPLY vs. OUTER JOIN](https://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join). – John Rotenstein May 09 '20 at 04:34
  • @JohnRotenstein are you able to rewrite my query to make it fast? – dang May 09 '20 at 04:35
  • Any attempt to rewrite such a complex query would need to be made against the actual data source (ie by YOU!). If the query is going to be run often, then one option would be to create tables with computed results of the `NOT IN` sub-queries, so the complexity can be reduced. These tables could be reloaded, say, on a daily basis. Then, you could JOIN against them. – John Rotenstein May 09 '20 at 04:55
  • @JohnRotenstein ok, creating tables for NOT IN is not possible as this query is executed based on the requests coming in. This query is for new jobs which gets created and we have to use the latest data available in the tables. Do you think there can be some quick wins in the query? For ex. Can we change not in to not exist, would it perform better? Or any other small tweaks? – dang May 09 '20 at 04:58
  • The EXPLAIN plan shows a `cost` figure. You should concentrate on reducing or removing the high costs. In addition, the high costs might be caused by the `DS_DIST_INNER` and `DS_DIST_BOTH` activities. These can typically be avoided by tables sharing the same `DISTKEY` or by replicating tables on all nodes. See: [Evaluating the query plan - Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/c_data_redistribution.html) – John Rotenstein May 09 '20 at 11:19

1 Answer


This feels like a query that has been added to time and time again, with a lot of code repetition and many unnecessary table scans.

Understand that my main experience is with MSSQL and not Redshift, but most of the same principles will apply.

 (
              lower(xa.primary_function) in (
                select
                  lower(param_val)
                from
                  ce_campaign_spec_tb
                where
                  job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
                  and param = 'primary_function'
                  and relation_id = 4
              )
              and lower(xa.role) in (
                select
                  lower(param_val)
                from
                  ce_campaign_spec_tb
                where
                  job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
                  and param = 'role'
                  and relation_id = 4
              )
              and lower(xa.title) in (
                select
                  lower(title)
                from
                  contacts con
                  inner join ce_campaign_spec_tb camp on lower(con.title) ilike '%' || trim(
                    both ' '
                    from
                      camp.param_val
                  ) || '%'
                where
                  job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
                  and param = 'title'
                  and relation_id = 4
              )
            )

Without knowing what this does: you seem to be repeating this block of code five times, with the only change being the `relation_id`. You start with id 4, then 2, then 1, then 3 and then 5, but other than the id nothing seems to change. There might be subtle differences, but as written you scan the tables five separate times instead of once with a single predicate. Depending on the size of the tables, that can be a fair amount of data you are scanning.
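One way to cut that down is to read the spec table once and reuse the result. This is only a sketch (the temp-table name `spec_params` is made up for illustration, and it assumes the five blocks really only differ in `relation_id`; the `title` check with its `ilike` join would need the same treatment separately):

```sql
-- Sketch only: read ce_campaign_spec_tb once per session instead of
-- once per block. spec_params is a made-up name for illustration.
create temporary table spec_params as
select relation_id, param, lower(param_val) as param_val
from ce_campaign_spec_tb
where job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
  and param in ('primary_function', 'role', 'title');

-- Each of the five repeated blocks can then test membership against
-- the small temp table rather than re-scanning the base table, e.g.:
--   lower(xa.primary_function) in (
--     select param_val from spec_params
--     where param = 'primary_function' and relation_id = 4
--   )
```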

A few rows further down:

and xa.contact_id not in (
            select
              contact_id
            from
              bh_leads
            where
              (CURRENT_DATE - creation_date :: date) <= 60
              and UPPER(LOB) = 'ABC'
              and agency_id = '1002'
          )
          and xa.contact_id not in (
            select
              contact_id
            from
              bh_leads
            where
              (CURRENT_DATE - creation_date :: date) <= 60
              and UPPER(LOB) = 'ABC'
              and sponsor_id = '8306'
          )

Again, two table scans for virtually the same data; the only difference is that one checks `sponsor_id` for a value and the other `agency_id`. This could have been done in a single statement instead of two.
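Assuming the two exclusions really can be combined (worth verifying against the business logic), one scan with the checks ORed would look like:

```sql
-- Sketch: merge the two bh_leads subqueries above into one scan.
and xa.contact_id not in (
  select contact_id
  from bh_leads
  where (CURRENT_DATE - creation_date :: date) <= 60
    and UPPER(LOB) = 'ABC'
    and (agency_id = '1002' or sponsor_id = '8306')
)
```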

Further down:

and email_id not in (
            select
              distinct email_id
            from
              contacts
            where
              is_email_suppressed = 1
          )

Earlier you referenced `contacts` (aliased as `xa`) and put this as a predicate in the where clause:

and xa.is_email_suppressed = 0

Without knowing the exact schema of the tables in question I cannot be sure, but they seem to do largely the same thing.
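If the subquery is not simply redundant with the `xa.is_email_suppressed = 0` predicate, a correlated `NOT EXISTS` is usually a better shape than `NOT IN` here, since it can stop at the first matching row. A sketch, not a drop-in replacement:

```sql
-- Sketch: NOT EXISTS form of the suppressed-email exclusion above.
and not exists (
  select 1
  from contacts c
  where c.email_id = xa.email_id
    and c.is_email_suppressed = 1
)
```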

Also, from the Redshift documentation here: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

It seems you can create temporary tables for the duration of a single session. Most of the subqueries could be prepared up front so you can join on the result set. If you first prepare, for example, a temporary result set of valid rows from the campaign_extraction_history table, you can replace the following predicates with a single left join:

              AND contact_id NOT IN (
            select
              contact_id
            from
              campaign_extraction_history
            where
              sf_oms_campaign_id = 'ABC-013505-2-2020'
              and sf_campaign_id = 'ABC-013505'
              and (CURRENT_DATE - creation_date :: date) < 1
              and channel = 'BOTH'
              and (
                UPPER(STATUS) = 'EXTRACTED'
                OR UPPER(STATUS) = 'LAUNCHED'
                OR UPPER(STATUS) = 'CONFIRMED'
              )
          )
          AND contact_id NOT IN (
            select
              contact_id
            from
              campaign_extraction_history
            where
              creation_date :: date = CURRENT_DATE
              and channel = 'BOTH'
              and (
                UPPER(STATUS) = 'EXTRACTED'
                OR UPPER(STATUS) = 'LAUNCHED'
                OR UPPER(STATUS) = 'CONFIRMED'
              )
            group by
              contact_id
            having
              count(*) > 10
          )
          AND contact_id NOT IN (
            select
              contact_id
            from
              campaign_extraction_history
            where
              sf_campaign_id = 'ABC-013505'
              and channel = 'BOTH'
              and (
                UPPER(STATUS) = 'EXTRACTED'
                OR UPPER(STATUS) = 'LAUNCHED'
                OR UPPER(STATUS) = 'CONFIRMED'
              )
            group by
              contact_id
            having
              count(*) >= 3
          )
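Folding those three exclusions into one pass could look something like this. It is only a sketch (`excluded_contacts` is a made-up name, and the conditional sums assume the three predicates above really are the only criteria):

```sql
-- Sketch: materialise the excluded contact_ids once per session,
-- combining the three campaign_extraction_history conditions.
create temporary table excluded_contacts as
select contact_id
from campaign_extraction_history
where channel = 'BOTH'
  and UPPER(STATUS) in ('EXTRACTED', 'LAUNCHED', 'CONFIRMED')
group by contact_id
having
     -- extracted for this exact campaign within the last day
     sum(case when sf_oms_campaign_id = 'ABC-013505-2-2020'
               and sf_campaign_id = 'ABC-013505'
               and (CURRENT_DATE - creation_date :: date) < 1
              then 1 else 0 end) > 0
  -- more than 10 extractions today, any campaign
  or sum(case when creation_date :: date = CURRENT_DATE
              then 1 else 0 end) > 10
  -- 3 or more extractions for this sf_campaign_id overall
  or sum(case when sf_campaign_id = 'ABC-013505'
              then 1 else 0 end) >= 3;

-- The three NOT IN predicates then become a single anti-join:
--   left join excluded_contacts ex on ex.contact_id = xa.contact_id
--   ...
--   where ex.contact_id is null
```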

There are probably more places where you can combine queries and read each table just a single time. For example, you exclude many `email_id` values, but in various places across different statements and subqueries. They can most likely be done in a single statement.

Maybe the best way to improve performance is to ask yourself what the query is trying to do and exclude, and then rewrite the entire query. This might be a fair amount of work, but could end up being faster in the long run.

Repr
  • for your first point - I agree that I have repeated the same block 5 times. But because I want to put AND condition for same relation_id. For ex. relation_id = 1 should have the conditions in AND. Then I take the next one separated by OR. How to modify this? – dang May 11 '20 at 18:38
  • Replace the `and relation_id = 4` with `and relation_id between 1 and 5`, or substitute 1 and 5 with a min and max value from another query. If the valid IDs aren't always a 1 increment you can also use CTEs or temp tables to build up valid IDs (or any other subselection) for use within the transaction – Repr May 13 '20 at 10:58
  • I need to compare values of primary_function, role, function for same relation_id. It cannot be between 1 and 5. – dang May 13 '20 at 10:59
  • if you use relation id between 1 and 5 you get 5 results. then you find the single row where all columns match (primary_function, role and function) – Repr May 14 '20 at 13:03