1

We have this query in production which runs daily It does a lot of joins and also uses window function in hive

We tried to add few set options but that did not help much

Structure is something like this -

SELECT
        C.f1, C.f2, A.f2 ...
FROM (
    SELECT * FROM (
        SELECT T1.*, B.atid, B.a_id,
        ROW_NUMBER() OVER (PARTITION BY T1.wtid, B.atid ORDER BY T1.b_ts DESC) AS RANK_
        FROM T1 AS T1
        JOIN T5 ON T1.t_dt = T5.t_dt
        JOIN T2 B ON T1.wtid = B.wtid and T1.b_ts = B.b_ts
        LEFT OUTER JOIN (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
        ON T1.TYP = PV.p_cd
        WHERE T1.state not in ("INVALID")
        AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
        AND ISNULL(PV.p_cd)
    ) T
    WHERE T.rank_ = 1
) A

JOIN (SELECT *, row_number() over (partition by ac_id order by b_ts desc) rank_  
      FROM T4
      WHERE event not in ('CT','UPD')
     ) AS C
  ON A.a_id = C.a_id
AND A.atid = C.ac_id
AND C.rank_ = 1
JOIN T6 ON C.t_dt = T6.t_dt
  • As i cannot ignore any tables ( and joins ), My approach was to substitute the window function with another join using aggregate function max but i was not able to rewrite it.
  • Also i am not sure if that will surely help to improve performance so any guidance will help us.
tk421
  • 5,775
  • 6
  • 23
  • 34
Vidya K
  • 103
  • 8
  • If you are not running it on Tez, try it. The more complex query is the more improvement from Tez. Tune parallelism if not possible to improve query: https://stackoverflow.com/a/48296562/2700344 – leftjoin Apr 26 '19 at 14:35

1 Answers1

2

Analytic functions usually perform better than joins with select max, because you are reading the same table only once in case of analytic function and row_number calculation is parallelized by partition by.

Try to regroup joins and filtering.

Join

LEFT OUTER JOIN (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
        ON T1.TYP = PV.p_cd

with where condition ISNULL(PV.p_cd) is reducing some rows in T1. The same do these conditions:

WHERE T1.state not in ("INVALID")
        AND T1.evt_name NOT IN ('INACTIVE','DORMANT')

Move this join into the subquery, if it filters a lo, this may help to reduce the dataset in T1 before all other joins and row_number():

(select T1.* from T1 
             left join (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV 
                       ON T1.TYP = PV.p_cd 
 where T1.state not in ("INVALID")
        AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
        AND ISNULL(PV.p_cd)
) as T1 

Also first row_number is calculated only on T1 and B tables:

PARTITION BY T1.wtid, B.atid ORDER BY T1.b_ts DESC

Consider joining T5 table after row_number filter, if this join is heavy, and row_number filter is reducing the dataset, then wrap row_number with filter in the subquery again and join subquery filtered with T5.

(--filtered by row_number
select * from
(
 SELECT T1.*, B.atid, B.a_id,
        ROW_NUMBER() OVER (PARTITION BY T1.wtid, B.atid ORDER BY T1.b_ts DESC) AS RANK_
  from
    (select T1.* from T1 
                 left join (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV 
                           ON T1.TYP = PV.p_cd 
     where T1.state not in ("INVALID")
            AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
            AND ISNULL(PV.p_cd)
    ) as T1 JOIN T2 B ON T1.wtid = B.wtid and T1.b_ts = B.b_ts
) T WHERE T.rank_ = 1
) T --filtered
JOIN T5 ON T1.t_dt = T5.t_d  

This may help depending on your data.

Read also: https://stackoverflow.com/a/51061613/2700344 and this: https://stackoverflow.com/a/51061613/2700344

leftjoin
  • 36,950
  • 8
  • 57
  • 116