0

I have a very big query with quite a few case statements. This is bring the cluster performance down with 100% disc consumption. Processing data is ~16Tb and it take ~24hrs to compute often failing due to no disk space. Creating external tables reducing dependencies on other table doesn't seem to be helping much.

CREATE TEMP TABLE Merge_result 
AS
(SELECT DISTINCT id,
       id2,
       variant,
       department_name,
       pptt,
       aa1,
       CASE
         WHEN desc = 'lang_1' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND aa1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND aa1 >= 0.76 THEN 0
         ELSE 1
       END AS aa1_score,
       bb1,
       CASE
         WHEN desc = 'lang_1' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND bb1 >= 0.76 THEN 0
         ELSE 1
       END AS bb1_score,
       cc1,
       CASE
         WHEN desc = 'lang_1' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND cc1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND cc1 >= 0.76 THEN 0
         ELSE 1
       END AS cc1_score,
       dd1,
       CASE
         WHEN desc = 'lang_1' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND dd1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND dd1 >= 0.76 THEN 0
        ELSE 1
       END AS dd_score,
       ee1,
       CASE
         WHEN desc = 'lang_1' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND ee1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND ee1 >= 0.76 THEN 0
         ELSE 1
       END AS ee1_score,
       ff1,
       CASE
         WHEN desc = 'lang_1' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND ff1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND ff1 >= 0.76 THEN 0
         ELSE 1
       END AS ff1_score,
       gg1,
       CASE
         WHEN desc = 'lang_1' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_3' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_4' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_5' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_6' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_7' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_8' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_9' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_10' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_11' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_12' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_13' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_14' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_15' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_16' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_17' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_18' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_19' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_20' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_21' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_22' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_23' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_24' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_25' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_26' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_27' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_28' AND gg1 >= 0.76 THEN 0
         WHEN desc = 'lang_29' AND gg1 >= 0.76 THEN 0
         ELSE 1
       END AS gg1_score,
       CASE
         WHEN desc = 'lang_1' AND bb1 >= 0.76 THEN 0
         WHEN desc = 'lang_2' AND bb1 >= 0.76 THEN 0
         ELSE 1
       END AS hh1_score,
       ii1,
       CASE
         WHEN desc = 'lang_12' AND ii1 >= 0.5 THEN 1
         WHEN desc = 'lang_27' AND ii1 >= 0.5 THEN 1
         ELSE 1
       END AS ii1_score,
       jj1,
       CASE
         WHEN desc = 'lang_1' AND jj1 >= 0.8 AND col_name != 'negative' THEN 0
         WHEN desc = 'lang_2' AND jj1 >= 0.8 AND col_name != 'negative' THEN 0
         WHEN desc = 'lang_16' AND jj1 >= 0.8 AND col_name != 'negative' THEN 0
         ELSE 1
       END AS jj1_score,
       kk1,
       ll1,
       mm1,
       CASE
         WHEN ptd = 'DRESS' AND kk1 >= 0.5 THEN 0
         WHEN ptd = 'KURTA' AND kk1 >= 0.5 THEN 0
         WHEN ptd = 'SHIRT' AND ll1 >= 0.5 THEN 0
         WHEN ptd = 'SKIRT' AND mm1 >= 0.5 THEN 0
         WHEN ptd = 'SHORTS' AND mm1 >= 0.5 THEN 0
         WHEN ptd = 'PANTS' AND mm1 >= 0.5 THEN 0
         ELSE 1
       END AS nn1_score,
       pp1,
       CASE
         WHEN department_name = 'Baby' AND pp1 = 1 THEN 0
         WHEN department_name = 'Boy' AND pp1 = 1 THEN 0
         WHEN department_name = 'Girl' AND pp1 = 1 THEN 0
         WHEN department_name = 'Children' AND pp1 = 1 THEN 0
         WHEN department_name = 'Men' AND pp1 = 0 THEN 0
         WHEN department_name = 'Women' AND pp1 = 0 THEN 0
         ELSE 1
       END AS qq1_score,
       rr1,
       ss1,
       cast(rr1 as decimal(2,1)) as bp,
       CASE
         WHEN desc = 'lang_1' AND bp >= 0.8 THEN 0
         ELSE 1
       END AS tt1_score,
       uu1,
       CASE
         WHEN pptt = 'glass' AND uu1 >= 0.5 THEN 0
         ELSE 1
       END AS vv1_score,
       height,
       width,
       CASE
         WHEN height >= 500 OR width >= 500 THEN 1
         ELSE 0
       END AS ww1,
       CASE
         WHEN desc = 'lang_1' OR desc = 'lang_2' OR desc = 'lang_3' OR desc = 'lang_4' OR desc = 'lang_5' THEN (aa1_score*bb1_score*cc1_score*dd1_score*ee1_score*ff1_score*gg1_score*jj1_score*nn1_score*qq1_score*vv1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_6' OR desc = 'lang_7' OR desc = 'lang_8' OR desc = 'lang_9' OR desc = 'lang_10' OR desc = 'lang_11' OR desc = 'lang_12' OR desc = 'lang_13' OR desc = 'lang_14' OR desc = 'lang_15' OR desc = 'lang_16' OR desc = 'lang_17' OR desc = 'lang_18' OR desc = 'lang_19' OR desc = 'lang_20' THEN (bb1_score*cc_score*dd1*ee1_score*ff1_score*gg1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_21' THEN (bb1_score*cc1*dd1*ee1_score*ff1_score*gg1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_22' OR desc = 'lang_23' THEN (bb1_score*cc1*dd1*ee1_score*ff1_score*gg1_score*hh1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_24' THEN (bb1_score*cc1*dd1*ee1_score*ff1_score*gg1_score*jj1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_25' THEN (bb1_score*cc1*dd1*ee1_score*ff1_score*gg1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_26' THEN (bb1_score*cc1*ff1_score*gg1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_27' OR desc = 'lang_28' THEN (cc1*dd1*ff1_score*gg1_score*ii1_score*ww1*main_score*count_score)
         WHEN desc = 'lang_29' THEN (cc1*dd1*ff1_score*gg1_score*ww1*main_score*count_score)
       END AS xx1_score,
       CASE
         WHEN height >= 1000 OR width >= 1000 THEN 1
         ELSE 0
       END AS yy1,
       CASE
         WHEN aa1_score = 0 THEN 'not have aa1. '
         ELSE '|'
       END AS Reject_Reason_aa1,
       CASE
         WHEN bb1_score = 0 THEN 'not have bb1. '
         ELSE '|'
       END AS Reject_Reason_bb1,
       CASE
         WHEN cc1 = 0 THEN 'not have cc1. '
         ELSE '|'
       END AS Reject_Reason_cc1,
       CASE
         WHEN dd1 = 0 THEN 'not be dd1. '
         ELSE '|'
       END AS Reject_Reason_dd1,
       CASE
         WHEN ee1_score = 0 THEN 'not have ee1. '
         ELSE '|'
       END AS Reject_Reason_ee1,
       CASE
         WHEN ff1_score = 0 THEN 'not have ff1/text. '
         ELSE '|'
       END AS Reject_Reason_ff1,
       CASE
         WHEN gg1_score = 0 THEN 'not have gg1. '
         ELSE '|'
       END AS Reject_Reason_gg1,
       CASE
         WHEN ii1_score = 0 THEN 'Image is not ii1. '
         ELSE '|'
       END AS Reject_Reason_ii1,
       CASE
         WHEN ww1 = 0 THEN 'be 500 x 500. '
         ELSE '|'
       END AS Reject_Reason_ww1,
       CASE
         WHEN main_score = 0 THEN 'main present. '
         ELSE '|'
       END AS Reject_Reason_main_score,
       CASE
         WHEN count_score = 0 THEN 'minimum 3. '
         ELSE '|'
       END AS Reject_Reason_count_score,
       CASE
         WHEN hh1_score = 0 THEN 'not have hh1. '
         ELSE '|'
       END AS Reject_Reason_hh1,
       CASE
         WHEN jj1_score = 0 THEN 'not have jj1. '
         ELSE '|'
       END AS Reject_Reason_jj1,
       CASE
         WHEN dress_length_score = 0 THEN 'be kk1. '
         ELSE '|'
       END AS Reject_Reason_kk1,
       CASE
         WHEN face_crop_score = 0 THEN 'not be ll1. '
         ELSE '|'
       END AS Reject_Reason_ll1,
       CASE
         WHEN incorrect_posture_score = 0 THEN 'not be mm1. '
         ELSE '|'
       END AS Reject_Reason_mm1,
       CASE
         WHEN incorrect_sunlangass_score = 0 THEN 'not be nn1. '
         ELSE '|'
       END AS Reject_Reason_nn1,
       CASE
         WHEN xx1_score = 0 THEN 'Reject'
         ELSE 'Accept'
       END AS choice,
       desc,
       TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD') AS check_date,
       Count,
       main_score,
       count_score
FROM other);

I have also tried looking up other solution from links: 4526628, 42224718, 39501092 but didnt help either.

Any suggestions or support is much apptriciated.

Shri
  • 89
  • 2
  • 8
  • 1
    You have no filters, so you have to read every row from this table and the distinct will require all rows to be sorted (one usually doesn’t expect duplicate rows in a table, is it really necessary?) What are you going to do with all the results once you have them? It would be worth considering the whole process and optimize that rather than this part of your proposed process. – Andrew Sayer Dec 24 '20 at 11:25

4 Answers4

1

Firstly, I would suggest you change the distinct to a group by. A distinct is way more expensive when running a query. See below for more info.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct#:~:text=Essentially%2C%20DISTINCT%20collects%20all%20of,performing%20any%20of%20that%20work.

Secondly, check your indexing on the source table.

I wouldn't worry too much about the case statements, other than being a maintenance nightmare it shouldn't be the cause of such a slow performance.

Nys
  • 22
  • 3
1

Here is my advice.

Remove all the case expressions and see if they make a difference. They could, but I would be very surprised if it were the difference between the query running and not running.

If they do, then you can simplify them by understanding that they are evaluated in order. So, they can be simplified. For instance, the first one could simply be:

   (CASE WHEN aa1 >= 0.76 AND desc IN ('lang_1' . . . ) 
         THEN 0 ELSE 1
    END)

This simplifies the code, but probably has minimal impact on performance.

Then there is the possibility that the SELECT DISTINCT makes a difference -- but the reasons would not be what Aaron Bertrand explains in SQL Server. The issue would be that Redshift implements the DISTINCT on a single node, but the GROUP BY is distributed. I'm not 100% sure if this is still the case, but it is worth trying GROUP BYinstead.

Actually, I suspect that SELECT DISTINCT might not even be necessary. I'm not sure why you would have duplicates in this type of data.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Your primary issue is that the disks are filling up. The time it takes to write this much data is significant and when the disks approach 100% full Redshift will move from a performance optimized approach to one that economizes on disk space use. Your case statements may be slow but against this disk space issue not the first place to attack.

You didn't mention how full the disks are before this query runs but I'm assuming that you have at least 25% space available. My first guess when reading this was that you have a join clause that is replicating data and spilling to disk. You just mention that the from clause is just "other" but is this a table or a view or a sub-query? What would be helpful is to post the explain plan for the query and see how the number of rows change with each step of the query. From my experience this is the most likely area to investigate disk consumption issues.

Let's assume my instincts are off and this query is as simple as you posted. To fill up the disks would mean that the source table is very large as compared to the available disk space. Now you are making a temp table and this will take space and you are using DISTINCT which will need to take some scratch space to store the working data as it performs this operation. The disks are filling up implies that the resulting temp table and the scratch space are large - there are a lot of unique combinations going through the DISTINCT. When the number of unique combinations grows large GROUP BY will perform better than DISTINCT but will also need scratch space. You just need space to store all the intermediate results. This is possible if you are starting with a nearly full cluster.

Query patterns like this also make me suspect that redistribution is a factor in the query. A lot of data may also need to move around the network during execution and this can cause significant execution slowdowns. It also can result in nonuniform filling up of disks. If massive row increase isn't the issue you will want to look at svl_query_report for this query and see where time is being spent in the query. This can give important clues to what is happening.

Bill Weiner
  • 8,835
  • 2
  • 7
  • 18
0

Thank you all for the suggestions. Replacing DISTINCT with GROUP BY helped. Following steps resulted in reducing (i) time taken to process by 80% (22hrs to 4hrs)(ii) CPU utilization by 60% (from 100% to 40%) (iii) Disk space consumption by 82% (100% to 18%).

  1. Sort source table and dependent table (ASC or DESC)
  2. Create temp tables
  3. Drop unused temp tables
  4. Replace DISTINCT with GROUP BY where ever applicable- Refer for group by one column with multi column selection: 21217778
  5. Removed redundant case when statements
Shri
  • 89
  • 2
  • 8