1

We are supporting an application which are running huge hive queries triggered via ETL tool.

The query after the mapping runs on hive. The query is very big but its structure looks like this.

INSERT INTO Table2
Select t1.f0, t1.f1,
infUDFCallDouble('TO_FLOAT', t1.f2) as f2, 
SUBSTR(SUBSTR(toString(t1.f4, 'YYYY-MM-DD'), (CASE WHEN 0 < (- 
LENGTH(toString(t1.f4, 'YYYY-MM-DD'))) THEN 0 ELSE 0 END), 10), (CASE WHEN 0 < (- LENGTH(SUBSTR(toString(t1.f4, 'YYYY-MM-DD'), (CASE WHEN 0 < (- LENGTH(toString(t1.f4, 'YYYY-MM-DD'))) THEN 0 ELSE 0 END), 10))) THEN 0 ELSE 0 END), 10) as f4, 
f5, .....
FROM
   (
     Select t1.f0, t1.f1...
     FROM
        (
         SELECT CAST(t1.cust_id AS STRING) as f0, ....
          ...
         FROM sw.v_cust_info_pr t1
         )
     group by t1.f0, t1.f1
     .....t1.f50 )
t1 ; 

I surely don't think, they need to scan the same view multiple times Is there any simpler way to write this. Or surely multiple stage mapping can be merged, is my guess. apart from that, is there any other simpler way to approach the same. I am new and on support side, but this query runs for hours, so need to check if its running in the right way. i am sure one inner select can be discarded, anything else i can try.

Kumar
  • 119
  • 10

1 Answers1

0

Sub-queries like these in your question are not a problem, optimizer will take care about this. There will be no multiple reads of the same data. Your query reads source only one time: FROM sw.v_cust_info_pr t1. The whole query will be executed in two vertices - Mapper 1 and Reducer 2. Use explain select to check this. To speed-up processing, use Tez, CBO, vectorizing, PPD, tune proper parallelism on mappers and reducers. Use map-side aggregation set hive.map.aggr = true;

leftjoin
  • 36,950
  • 8
  • 57
  • 116