1

I'm try to execute the below hive query on Azure HDInsight cluster but it's taking unprecedented amount of time to finish. Did implemented hive settings but of no use. Below are the details:

Table

CREATE TABLE DB_MYDB.TABLE1(
  MSTR_KEY STRING,
  SDNT_ID STRING,
  CLSS_CD STRING,
  BRNCH_CD STRING,
  SECT_CD STRING,
  GRP_CD STRING,
  GRP_NM STRING,
  SUBJ_DES STRING,
  GRP_DESC STRING,
  DTL_DESC STRING,
  ACTV_FLAG STRING,
  CMP_NM STRING)
STORED AS ORC
TBLPROPERTIES ('ORC.COMPRESS'='SNAPPY');

Hive Query

INSERT OVERWRITE TABLE DB_MYDB.TABLE1
SELECT
CURR.MSTR_KEY,
CURR.SDNT_ID,
CURR.CLSS_CD,
CURR.BRNCH_CD,
CURR.SECT_CD,
CURR.GRP_CD,
CURR.GRP_NM,
CURR.SUBJ_DES,
CURR.GRP_DESC,
CURR.DTL_DESC,
'Y',
CURR.CMP_NM
FROM DB_MYDB.TABLE2 CURR
LEFT OUTER JOIN DB_MYDB.TABLE3 PREV
ON (CURR.SDNT_ID=PREV.SDNT_ID
AND CURR.CLSS_CD=PREV.CLSS_CD
AND CURR.BRNCH_CD=PREV.BRNCH_CD
AND CURR.SECT_CD=PREV.SECT_CD
AND CURR.GRP_CD=PREV.GRP_CD
AND CURR.GRP_NM=PREV.GRP_NM)
WHERE PREV.SDNT_ID IS NULL;

But the query is running for hours. Below is the detail:

--------------------------------------------------------------------------------
    VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     46         46        0        0       0       0
Map 3 ..........   SUCCEEDED    169        169        0        0       0       0
Reducer 2 ....       RUNNING   1009        825      184        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/03  [======================>>----] 84%   ELAPSED TIME: 13622.73 s  
--------------------------------------------------------------------------------

I did set some hive properties

SET hive.execution.engine=tez;
SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.auto.convert.join.noconditionaltask.size=3470;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.execution.reduce.groupby.enabled=true;
SET hive.cbo.enable=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.compute.query.using.stats=true;
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.tezfiles = true;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=16777216;
SET hive.merge.orcfile.stripe.level=true;

Records in Tables:

DB_MYDB.TABLE2= 337319653

DB_MYDB.TABLE3= 1946526625

There doesn't seem to be any impact on the query. Can anyone help me to:

  1. Understand that why this query is not completing and taking indefinite time?
  2. How can I optimize it to work faster and complete?

Using the versions:

Hadoop 2.7.3.2.6.5.3033-1
Hive 1.2.1000.2.6.5.3033-1
Azure HDInsight 3.6

Attempt_1:

As suggested by @leftjoin tried to set the set hive.exec.reducers.bytes.per.reducer=32000000;. This worked until the second last step of the hive script but at the last it failed with Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!

Last Query:

INSERT OVERWRITE TABLE DB_MYDB.TABLE3
SELECT
 CURR_FULL.MSTR_KEY,
 CURR_FULL.SDNT_ID,
 CURR_FULL.CLSS_CD,
 CURR_FULL.BRNCH_CD,
 CURR_FULL.GRP_CD,
 CURR_FULL.CHNL_CD,
 CURR_FULL.GRP_NM,
 CURR_FULL.GRP_DESC,
 CURR_FULL.SUBJ_DES,
 CURR_FULL.DTL_DESC,
 (CASE WHEN CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID THEN 'Y' ELSE 
 CURR_FULL.SDNT_ID_FLAG END) AS SDNT_ID_FLAG,
 CURR_FULL.CMP_NM
 FROM
   DB_MYDB.TABLE2 CURR_FULL
   LEFT OUTER JOIN DB_MYDB.TABLE1 SND_DELTA
   ON (CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID);


----------------------------------------------------------------- 
VERTICES    STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED KILLED
-----------------------------------------------------------------
Map 1 .........  RUNNING  1066    1060     6     0     0    0
Map 4 .......... SUCCEEDED   3     3       0     0     0    0
Reducer 2        RUNNING   1009    0       22    987   0    0
Reducer 3        INITED      1     0       0     1     0    0
-----------------------------------------------------------------
VERTICES: 01/04  [================>>--] 99%   ELAPSED TIME: 18187.78 s   

Error:

Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=8, pendingInputs=1058, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false

UnicornUnion
  • 329
  • 1
  • 12

2 Answers2

1

if you don't have index on your fk columns , you should add them for sure , here is my suggestion:

create index idx_TABLE2 on table DB_MYDB.TABLE2 (SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;

create index idx_TABLE3 on table DB_MYDB.TABLE3(SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;

be noticed from hive version 3.0 , indexing has been removed from hive and alternatively you can use materialized views (supported from Hive 2.3.0 and above) that gives you the same performance.

eshirvana
  • 23,227
  • 3
  • 22
  • 38
  • Thanks for your input. But I'm using `Tez` as the execution engine. Getting the error message `FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Indexes unsupported for Tez execution engine` while trying to create `index` – UnicornUnion Jun 08 '21 at 20:24
  • @UnicornUnion hmmm... maybe set the validation check to ignore before index creation : `set hive.msck.path.validation=ignore;` – eshirvana Jun 08 '21 at 20:27
1

If it is reducer vertex which is runnong slow, you can increase reducer parallelism by reducing bytes per reducer configuration. Check your current setting and reduce figure accordingly untill you get 2x or more reducers running:

set hive.exec.reducers.bytes.per.reducer=67108864; --example only, check your current settings 
                                                   --and reduce accordingly to get twice more reducers on Reducer 2 vertex

Change setting, start query, check the number of containers on Reducer 2 vertex, terminate and change again if the number of containers has not increased.

If you want to increase parallelism on mappers also, read this answer: https://stackoverflow.com/a/48487306/2700344

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thanks. Your suggestion worked until the second last step. At the last part of the Hive script it's failing with `Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!`. I've updated the question with the details. – UnicornUnion Jun 08 '21 at 22:51
  • @UnicornUnion Now it is failing in another query and on Mapper1, not reducer. Please avoid adding new questions to your initial one because it is wasting of our time: nobody happy with increasing scope and answers never accepted because of scope has changed. As for your new query, I'd suggest you to combite first query and the last one. Instead of loading intermediate table, use subquery, at least that currently failing mapper step will be eliminated completely – leftjoin Jun 09 '21 at 07:50
  • Sorry, had no intention to waste anybody's time here. Your suggestion did helped with few more tweaks. – UnicornUnion Jun 09 '21 at 08:45
  • 1
    @UnicornUnion check if joins are performed on unique columns in right-side joined table and does not duplicate, fix accordingly, this should be done for all joins. Also read about how to detect and fix skew on join: https://stackoverflow.com/a/51061613/2700344 and https://stackoverflow.com/a/66269057/2700344 – leftjoin Jun 09 '21 at 10:49