I configured Hive parallelism with the hive-site.xml properties below and restarted the cluster.

Property 1

Name: hive.exec.parallel
Value: true
Description: Run hive jobs in parallel

Property 2

Name: hive.exec.parallel.thread.number
Value: 8 (default)
Description: Maximum number of hive jobs to run in parallel

To test parallelism, I set up the two conditions below:

1. A single query in file.hql, run as hive -f file.hql

SELECT COL1, COL2 FROM TABLE1
UNION ALL
SELECT COL3, COL4 FROM TABLE2

Result:

When hive.exec.parallel = true, Time taken: 28.015 seconds, Total MapReduce CPU Time Spent: 3 seconds 10 msec

When hive.exec.parallel = false, Time taken: 24.778 seconds, Total MapReduce CPU Time Spent: 3 seconds 90 msec.

2. Two independent queries in two separate files, run as nohup hive -f file1.hql & nohup hive -f file2.hql

select count(1) from t1 -> file1.hql
select count(1) from t2 -> file2.hql

Result:

When hive.exec.parallel = false, Time taken: 29.391 seconds, Total MapReduce CPU Time Spent: 1 seconds 890 msec

Question:

How do I check that the two conditions above are indeed running in parallel? In the console, the output looks as if the queries ran sequentially.

Why is the time taken longer when hive.exec.parallel = true? How can I see that multiple Hive stages are being used?

Thank you,

leftjoin
user1

1 Answer

When the Hive execution engine is MR (hive.execution.engine=mr), Hive represents a query as one or more MapReduce jobs. These jobs (each with its own map and reduce phases) can be executed in parallel where possible. For example, this query:

SELECT COL1, COL2 FROM TABLE1
UNION
SELECT COL3, COL4 FROM TABLE2

can be executed as three jobs: (1) select from TABLE1, (2) select from TABLE2, (3) UNION (distinct).

The first two jobs can be executed in parallel; the third runs after the first and second have completed.

A more complex query can be executed as many MR jobs, and these parameters:

hive.exec.parallel and hive.exec.parallel.thread.number allow parallel execution of the jobs of a single query running on MR.
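To compare the two settings without editing hive-site.xml and restarting, the parameters can also be passed per run with the Hive CLI's --hiveconf option. The following is only a sketch: the function name run_with_parallel and the HIVE_BIN variable are hypothetical names introduced for illustration, and the thread number simply repeats the default from the question.

```shell
# Sketch only: run one .hql file on the MR engine with hive.exec.parallel
# switched on or off for this run. HIVE_BIN and run_with_parallel are
# hypothetical names; HIVE_BIN defaults to the hive CLI on the PATH.
HIVE_BIN="${HIVE_BIN:-hive}"

run_with_parallel() {
  parallel_flag="$1"   # "true" or "false"
  hql_file="$2"        # script containing the query to run
  "$HIVE_BIN" \
    --hiveconf hive.execution.engine=mr \
    --hiveconf hive.exec.parallel="$parallel_flag" \
    --hiveconf hive.exec.parallel.thread.number=8 \
    -f "$hql_file"
}
```

For example, in bash, time run_with_parallel true file.hql followed by time run_with_parallel false file.hql gives wall-clock times for both settings on the same query.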

You can check the jobs in the Job Tracker; its URL is printed in the logs during execution. The logs also show which jobs have been started and their execution progress.
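Before running the query, it can also help to look at its plan: Hive's EXPLAIN output includes a STAGE DEPENDENCIES section, and stages that do not depend on each other are the candidates for parallel execution. A minimal sketch, assuming the file holds a single statement; explain_stages and HIVE_BIN are hypothetical names used only for illustration.

```shell
# Sketch only: print the plan of the single query stored in an .hql file.
# HIVE_BIN and explain_stages are hypothetical names; HIVE_BIN defaults
# to the hive CLI on the PATH.
HIVE_BIN="${HIVE_BIN:-hive}"

explain_stages() {
  hql_file="$1"
  # Prefix the query text with EXPLAIN and plan it for the MR engine.
  # Assumes the file contains exactly one statement.
  "$HIVE_BIN" --hiveconf hive.execution.engine=mr \
    -e "EXPLAIN $(cat "$hql_file")"
}
```

Calling explain_stages file.hql and reading the STAGE DEPENDENCIES section shows which stages are root stages and which wait for others.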

When running on the Tez execution engine (hive.execution.engine=tez), Hive represents a query as a single optimized DAG, omitting unnecessary steps such as writing intermediate results to persistent storage and reading them back with a mapper. All vertices in the DAG that can be executed in parallel are executed in parallel. These settings have no effect on Tez, which always runs in parallel. The same query will be represented as two mapper vertices (running in parallel) and a reducer running at the end. The final reducer can also start early, when the mappers are almost complete.

The settings hive.exec.parallel and hive.exec.parallel.thread.number do not affect the parallelism of a query on Tez, and they do not apply to two separate queries in a single script.

Two separate queries in a single script run one after the other, not in parallel (each with its own task parallelism).

Two Hive sessions, as in your last example, run in parallel (depending on the cluster resources available).

The difference in time can be measured using the Unix time command. The time reported by Hive is cluster time. If the cluster has no resources available, parallel tasks may wait for resources. Use the Job Tracker to check what exactly happens during execution.
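The two-session case can be measured from the shell by starting each session in the background and waiting for all of them. This is a rough sketch, not a benchmark harness: run_sessions_in_parallel and HIVE_BIN are hypothetical names, and writing each session's output to a .log file next to the script is an assumption made for illustration.

```shell
# Sketch only: start one Hive session per .hql file in the background,
# wait for all of them, and report the overall wall-clock time.
# HIVE_BIN and run_sessions_in_parallel are hypothetical names.
HIVE_BIN="${HIVE_BIN:-hive}"

run_sessions_in_parallel() {
  start=$(date +%s)
  for f in "$@"; do
    # Each session logs to <script>.log (an assumed naming choice).
    "$HIVE_BIN" -f "$f" > "$f.log" 2>&1 &
  done
  wait   # block until every background session has finished
  end=$(date +%s)
  echo "wall-clock seconds: $((end - start))"
}
```

If run_sessions_in_parallel file1.hql file2.hql reports roughly the time of the slower query rather than the sum of both, the sessions overlapped; the "Time taken" lines in each log still show what Hive reported per query.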

So there are actually different kinds of parallelism:

Parallelism of the jobs of a single query on MR: the parameters you are asking about control this kind.

Hive sessions running in parallel: these parameters do not affect it.

Tez vertex parallelism: these parameters do not affect it.

Parallel execution of instances of the same vertex (mapper or reducer; each can be started more than once): the instances run in parallel, and these parameters do not affect it.

leftjoin
  • Thank you @leftjoin. As you said, hive.exec.parallel works only for condition 1. The execution time with parallelism = true was 34 sec; with parallelism = false, 40 sec. – user1 Jan 04 '21 at 16:32
  • For condition 2, I guess there is no need to set the hive.exec.parallel parameter. Irrespective of parallelism = true / false, the execution time was almost the same. When true, 40 sec. When false, 38 / 41 sec. – user1 Jan 04 '21 at 16:33
  • Can hive.exec.parallel.thread.number = 8 be increased to any number? How can we decide the maximum number of threads to run in parallel? Thank you @leftjoin – user1 Jan 04 '21 at 16:38
  • @user1 Yes, it can be any number. If there is nothing to run in parallel, it will still work. For select count(1) from t1, nothing can be executed in parallel. But you can keep these settings in all cases. – leftjoin Jan 04 '21 at 16:53
  • Thank you @leftjoin. Only the MR and Spark execution engines are available in my Cloudera cluster. I see that you focused on the Tez execution engine for Hive performance improvement, among the other performance options shown in this link https://www.qubole.com/blog/hive-best-practices/ Which other options can bring a drastic performance improvement, in your opinion? Parallelism definitely saved 10 sec. – user1 Jan 04 '21 at 17:00
  • @user1 What affects performance: Vectorization - there are many parameters for vectorization. Not all of them work with Parquet and complex types. Mapper and reducer parallelism (the last kind in my answer), see also https://stackoverflow.com/a/48296562/2700344 - play with these settings – leftjoin Jan 04 '21 at 17:07
  • @user1 Mapper parallelism for MR: https://stackoverflow.com/a/48487306/2700344 – leftjoin Jan 04 '21 at 17:13