
I'm new to the Tez engine. I'm running Hive queries on Tez, and a single query seems to utilize all the available cluster resources. I'd like to know if there is any way to control the number of running containers — for example, the way Spark controls this with the --executor-cores and --num-executors settings.

I've searched and was not able to find anything concrete. I also don't want to separate workloads by queue, since I'm running on EMR with scaling enabled, and defining scaling rules across multiple queues complicates the setup.

Update 1: with vertex information


        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1            container       RUNNING     17          0       11        6       0       0
----------------------------------------------------------------------------------------------

The above query triggers one vertex in which 11 tasks are running in parallel (using all 11 available containers in the cluster). I'd like to control the number of concurrently running tasks within the vertex (in this example, from 11 down to 3).

Makubex
  • Read this answer: https://stackoverflow.com/a/42842117/2700344 – leftjoin Aug 03 '20 at 21:59
  • @leftjoin, Thanks for the response. I'm not specifically looking to control the number of mappers. I'd like to control the concurrency (how many map tasks should run at a given time). I've tried setting "SET tez.am.vertex.max-task-concurrency=3", but it doesn't seem to work. Not sure if I'm missing something. – Makubex Aug 04 '20 at 04:44
  • Are you trying to control how vertices are running? Tez builds a DAG and DAG itself determines which vertices can be executed in parallel and which are waiting for other vertices. – leftjoin Aug 04 '20 at 06:49
  • Tasks within mapper or reducer vertex - are map and reduce tasks (also known as containers). and I already answered how to control them, see the link in my first comment – leftjoin Aug 04 '20 at 07:05
  • @leftjoin, Not the vertices, but the number of concurrently running tasks within a vertex. I've updated the question with the vertex detail. – Makubex Aug 04 '20 at 07:06
  • Actually it seems like you are trying to limit resources used without using queues... https://issues.apache.org/jira/browse/TEZ-2914 – leftjoin Aug 04 '20 at 07:14
  • @leftjoin, Yeah, you're correct. I also found this as a plausible solution (stated in my first comment), but for some reason it doesn't seem to work. – Makubex Aug 04 '20 at 07:27
  • By controlling the size of data processed by a single container, as in this answer https://stackoverflow.com/a/42842117/2700344, you can create fewer "bigger" (though not necessarily bigger) containers, which will run in parallel more slowly. The effect is similar to having many small containers with some waiting while others run, but a few bigger containers will ultimately consume fewer resources than many small ones. – leftjoin Aug 04 '20 at 07:35
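
The size-based workaround leftjoin describes can be sketched as a session-level fragment. The byte values below are illustrative assumptions for an input of roughly a few gigabytes, not figures taken from this thread, and behavior varies by Hive/Tez version:

```sql
-- Make each split group larger so Tez creates fewer, bigger map tasks:
-- 1 GB per mapper instead of the default grouping (illustrative value).
SET tez.grouping.min-size=1073741824;
SET tez.grouping.max-size=1073741824;

-- Alternatively, some versions honor an explicit split count,
-- which would cap the vertex at 3 map tasks:
SET tez.grouping.split-count=3;
```

Either way, the vertex finishes with fewer concurrent containers, at the cost of each task running longer.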

1 Answer


Settings for queries on small datasets:

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.tez.container.size=10240;
set hive.tez.java.opts=-Xmx8192m;
set tez.runtime.io.sort.mb=4096;
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=1073741824; 
set tez.grouping.split-count=8;
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=10;
set hive.tez.auto.reducer.parallelism = true;
set tez.runtime.unordered.output.buffer.size-mb=1024;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- configs for bigger datasets :

set hive.execution.engine=tez;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.tez.container.size=10240;
set hive.tez.java.opts=-Xmx8192m;
set tez.runtime.io.sort.mb=4096;
set tez.runtime.unordered.output.buffer.size-mb=1024;
set tez.grouping.min-size=1073741824;
set tez.grouping.max-size=1073741824;
set tez.grouping.split-count=16;
set hive.exec.reducers.bytes.per.reducer=512000000;
set hive.exec.reducers.max=10;
set hive.tez.auto.reducer.parallelism = true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Note: Some of these configs might not be supported, depending on your Hive and Tez versions as well as your platform permissions.
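
To see how the grouping and reducer settings above translate into task counts, here is a rough back-of-the-envelope sketch. This is a simplification for intuition only, not Tez's actual split-grouping algorithm, and the input sizes are assumed values:

```python
import math

GB = 1024 ** 3

def estimated_mappers(input_bytes, max_group_size, split_count=None):
    """Rough estimate of map tasks: an explicit split-count wins;
    otherwise assume one mapper per max-group-size worth of input."""
    if split_count:
        return split_count
    return max(1, math.ceil(input_bytes / max_group_size))

def estimated_reducers(input_bytes, bytes_per_reducer, reducers_max):
    """Hive sizes reducers by bytes.per.reducer, capped at reducers.max."""
    return min(reducers_max, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# "Bigger datasets" profile from the answer, assuming a 64 GB input:
print(estimated_mappers(64 * GB, 1 * GB, split_count=16))   # capped at 16 by split-count
print(estimated_reducers(64 * GB, 512_000_000, 10))         # capped at 10 by reducers.max
```

The point of the larger grouping sizes is visible here: the same input produces far fewer, longer-running tasks, which limits how many containers the query occupies at once.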

sathya