I have a Hive installation with Tez as the execution engine. When I run simple queries (select * from table where col = value), it seems Hive doesn't rely on Tez at all.
The EXPLAIN output looks like this:
Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:
        {lots of columns}
      Filter Operator [FIL_4]
        predicate:(UDFToDouble(col) = value)
        TableScan [TS_0]
          Output:
            {lots of columns}
No stages are shown in the Hive UI for the query, there are no DAGs in the Tez UI, and I don't see any trace of Tez being invoked in the logs either. Judging by the Hive logs and CPU consumption, all processing happens on the master node, which makes it a bottleneck. As far as I can tell, it even processes the files serially, which makes the timings even worse.
For more complicated queries, Tez is involved and processing gets distributed across the workers of the cluster. The EXPLAIN output looks like this:
Plan optimized by CBO.

Vertex dependency in root stage
Map 1 <- Reducer 3 (BROADCAST_EDGE)
Reducer 3 <- Map 2 (SIMPLE_EDGE)
...
So, my questions are:
- On what basis does Hive decide when to process data via Tez and when not to? Any explanations, comments, or links would be much appreciated.
- Is there any way to force distributed processing, to take the load off the master node?
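
One thing that may be worth checking for the second question is Hive's fetch task conversion, which lets simple select/filter queries run as a single fetch task instead of being submitted to the execution engine. The sketch below is only a starting point under that assumption: nothing in the question confirms that this setting explains the behavior, and `my_table`, `col` and the literal `123` are placeholders rather than names from the actual setup.

```sql
-- Print the current values (SET with a bare property name shows its value).
SET hive.execution.engine;
SET hive.fetch.task.conversion;
SET hive.fetch.task.conversion.threshold;

-- Assumption: fetch task conversion is what keeps the query off Tez.
-- Disable it for this session only.
SET hive.fetch.task.conversion=none;

-- Placeholder query; compare this plan with the Stage-0 / Fetch Operator plan above.
EXPLAIN SELECT * FROM my_table WHERE col = 123;
```

If the assumption holds, the new plan should list Tez vertices rather than only a Fetch Operator. Keeping the change at session level makes it easy to revert if the extra job startup cost turns out to be worse for small tables.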