The second query:
select count(*) from (select distinct a from x) y;
can be up to roughly 3x faster than the first query:
select count(distinct a) from x;
Please refer to
https://issues.apache.org/jira/browse/HIVE-10568
I ran both queries in Hive. The first query executed in a single stage with one reducer:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 46.51 sec HDFS Read: 42857 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 46 seconds 510 msec
The second query executed in two stages with improved parallelism, cutting total CPU time from 46.51 sec to 19.76 sec:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 13.93 sec HDFS Read: 42857 HDFS Write: 115 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.83 sec HDFS Read: 510 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 19 seconds 760 msec
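
To check which plan your own Hive installation produces, you can compare the two forms with EXPLAIN. This is only a sketch that reuses the table x and column a from the queries above. The stage layout you see depends on your Hive version and settings, and newer versions may apply the rewrite discussed in HIVE-10568 automatically, in which case both plans can look alike.

```sql
-- Single-stage form: without a GROUP BY, the distinct count is
-- typically funneled through one reducer.
EXPLAIN
SELECT COUNT(DISTINCT a) FROM x;

-- Rewritten form: the subquery deduplicates a first, then the outer
-- query counts the already-distinct rows in a second stage.
EXPLAIN
SELECT COUNT(*) FROM (SELECT DISTINCT a FROM x) y;
```

If the second plan shows an extra stage for the deduplication, that matches the two-stage run in the log above.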