1

I was told that count(distinct ) may result in data skew because only one reducer is used.

I made a test using a table with 5 billion data with 2 queries,

Query A:

select count(distinct columnA) from tableA

Query B:

select count(columnA) from
(select columnA from tableA group by columnA) a

Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seems expected.

However, I realize that both queries use 370 mappers and 1 reducers and thay have almost the same cumulative CPU seconds. And this means they do not have geneiune difference and the time difference may caused by cluster load.

I am confused why the all use one 1 reducers and I even tried mapreduce.job.reduces but it does not work. Btw, if they all use 1 reducers why do people suggest not to use count(distinct ) and it seems data skew is not avoidable?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
user2894829
  • 775
  • 1
  • 6
  • 26

1 Answers1

1

Both queries are using the same number of mappers which is expected and single final reducer, which is also expected because you need single scalar count result. Multiple reducers on the same vertex are running independently, isolated and each will produce it's own output, this is why the last stage has single reducer. The difference is in the plan.

In the first query execution single reducer reads each mapper output and does distinct count calculation on all the data, it process too much data.

Second query is using intermediate aggrgation and final reducer receives partially aggregated data (distinct values aggregated on previous step). Final reducer needs to aggregate partial results again to get final result, it can be much less data than in the first case.

As of Hive 1.2.0 there is optimization for count(distinct) and you do not need to rewrite query. Set this property: hive.optimize.distinct.rewrite=true

Also there is mapper aggregation (mapper can pre-aggregate data also and produce distinct values in the scope of their portion of data - splits) Set this property to allow map-side aggregation: hive.map.aggr=true

use EXPLAIN command to check the difference in the execution plan.

See also this answer: https://stackoverflow.com/a/51492032/2700344

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thanks for your detailed reply! When I run `Query A` without setting any property, I realize that the progress percentage of reducer sometimes decreases, e.g.: `17%, 18%, 0%, 0%, 0%, 0%, 18%, 0%, 0%, 17%, 20%, 21%, ...` while the progress percentage of mapper increases normally. Why could this happen? – user2894829 May 13 '20 at 11:18
  • @user2894829 Check the reducers execution log. Maybe some reducers are failing. Try to run on Tez. It will show negatibe numbers if reducers are failing: https://stackoverflow.com/a/45185353/2700344 – leftjoin May 13 '20 at 17:33
  • Thanks for your infomation! The last question about `set hive.map.aggr=true;`. I read some materials and get 2 infos: 1) This property is already set true by default in recent Hive. 2) A bloger says this property may lead to WRONG result for `count(*)` and he has to change it to `sum(1)` and then get correct result. Is this just a temporary bug? https://yq.aliyun.com/articles/622421 sorry this is written in Chinese but 2 Querys are outstanding. – user2894829 May 14 '20 at 02:30
  • @user2894829 I do not know about this bug. Test it. If it is a bug in your Hive version, create a Jira ticket.https://issues.apache.org/jira/projects/HIVE/issues – leftjoin May 14 '20 at 08:48
  • @user2894829 that is rather outdated. it was posted in 2015. Skip it. – leftjoin May 14 '20 at 08:53