I was told that count(distinct ) may result in data skew because only one reducer is used.
I made a test using a table with 5 billion data with 2 queries,
Query A:
select count(distinct columnA) from tableA
Query B:
select count(columnA) from
(select columnA from tableA group by columnA) a
Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seems expected.
However, I realize that both queries use 370 mappers
and 1 reducers
and thay have almost the same cumulative CPU seconds
. And this means they do not have geneiune difference and the time difference may caused by cluster load.
I am confused why the all use one 1 reducers and I even tried mapreduce.job.reduces
but it does not work. Btw, if they all use 1 reducers why do people suggest not to use count(distinct )
and it seems data skew is not avoidable?