2

Why is the below query taking forever to complete whereas if queried separately returns within seconds?

select count(*),count(distinct a), count(distinct b) from test.tablename;

However, if queried either of the below two queries it returns in seconds. e.g.

select count(*),count(distinct a) from test.tablename; 

or

select count(*),count(distinct b) from test.tablename;

anything to speed up? Already ran a analyze for compute stats. hive configurations are mostly optimized.

UPDATE: apologies. missed the count in the question.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
StrangerThinks
  • 246
  • 4
  • 14

2 Answers2

1

Alternative approach if you do not have too big counts (too big arrays will cause OOM). size(collect_set()) will give you the distinct count.

select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • 1
    I wasnt aware of collect_set or list functions... This performed really well! almost within a minute. INFO : MapReduce Jobs Launched: INFO : Stage-Stage-1: Map: 12 Reduce: 1 Cumulative CPU: 124.16 sec HDFS Read: 95745620 HDFS Write: 24 SUCCESS INFO : Total MapReduce CPU Time Spent: 2 minutes 4 seconds 160 msec I see this used 12 mappers and 1 reducer. Can you please explain why it messes up mr performance if I used plain old distinct on the query? and why it doesnt happen when querying the other two ,but only when all three requested? I would love to mark this as answered – StrangerThinks Nov 13 '20 at 22:35
  • @StrangerThinks Because different distincts require data to be sorted and distributed by different keys, this is not possible to do at the same step (reducer), multiple steps are involved each sorting and aggregating different keys. In case of collect_set, distinct arrays are aggregated on the same reducer vertex. This technic requires more memory to fit all sets. – leftjoin Nov 14 '20 at 12:34
  • @StrangerThinks see also: https://stackoverflow.com/a/61772631/2700344 – leftjoin Nov 14 '20 at 14:51
0

I recall that Hive does a poor job implementing count(distinct). Is it faster like this?

select (select count(*) from test.tablename),
       (select count(*) from (select a from test.tablename group by a) x),
       (select count(*) from (select b from test.tablename group by b) x)

If you have relatively few values of a and b, then this might also have decent performance:

select sum(cnt),
       sum(case when seqnum_a = 1 then 1 else 0 end),
       sum(case when seqnum_b = 1 then 1 else 0 end)
from (select a, b, count(*) as cnt,
             row_number() over (partition by a order by a) as seqnum_a,
             row_number() over (partition by b order by b) as seqnum_b
      from test.tablename
      group by a, b
     ) ab;

And your question is really "why". Hive is a parallel database but some operations are performed on a single processor. Once upon a time, count(distinct) was one of those operations. However, that might have been fixed . . . if there is only one count(distinct) in the query. Multiple such expressions might require sending all the data to a single node to be processed there -- a real performance killer.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • :). apologiies... i just made the edit... missed the count clause in the original question – StrangerThinks Nov 12 '20 at 21:15
  • Aah the group by alternate approach. I did see several options on stack, but will definitely try your suggestion and update here. Rest assured most of suggested params are set for best performance though. ref. https://stackoverflow.com/questions/48295667/hive-query-performance-for-high-cardinality-field/48296562#48296562 as well as hive.optimize.countdistinct=true; – StrangerThinks Nov 12 '20 at 22:15