Hive LIMIT changes GROUP BY result

Asked Feb 19 '20 at 11:08

Active Feb 20 '20 at 12:22

Viewed 84 times

SELECT `col1`
     , `col2`
     , count(*)
FROM `tab1`
GROUP BY `col1`
       , `col2`
limit 10;

+-------+-------+--------+
| col1  | col2  |  _c2   |
+-------+-------+--------+
| A     | A     | 1      |
| A     | B     | 34241  |
| A     | C     | 12345  |
| A     | D     | 145    |
| A     | E     | 26     |
| A     | F     | 224547 |
| B     | A     | 1429   |
| B     | B     | 25     |
| B     | C     | 94     |
| B     | D     | 1      |
+-------+-------+--------+

If I take one of the results from that, and do a specific query for that combination, the result changes.

SELECT `col1`
     , `col2`
     , count(*)
FROM `tab1`
WHERE `col1`='A'
AND   `col2`='B'
GROUP BY `col1`
       , `col2`;

+-------+-------+--------+
| col1  | col2  |  _c2   |
+-------+-------+--------+
| A     | B     | 38944  |
+-------+-------+--------+

If I run set hive.map.aggr=true; then I get a different count, somewhere in between the two.

Any ideas why or how to fix?

If I run the same query with LIMIT 20 then it gives the right count. Or, I should say, the same count as the WHERE query, I haven't counted them myself to check that it is correct!

edited Feb 20 '20 at 12:22

asked Feb 19 '20 at 11:08

PhilHibbs

This may be more suitable to dba.stackexchange.com so if I have no luck here I might delete-and-repost. – PhilHibbs Feb 19 '20 at 12:25
My workaroud for now is to add `HAVING count(*) > 0` – PhilHibbs Feb 20 '20 at 12:53
This may be the issue with statistics: https://stackoverflow.com/a/39914232/2700344 – leftjoin Feb 23 '20 at 08:04

Hive LIMIT changes GROUP BY result

0 Answers0