0
SELECT `col1`
     , `col2`
     , count(*)
FROM `tab1`
GROUP BY `col1`
       , `col2`
limit 10;

+-------+-------+--------+
| col1  | col2  |  _c2   |
+-------+-------+--------+
| A     | A     | 1      |
| A     | B     | 34241  |
| A     | C     | 12345  |
| A     | D     | 145    |
| A     | E     | 26     |
| A     | F     | 224547 |
| B     | A     | 1429   |
| B     | B     | 25     |
| B     | C     | 94     |
| B     | D     | 1      |
+-------+-------+--------+

If I take one of the results from that, and do a specific query for that combination, the result changes.

SELECT `col1`
     , `col2`
     , count(*)
FROM `tab1`
WHERE `col1`='A'
AND   `col2`='B'
GROUP BY `col1`
       , `col2`;

+-------+-------+--------+
| col1  | col2  |  _c2   |
+-------+-------+--------+
| A     | B     | 38944  |
+-------+-------+--------+

If I run set hive.map.aggr=true; then I get a different count, somewhere in between the two.

Any ideas why or how to fix?

If I run the same query with LIMIT 20 then it gives the right count. Or, I should say, the same count as the WHERE query, I haven't counted them myself to check that it is correct!

PhilHibbs
  • 859
  • 1
  • 13
  • 30

0 Answers0