SELECT `col1`
, `col2`
, count(*)
FROM `tab1`
GROUP BY `col1`
, `col2`
limit 10;
+-------+-------+--------+
| col1 | col2 | _c2 |
+-------+-------+--------+
| A | A | 1 |
| A | B | 34241 |
| A | C | 12345 |
| A | D | 145 |
| A | E | 26 |
| A | F | 224547 |
| B | A | 1429 |
| B | B | 25 |
| B | C | 94 |
| B | D | 1 |
+-------+-------+--------+
If I take one of the results from that, and do a specific query for that combination, the result changes.
SELECT `col1`
, `col2`
, count(*)
FROM `tab1`
WHERE `col1`='A'
AND `col2`='B'
GROUP BY `col1`
, `col2`;
+-------+-------+--------+
| col1 | col2 | _c2 |
+-------+-------+--------+
| A | B | 38944 |
+-------+-------+--------+
If I run set hive.map.aggr=true;
then I get a different count, somewhere in between the two.
Any ideas why or how to fix?
If I run the same query with LIMIT 20
then it gives the right count. Or, I should say, the same count as the WHERE
query, I haven't counted them myself to check that it is correct!