0

I'm running Hive 1.1.0 and see that for two bigint columns, active_users and inactive_users, SUM(active_users + inactive_users) < SUM(active_users) + SUM(inactive_users).

Why is that the case, given it should be a simple addition of integers?

2 Answers2

3

This is the case when one but not both values are NULL.

So, consider:

 a     b
 1     1
 2     NULL
 NULL  3

Then sum(a) + sum(b) is the sum of: (1 + 2) and (1 + 3) = 7. The NULL is ignored.

However, sum(a + b) is the sum of:

 (1 + 1)
 (2 + NULL)
 (NULL + 3) 

This evaluates to:

 2
 NULL
 NULL

The NULLs are ignored, so the result is 2.

In other words, SUM() ignores NULL values, but + doesn't.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

There are 2 possibilities:

1) Not handling NULL value if columns are nullable

SUM(active_users + inactive_users) 
SUM(active_users) + SUM(inactive_users)

-- should be
SUM(COALESCE(active_users,0) + COALESCE(inactive_users,0)) 
SUM(COALESCE(active_users,0)) + SUM(COALESCE(inactive_users,0))

2) If column has inexact data type like FLOAT round error could accumulate.

More info: Avg of float inconsistency

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275