2

Given a table

+------------+-----------+
 |  Number   |  Count    |
+------------+-----------+
 |  0        |  7        |
+------------+-----------+
 |  1        |  1        |
+------------+-----------+
 |  2        |  3        |
+------------+-----------+
 |  4        |  1        |
+------------+-----------+

Which is representing such a number sequence: 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 4

find the median number, in this case it's 0, with sql. You will need to run this query in hive (qubole)

Thoughts?

Lenix
  • 23
  • 3
  • @Drew, can you point to the question this is a duplicate of? I searched but could not find one. This isn't a standard median calculation. –  Oct 06 '15 at 06:45

1 Answers1

1

There is a fairly straightforward solution in Hive. You'll need this UDF here. Essentially, you want to un-aggregate your count data and then percentile it.

Query:

add jar /path/to/jar/brickhouse-0.7.1.jar;
create temporary function numeric_range as 'brickhouse.udf.collect.NumericRange';

select percentile(number, 0.50) median
from (
  select number
  from db.table
  lateral view numeric_range(count) n1 as n) x 

The inner query will produce

0
0
0
0
0
0
0
1
2
2
2
4

Then you can just use the percentile() function on this column

Output:

median
------
0.0
o-90
  • 17,045
  • 10
  • 39
  • 63