Is it possible to find the median in Spark in a distributed way? I am currently computing Sum, Average, Variance, and Count using the following code:
dataSumsRdd = (numRDD
    .filter(lambda x: filterNum(x[1]))
    .map(lambda line: (line[0], float(line[1])))
    .aggregateByKey((0.0, 0.0, 0.0),
                    # accumulate (sum, sum of squares, count) per key
                    lambda acc, v: (acc[0] + v, acc[1] + v ** 2, acc[2] + 1.0),
                    # merge partial accumulators across partitions
                    lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2])))
# Generate RDD of (count, sum, average, variance) per key
dataStatsRdd = dataSumsRdd.mapValues(
    lambda acc: (acc[2], acc[0], acc[0] / acc[2],
                 round(acc[1] / acc[2] - (acc[0] / acc[2]) ** 2, 7)))
I am not quite sure how to find the median, though. To get the standard deviation I just take the square root of the variance locally (a short sketch of that step is below), and once I have the median I can then easily compute the skewness locally as well.
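For completeness, here is a minimal sketch of that local step, assuming the number of distinct keys (columns) is small enough to collect the per-key stats back to the driver:

import math

# Pull the small per-key (count, sum, mean, variance) map back to the driver
# and take the square root of each variance to get the standard deviation.
stats = dataStatsRdd.collectAsMap()
stdDevByKey = {key: math.sqrt(var) for key, (count, total, mean, var) in stats.items()}
# Once the per-key median is known, a median-based skewness such as Pearson's
# 3 * (mean - median) / stddev could be filled in here too (my assumption about
# which skewness measure is intended).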
My data is in key/value pairs (key = column).
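One RDD-only direction that might work for the median (a sketch I have not verified at scale; median_by_key is just my helper name, and it assumes the number of distinct keys is small and that keys sort consistently): sort globally by (key, value), attach positions with zipWithIndex, and fetch only the middle position(s) of each key's block.

def median_by_key(pair_rdd):
    # pair_rdd: (key, float) pairs like the RDD built above.
    counts = dict(pair_rdd.countByKey())            # key -> n, small driver-side dict

    # Globally sort by (key, value) so each key's values form one contiguous,
    # ordered block, then attach a global position to every element.
    indexed = (pair_rdd
               .sortBy(lambda kv: (kv[0], kv[1]))
               .zipWithIndex()                      # ((key, value), position)
               .map(lambda t: (t[1], t[0])))        # position -> (key, value)

    # Work out on the driver which global positions hold each key's median
    # (two positions when n is even, one when n is odd).
    offsets, running = {}, 0
    for k in sorted(counts):
        offsets[k] = running
        running += counts[k]
    wanted = set()
    for k, n in counts.items():
        wanted.add(offsets[k] + (n - 1) // 2)
        wanted.add(offsets[k] + n // 2)

    # Fetch only those few elements and average the middle value(s) per key.
    return (indexed.filter(lambda kv: kv[0] in wanted)
                   .map(lambda kv: kv[1])           # back to (key, value)
                   .groupByKey()
                   .mapValues(list)
                   .mapValues(lambda vs: sum(vs) / len(vs)))

cleaned = numRDD.filter(lambda x: filterNum(x[1])).map(lambda line: (line[0], float(line[1])))
medians = median_by_key(cleaned)

If moving to DataFrames were an option, I believe newer Spark versions also offer DataFrame.approxQuantile(column, [0.5], relativeError) for an approximate median, but I would still like to understand the pure RDD route.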