Assume I have the following RDD:
alist = [('a',[['1',2]]),('b',[['2',3]]),('b',[['8',5]]),('b',[['8',5]]),('c',[['4',22]]),('a',[['5',22]])]
anRDD = sc.parallelize(alist)
My task is, for each string key, to get the list with the highest int value (index 1 of the inner list). Given a huge amount of data and many distinct keys (string letters), which of the following methods is recommended?
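To make the goal concrete, for the sample data above the winning list per key would be:

[('a', ['5', 22]), ('b', ['8', 5]), ('c', ['4', 22])]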
Method 1:
import operator

def sortAndTake(alistoflists):
    # Sort the accumulated lists by their int value (index 1),
    # highest first, and keep only the winner.
    alistoflists.sort(key=operator.itemgetter(1), reverse=True)
    return alistoflists[0]

reducedRDD = anRDD.reduceByKey(lambda a, b: a + b)  # concatenate all lists per key
finalRDD = reducedRDD.map(lambda x: (x[0], sortAndTake(x[1])))
finalRDD.collect()
Method 2:
from pyspark.rdd import portable_hash

def partitioner(n):
    # Partition on the string letter only, so every composite key
    # (letter, value) for a given letter lands in the same partition.
    def partitioner_(x):
        return portable_hash(x[0]) % n
    return partitioner_

def sortIterator(iterator):
    # Each partition is sorted descending, so the first record seen
    # for each key carries the highest int value.
    oldKey = None
    for item in iterator:
        if item[0] != oldKey:
            oldKey = item[0]
            yield item

partitioned = anRDD.keyBy(lambda kv: (kv[0], kv[1][0][1]))
(partitioned
 .repartitionAndSortWithinPartitions(
     numPartitions=2,
     partitionFunc=partitioner(2), ascending=False)
 .map(lambda x: x[1])
 .mapPartitions(sortIterator))
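(If I am reading Method 2 correctly, one small difference: it yields the original tuples, so the value stays wrapped, e.g. ('a', [['5', 22]]), whereas Method 1 unwraps it to ('a', ['5', 22]).)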
(Method 2 is inspired by the accepted answer (by zero323) to a previous question of mine: Using repartitionAndSortWithinPartitions)
From my understanding, in the first method, if there are many distinct key values, there is a lot of shuffling between the workers during the reduceByKey, which could make Method 2 quicker (I am not sure whether the same happens when using repartitionAndSortWithinPartitions in Method 2).
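For what it's worth, here is a third variant I sketched while thinking about the shuffle, assuming only the single best list per key is needed. Since max is associative, reduceByKey should be able to combine map-side before the shuffle, so only one small record per key crosses the network (my assumption, not something I have benchmarked):

from operator import itemgetter

bestRDD = (anRDD
           .mapValues(lambda v: v[0])  # unwrap [['1', 2]] -> ['1', 2]
           .reduceByKey(lambda a, b: max(a, b, key=itemgetter(1))))
bestRDD.collect()
# [('a', ['5', 22]), ('b', ['8', 5]), ('c', ['4', 22])] (order may vary)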
Any insight? Thanks :)