What is the best way to return the max row (value) associated with each unique key in a Spark RDD?

I'm using Python and I've tried Math max, mapping and reducing by key, and aggregates. Is there an efficient way to do this? Possibly a UDF?

I have the data in RDD format:

[(v, 3),
 (v, 1),
 (v, 1),
 (w, 7),
 (w, 1),
 (x, 3),
 (y, 1),
 (y, 1),
 (y, 2),
 (y, 3)]

And I need to return:

[(v, 3),
 (w, 7),
 (x, 3),
 (y, 3)]

Ties can return the first value or random.

SiHa
captainKirk104

1 Answer

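What you actually have is a PairRDD (an RDD of key-value tuples). One of the best ways to get the maximum per key is reduceByKey, which combines the values within each partition before shuffling: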

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)
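
To make the Python one-liner concrete, here is a minimal sketch using the data from the question. It does not start Spark: `reduce_by_key_local` is a hypothetical local stand-in that folds the values of each key with a two-argument function, which is what `reduceByKey` does per key. With a real SparkContext (assumed to be named `sc`), the equivalent would be `sc.parallelize(data).reduceByKey(max).collect()`, whose result order is not guaranteed.

```python
def reduce_by_key_local(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey (not a Spark API).

    Folds all values that share a key with the two-argument function func
    and returns the (key, value) pairs sorted by key.
    """
    result = {}
    for key, value in pairs:
        # First value seen for a key is kept as-is; later ones are merged.
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())


# Data from the question, with the keys written as strings.
data = [("v", 3), ("v", 1), ("v", 1), ("w", 7), ("w", 1),
        ("x", 3), ("y", 1), ("y", 1), ("y", 2), ("y", 3)]

# The built-in max works as the merge function because it accepts two values.
max_per_key = reduce_by_key_local(data, max)
print(max_per_key)  # [('v', 3), ('w', 7), ('x', 3), ('y', 3)]
```

Because `max` is associative and commutative, the order in which Spark merges values within and across partitions does not change the result.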

(Java 7)

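// Untested; assumes rdd is already a JavaPairRDD<String, Integer>.
// Requires: import org.apache.spark.api.java.function.Function2;
JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
        }
    });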

(Java 8)

// Untested; assumes rdd is already a JavaPairRDD<String, Integer>.
JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);

API doc for reduceByKey:

Daniel de Paula
  • can you give a way to do this in Java as well? I am using java and looking for exactly the same thing – tsar2512 Jan 24 '17 at 22:47
  • @tsar2512 With Java 8, this might work: `new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));` – Daniel de Paula Jan 25 '17 at 09:12
  • Thanks for the response. Unfortunately, I am using Java 7, which does not allow lambda functions; one typically has to write anonymous functions. Could you let me know what the solution would be in Java 7? I suspect a simple comparator function should work! – tsar2512 Jan 25 '17 at 09:55
  • Additionally. What we are getting is the max of values which belong to each key. Is that correct? – tsar2512 Jan 25 '17 at 09:56
  • @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works. – Daniel de Paula Jan 25 '17 at 10:10
  • 1
    @DanieldePaula what if the initial RDD has tuples e.g. `[(v,(125,12)), (v, (130,15)), (w,(200,30)), (w,(250,40))]` instead of just a value and you need to find the result with the max value of the first element such as `[(v, (130,15)),(w,(250,40))]` – user2829319 Sep 23 '21 at 20:46
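
For the tuple-valued case raised in the last comment, one possible approach (a sketch, not something confirmed in the thread) is to pass `reduceByKey` a merge function that compares only the first element of each value. The name `max_by_first` is made up for illustration, and the loop below emulates the per-key fold locally instead of running Spark. With a real pair RDD the call would be `rdd.reduceByKey(max_by_first)`.

```python
def max_by_first(a, b):
    """Return whichever tuple has the larger first element (a wins ties)."""
    return a if a[0] >= b[0] else b


# Data from the comment, with the keys written as strings.
data = [("v", (125, 12)), ("v", (130, 15)),
        ("w", (200, 30)), ("w", (250, 40))]

# Local emulation of the per-key fold that reduceByKey would perform.
best = {}
for key, value in data:
    best[key] = max_by_first(best[key], value) if key in best else value

result = sorted(best.items())
print(result)  # [('v', (130, 15)), ('w', (250, 40))]
```

In Spark the merge order is not fixed, so which of two tuples with equal first elements survives should be treated as arbitrary. Plain `max` would also work on tuples, but it compares them lexicographically, so ties on the first element would be broken by the second.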