I'm trying to take the top 25 items of a JavaPairRDD like this:

JavaPairRDD<String, Long> rdd = ...;
List<Tuple2<String, Long>> top25 = rdd.top(25, (t1, t2) -> {
    // Compare by value first (negated), then by key to break ties.
    if (!t1._2.equals(t2._2)) {
        return -1 * Long.compare(t1._2, t2._2);
    } else {
        return t1._1.compareTo(t2._1);
    }
});

This sorts first by value and, when values are equal, by key. When I run it, I get the following exception:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

I think the problem is that the inline lambda function playing the role of Comparator is not serializable.

I've got two questions. First, assuming I'm right, why does the Comparator need to be serializable? Second, how can I solve this problem?

Mehran
  • What if the `Comparator` contains state? Shouldn't it be serialized to preserve that state? – tsolakp Oct 26 '18 at 23:09
  • Personally, I have never come across a stateful comparator! It is conceivable, but not the everyday case. Usually, all a comparator needs is the two items to compare. – Mehran Oct 26 '18 at 23:11
  • Try using a custom comparator class that also implements `Serializable` and see if it works. – tsolakp Oct 26 '18 at 23:13
  • Thanks, it worked but I was hoping I could implement it with just a lambda function! – Mehran Oct 26 '18 at 23:24
  • A quick search leads to this very interesting lambda functionality: https://stackoverflow.com/questions/22807912/how-to-serialize-a-lambda – tsolakp Oct 26 '18 at 23:26
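The two suggestions from the comments can be sketched in plain Java. Spark ships the comparator to the executors along with the task, which is presumably why it has to be serializable. This is only a minimal sketch under some assumptions: `Map.Entry<String, Long>` stands in for `Tuple2<String, Long>` so that it runs without Spark on the classpath, and the names `SerializableComparatorSketch`, `ValueThenKey`, `lambdaComparator`, and `roundTrip` are made up for illustration. The second option relies on the intersection cast described in the linked question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SerializableComparatorSketch {

    // Option 1: a named comparator class that also implements Serializable.
    // Map.Entry is an assumption standing in for scala.Tuple2.
    public static class ValueThenKey
            implements Comparator<Map.Entry<String, Long>>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
            if (!a.getValue().equals(b.getValue())) {
                return -1 * Long.compare(a.getValue(), b.getValue());
            }
            return a.getKey().compareTo(b.getKey());
        }
    }

    // Option 2: keep the lambda, but cast it to an intersection type so the
    // compiler generates a lambda that is also Serializable.
    public static Comparator<Map.Entry<String, Long>> lambdaComparator() {
        return (Comparator<Map.Entry<String, Long>> & Serializable) (a, b) -> {
            if (!a.getValue().equals(b.getValue())) {
                return -1 * Long.compare(a.getValue(), b.getValue());
            }
            return a.getKey().compareTo(b.getKey());
        };
    }

    // Writes the object with Java serialization and reads it back. This throws
    // (wrapping a NotSerializableException) if the object is not serializable.
    @SuppressWarnings("unchecked")
    public static <T> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> items = new ArrayList<>();
        items.add(new SimpleEntry<>("b", 3L));
        items.add(new SimpleEntry<>("a", 3L));
        items.add(new SimpleEntry<>("c", 7L));

        // Both comparators survive serialization and still order the items.
        items.sort(roundTrip(lambdaComparator()));
        System.out.println(items);
        items.sort(roundTrip(new ValueThenKey()));
        System.out.println(items);
    }
}
```

In the Spark code from the question, the same cast would presumably be written with `Tuple2<String, Long>` in place of `Map.Entry<String, Long>` and passed as the second argument to `rdd.top(25, ...)`. I have only checked the plain-Java round trip above, not a Spark job.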
