
My code looks like this (sorry, there's a reason I can't show the full code):

public class MyClass {

    final A _field1; // Non-serializable object
    final B _field2; // Non-serializable object

    public void doSomething() {
        myJavaDStream...
                     .mapToPair(t -> {
                         // Do some stuff with _field1 and _field2
                     })
                     .reduceByKey((b1, b2) -> {
                         // Do other stuff with _field1 and _field2
                     })
                     ...
    }
}

public static void main(String[] args) {
    MyClass myClass = new MyClass();
    myClass.doSomething();
}

Within IntelliJ, everything works just fine. But after building and submitting the jar file with spark-submit, it throws org.apache.spark.SparkException: Task not serializable. The stack trace points to the lambda in mapToPair.

My questions are: What's the difference between running within the IDE and in stand-alone mode? How can I make it work properly?

FuzzY
    Your class includes a couple of non-serializable objects, so by definition your class is _not_ serializable. When you run inside IntelliJ, everything runs locally, so it doesn't have to actually distribute the class. When running in stand-alone mode it _will_ have to distribute (and hence serialize) the class, which is why you see the error in stand-alone mode. – Glennie Helles Sindholt May 19 '17 at 08:37
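
The point in the comment can be reproduced outside Spark: Spark ships task closures with plain Java serialization, and an object graph containing a non-Serializable field fails that in exactly the same way. A minimal, self-contained sketch (the class A below is just a stand-in for the real non-serializable type):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {

    // Stand-in for the real non-serializable type held in _field1.
    static class A { }

    static class Holder implements Serializable {
        final A _field1 = new A(); // non-serializable field inside a Serializable class
    }

    public static void main(String[] args) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            // Throws java.io.NotSerializableException for A, which is the same check
            // that fails when Spark ships the lambda (and the enclosing instance)
            // to the executors.
            out.writeObject(new Holder());
        }
    }
}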

1 Answer


OK, so I just figured out the solution. For non-serializable classes, especially those from third-party libraries, you can wrap them in Twitter Chill's MeatLocker, which comes with Spark, like so:

import com.twitter.chill.MeatLocker;

public class MyClass {

    final MeatLocker<A> _field1; // Wraps the non-serializable object

    public void doSomething() {
        myJavaDStream...
                     .map(t -> {
                         // call _field1.get() instead of _field1 to unwrap the value
                     });
    }
}
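
Why this works: MeatLocker itself implements java.io.Serializable and Kryo-serializes whatever it wraps, so the enclosing class can now be shipped with plain Java serialization. A self-contained sketch of that round trip (the class A and its field are made up for illustration):

import com.twitter.chill.MeatLocker;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class MeatLockerDemo {

    // Stand-in for a third-party class that does not implement Serializable.
    static class A {
        String value = "wrapped";
    }

    public static void main(String[] args) throws Exception {
        MeatLocker<A> locker = new MeatLocker<>(new A());

        // Java-serialize the wrapper, which is what Spark does to the task closure.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(locker);
        }

        // Deserialize it again and unwrap the value with get(), as in the answer above.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            MeatLocker<A> roundTripped = (MeatLocker<A>) in.readObject();
            System.out.println(roundTripped.get().value); // prints "wrapped"
        }
    }
}

Inside doSomething() the same idea applies: keep the field as MeatLocker<A>, initialize it with new MeatLocker<>(...) once (for example in the constructor), and call _field1.get() inside the lambda to unwrap the value.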
FuzzY