I am trying to run the EnronMail example of the Hadoop-MongoDB connector for Spark. For this I am using the Java code example from GitHub:
https://github.com/mongodb/mongo-hadoop/blob/master/examples/enron/spark/src/main/java/com/mongodb/spark/examples/enron/Enron.java
I adjusted the server name and added a username and password to match my environment.
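The only part I changed is the MongoDB input URI (server name plus credentials); with placeholder values it looks roughly like this:

import org.apache.hadoop.conf.Configuration;

Configuration mongodbConfig = new Configuration();
// myUser, myPassword and myHost are placeholders, the real values differ
mongodbConfig.set("mongo.input.uri",
        "mongodb://myUser:myPassword@myHost:27017/enron_mail.messages");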
The error message I get is the following:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2066)
    at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:333)
    at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:332)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
    at org.apache.spark.api.java.JavaRDDLike$class.flatMap(JavaRDDLike.scala:130)
    at org.apache.spark.api.java.AbstractJavaRDDLike.flatMap(JavaRDDLike.scala:46)
    at Enron.run(Enron.java:43)
    at Enron.main(Enron.java:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: Enron
Serialization stack:
    - object not serializable (class: Enron, value: Enron@62b09715)
    - field (class: Enron$1, name: this$0, type: class Enron)
    - object (class Enron$1, Enron$1@ee8e7ff)
    - field (class: org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1, name: f$3, type: interface org.apache.spark.api.java.function.FlatMapFunction)
    - object (class org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 22 more
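As I read the serialization stack, the anonymous class Enron$1 is the FlatMapFunction passed to flatMap at Enron.java:43; because it is declared inside the instance method run(), it carries an implicit this$0 reference to the enclosing, non-serializable Enron instance. Stripped down, the pattern from the example looks like this (mongoRDD set up as in the example):

JavaRDD<String> edges = mongoRDD.flatMap(
        new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
            @Override
            public Iterable<String> call(final Tuple2<Object, BSONObject> t) throws Exception {
                // real body builds the from|to pairs, see below; the class is
                // compiled as Enron$1 with a hidden this$0 field pointing at
                // the Enron instance, which Spark then tries to serialize
                return new ArrayList<String>();
            }
        });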
To get rid of that reference, I then created a new serializable class holding the FlatMapFunction and let the Enron class extend it. This did not solve the problem. Any ideas how to solve this issue?
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.bson.BSONObject;

import scala.Tuple2;

class FlatMapFunctionSer implements Serializable {

    // maps one MongoDB document to the list of its "from|to" pairs
    static final FlatMapFunction<Tuple2<Object, BSONObject>, String> flatFunc =
            new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
                @Override
                public Iterable<String> call(final Tuple2<Object, BSONObject> t) throws Exception {
                    BSONObject header = (BSONObject) t._2().get("headers");
                    String to = (String) header.get("To");
                    String from = (String) header.get("From");
                    // each entry in the result is an individual from|to pair
                    List<String> tuples = new ArrayList<String>();
                    if (to != null && !to.isEmpty()) {
                        for (String recipient : to.split(",")) {
                            String s = recipient.trim();
                            if (s.length() > 0) {
                                tuples.add(from + "|" + s);
                            }
                        }
                    }
                    return tuples;
                }
            };
}
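For completeness, this is roughly how I then call it in run(), mongoRDD being the JavaPairRDD<Object, BSONObject> from the example:

JavaRDD<String> edges = mongoRDD.flatMap(FlatMapFunctionSer.flatFunc);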