
I am trying to process hierarchical data using GraphX Pregel, and the code works fine on my local machine.

But when I run it on my Amazon EMR cluster, it gives me an error:

java.lang.NoClassDefFoundError: Could not initialize class

What would be the reason for this happening? I know the class is in the jar file, since it runs fine on my local machine and there is no build error.

I have included the GraphX dependency in my pom file.

Here is a snippet of the code where the error is being thrown:

import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import scala.util.hashing.MurmurHash3

def calcTopLevelHierarcy(vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] =
{
  // create the vertex RDD: (hashed id, (id, attribute, name))
  val verticesRDD = vertexDF.rdd
                  .map { x => (x.get(0), x.get(1), x.get(2)) }
                  .map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }

  // create the edge RDD: top-down relationship
  val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
                 .map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }

  // create the graph
  val graph = Graph(verticesRDD, EdgesRDD).cache()
  val pathSeperator = """/"""

  // initialize id, level, root, path, iscyclic, isleaf
  val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)
  val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))
  val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)

  // build the path from the list
  val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }
  hrchyOutRDD
}
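
For reference, here is a minimal sketch of how the function can be invoked. The column names and sample rows are placeholders rather than my actual data; I am only assuming that vertexDF carries three columns and edgeDF carries a parent/child pair, since the code above reads them positionally:

// hypothetical driver snippet; assumes a SparkContext `sc` already exists
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// vertex input: three columns, read positionally as (id, attribute, name)
val vertexDF = sqlContext.createDataFrame(Seq(
  ("1", 0, "CEO"),
  ("2", 0, "VP"),
  ("3", 0, "Manager")
)).toDF("id", "attr", "name")

// edge input: (parent, child), traversed top-down
val edgeDF = sqlContext.createDataFrame(Seq(
  ("1", "2"),
  ("2", "3")
)).toDF("parent", "child")

val hierarchy = calcTopLevelHierarcy(vertexDF, edgeDF)
hierarchy.collect().foreach(println)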

I was able to narrow it down to the line that is causing the error:

val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)

1 Answer


I had this exact same issue: the code ran fine in spark-shell but failed when executed via spark-submit. The code I was trying to execute looks essentially the same as yours.

The error that pointed me to the right solution was:

org.apache.spark.SparkException: A master URL must be set in your configuration

In my case, I was getting that error because I had defined the SparkContext outside the main function, in the body of the object. Once that object initialization fails, any further reference to the object typically surfaces as NoClassDefFoundError: Could not initialize class, which matches the error you are seeing:

object Test {
  val sc = SparkContext.getOrCreate
  val sqlContext = new SQLContext(sc)

  def main(args: Array[String]) {
    ...
  }
}

I was able to solve it by moving the SparkContext and sqlContext inside the main function, as described in this other post.
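
For completeness, a minimal sketch of that rearrangement (the app name is a placeholder; adapt the conf to your own job):

object Test {
  def main(args: Array[String]): Unit = {
    // create the contexts inside main, so they are only initialized on the driver
    // and never as part of the object's static initializer
    val conf = new org.apache.spark.SparkConf().setAppName("HierarchyJob")
    val sc = org.apache.spark.SparkContext.getOrCreate(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // ... rest of the job, e.g. calcTopLevelHierarcy(vertexDF, edgeDF)
  }
}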