
I'm making my first attempt to access the Glue Catalog from Scala code.

I already had some trouble getting Maven to build my project (this helped a lot: "How to set up a local development environment for Scala Spark ETL to run in AWS Glue?").

But now I'm trying to run my code on an EMR cluster and I'm getting a java.lang.NoClassDefFoundError.

This is my code:

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, DynamicRecord, GlueContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory
import org.apache.spark.sql.functions.{col, month, year}

object JoinAndRelation {

  private val logger = LoggerFactory.getLogger(getClass)

  def main(sysArgs: Array[String]): Unit = {
    //Spark session creation with connection to Glue Catalog
    implicit val spark: SparkSession = SparkSession
      .builder
      .config(new SparkConf().setAppName("TestGlueAccess"))
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    val glueContext: GlueContext = new GlueContext(sc)
...

And this is the error:

19/02/08 15:35:26 INFO Client: 
     client token: N/A
     diagnostics: User class threw exception: java.lang.NoClassDefFoundError: com/amazonaws/services/glue/GlueContext
    at org.sergio.poc.JoinAndRelation$.main(JoinAndRelation.scala:41)
    at org.sergio.poc.JoinAndRelation.main(JoinAndRelation.scala)
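
To narrow this down, a small check like the one below could be run on the cluster to see whether the Glue classes are visible at runtime at all. It is only a sketch: `ClasspathCheck` and `isLoadable` are hypothetical names I made up for illustration, not part of the Glue library, and it only tests class loading, not whether Glue itself initializes correctly.

```scala
// Hypothetical helper (not part of AWS Glue): reports whether a class
// can be loaded by the current class loader at runtime.
object ClasspathCheck {

  // Returns true if the named class is visible on the runtime classpath.
  // initialize = false avoids running the class's static initializers.
  def isLoadable(className: String): Boolean =
    try {
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      // ClassNotFoundException: the class itself is missing.
      // LinkageError (e.g. NoClassDefFoundError): something it depends on is missing.
      case _: ClassNotFoundException | _: LinkageError => false
    }

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "com.amazonaws.services.glue.GlueContext", // class named in the stack trace
      "org.apache.spark.sql.SparkSession"        // Spark class, for comparison
    )
    names.foreach { n =>
      println(s"$n loadable: ${isLoadable(n)}")
    }
  }
}
```

If the Spark class loads but the Glue class does not, that would suggest the Glue jar was available at compile time but was not shipped with the application at runtime.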

I was able to compile it with Maven by adding glue-assembly.jar as a dependency. I also tried adding aws-java-sdk-core, but that didn't help:

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>glue-assembly</artifactId>
  <version>1.0</version>
  <scope>system</scope>
  <systemPath>${project.basedir}/libs/glue-assembly.jar</systemPath>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-core</artifactId>
  <version>1.11.445</version>
</dependency>

Finally, this is the command I use to run it:

spark-submit --class org.sergio.poc.JoinAndRelation --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 2 MyFirstScalaMavenProject-1.0-SNAPSHOT.jar

Has anyone faced the same issue?

Siodh
  • You are missing the class "com/amazonaws/services/glue/GlueContext" in your classpath. I do not know much about spark-submit, but have a look at [this answer](https://stackoverflow.com/questions/29099115/spark-submit-add-multiple-jars-in-classpath). I think you have to add `glue-assembly.jar` with `--jars`. – wirnse Feb 08 '19 at 22:01
  • Thanks for your answer. I'm getting this error when I run it with the --jars option: 19/02/12 07:31:15 INFO Client: Preparing resources for our AM container Exception in thread "main" java.lang.AbstractMethodError at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99) – Siodh Feb 12 '19 at 07:35
  • Actually, it seems to work if I run it locally with "--master local --deploy-mode client", but then I'm getting a Glue initialization error... – Siodh Feb 12 '19 at 07:48
