
I'm trying to assemble a Spark application using sbt 1.0.4 with sbt-assembly 0.14.6.

The Spark application works fine when launched from IntelliJ IDEA or via spark-submit, but when I run the assembled uber-jar from the command line (cmd in Windows 10):

java -Xmx1024m -jar my-app.jar

I get the following exception:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html

The Spark application looks as follows.

package spark.main

import java.util.Properties    
import org.apache.spark.sql.SparkSession

object Main {

    def main(args: Array[String]) {
        val connectionProperties = new Properties()
        connectionProperties.put("user","postgres")
        connectionProperties.put("password","postgres")
        connectionProperties.put("driver", "org.postgresql.Driver")

        val testTable = "test_tbl"

        val spark = SparkSession.builder()
            .appName("Postgres Test")
            .master("local[*]")
            .config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
            .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir") + "swd")
            .getOrCreate()

        val dfPg = spark.sqlContext.read.
            jdbc("jdbc:postgresql://localhost/testdb",testTable,connectionProperties)

        dfPg.show()
    }
}

The following is build.sbt.

name := "apache-spark-scala"

version := "0.1-SNAPSHOT"

scalaVersion := "2.11.8"

mainClass in Compile := Some("spark.main.Main")

libraryDependencies ++= {
    val sparkVer = "2.1.1"
    val postgreVer = "42.0.0"
    val cassandraConVer = "2.0.2"
    val configVer = "1.3.1"
    val logbackVer = "1.7.25"
    val loggingVer = "3.7.2"
    val commonsCodecVer = "1.10"
    Seq(
        "org.apache.spark" %% "spark-sql" % sparkVer,
        "org.apache.spark" %% "spark-core" % sparkVer,
        "com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
        "org.postgresql" % "postgresql" % postgreVer,
        "com.typesafe" % "config" % configVer,
        "commons-codec" % "commons-codec" % commonsCodecVer,
        "com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
        "org.slf4j" % "slf4j-api" % logbackVer
    )
}

dependencyOverrides ++= Seq(
    "io.netty" % "netty-all" % "4.0.42.Final",
    "commons-net" % "commons-net" % "2.2",
    "com.google.guava" % "guava" % "14.0.1"
)

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}

Does anyone have any idea why?

[UPDATE]

The configuration taken from the official sbt-assembly GitHub repository did the trick:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case "services" :: _ => MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}
1 Answer


The question is almost a duplicate of Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?, with the difference that the other OP used Apache Maven to create the uber-jar, while here it's about sbt (the sbt-assembly plugin's configuration, to be precise).


The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister implementation.

For the jdbc alias to work, Spark SQL uses a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the following entry (among others):

org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider

That's what ties the jdbc alias to the data source implementation.
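
As a quick sanity check, you could list the aliases that are actually registered on your classpath; a minimal sketch (using the standard java.util.ServiceLoader mechanism that Spark SQL itself relies on) looks like this:

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Print the short names (aliases) of every data source registered via
// META-INF/services on the current classpath. If "jdbc" is not listed
// when run from the uber-jar, the service file was discarded.
ServiceLoader.load(classOf[DataSourceRegister])
  .asScala
  .foreach(ds => println(ds.shortName()))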

And you've excluded it from the uber-jar with the following assemblyMergeStrategy:

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}

Note case PathList("META-INF", xs @ _*), which simply applies MergeStrategy.discard and so drops the service file. That's the root cause.

Just to check that the "infrastructure" is in place and that you can use the jdbc data source by its fully-qualified class name (not the alias), try this:

spark.read.
  format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
  load("jdbc:postgresql://localhost/testdb")

You will see other problems due to missing options like url, but... we're digressing.
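
For completeness, a sketch of the same read with the usual JDBC options supplied (the table name and credentials are taken from the question; adjust them to your setup):

// Same read as in the question, but via the provider's fully-qualified name.
spark.read.
  format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
  option("url", "jdbc:postgresql://localhost/testdb").
  option("dbtable", "test_tbl").
  option("user", "postgres").
  option("password", "postgres").
  option("driver", "org.postgresql.Driver").
  load().
  show()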

A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that creates an uber-jar with all data sources registered, including the jdbc data source):

case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
    Thanks for the hint. Actually the default setting, taken from [here](https://github.com/sbt/sbt-assembly), did the trick. See the updated question. – korbee82 Jan 08 '18 at 16:26
  • Can you check if `case "services" :: _ => MergeStrategy.filterDistinctLines` alone would work. You're only interested in `META-INF/services/*` so that line alone should be enough. Thanks a lot for the update (and accepting my answer)! – Jacek Laskowski Jan 08 '18 at 16:47