
I'm fairly new to the Scala environment. I am receiving a deduplicate error while trying to assemble a Scala Spark job with the DataStax connector. I'd appreciate any advice as to what could resolve this issue.

My System:

  • Latest Scala (2.11.7) installed via brew
  • Latest Spark (2.10.5) installed via brew
  • Latest SBT (0.13.9) installed via brew
  • SBT Assembly plugin installed

My build.sbt:

name := "spark-test"

version := "0.0.1"

scalaVersion := "2.11.7"

// additional libraries
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" %     "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"

Console:

$ sbt assembly
...
[error] 353 errors were encountered during merge
java.lang.RuntimeException: deduplicate: different file contents found in the following:
/Users/bob/.ivy2/cache/io.netty/netty-all/jars/netty-all-4.0.29.Final.jar:META-INF/io.netty.versions.properties 
... 
  • @jbrown thanks for the pointer, it looks like using the `assemblyMergeStrategy` from [this example](https://github.com/databricks/learning-spark/blob/master/build.sbt) corrects this issue but does not resolve my overall issue – BobBrez Jan 06 '16 at 17:26
  • Well, the problem is caused by two jars in your dependency tree containing the same file, e.g. perhaps two different dependencies pull in different versions of the same library, and the conflict means sbt doesn't know how to resolve the issue automatically. That's where you decide via merge strategies. You may still see a warning (I forget), but as long as your project compiles and works you should be alright. I'll add my merge strategy in a comment. – jbrown Jan 06 '16 at 17:44
  • Unfortunately this was marked as a duplicate too eagerly. The main source of your problem is most likely that you're trying to use the Cassandra connector for Spark 1.5.0 while you're on Spark 1.6.0; there is no connector for 1.6.0 yet. Also, you mention "latest Spark (2.10.5)", but there is no such Spark version. You probably mean Spark built for Scala 2.10.5, which may also conflict with your use of Scala 2.11.7. Try `scalaVersion := "2.10.5"` and `"spark-core" % "1.5.0"` (see the sketch after these comments) and you may not have a merge issue anymore. – sgvd Jan 30 '16 at 12:38
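For reference, a minimal build.sbt sketch of the version alignment sgvd suggests. This is an assumption rather than a verified fix: whether connector 1.5.0-M3 actually works against Spark 1.5.0 should be checked against the connector's compatibility table.

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"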

1 Answer


As I put in my comment, this is due to sbt not knowing how to handle duplicate files. It might be caused by two of your dependencies depending on different versions of the same library. So you need to decide which merge strategy to use; check the sbt-assembly docs, but the options are things like "keep first", "keep last", etc.

As a reference, here's my merge strategy block for a spark project with not too many dependencies:

assemblyMergeStrategy in assembly := {
  // duplicate classes and resources: keep one copy instead of failing the merge
  case x if x.endsWith(".class") => MergeStrategy.last
  case x if x.endsWith(".properties") => MergeStrategy.last
  case x if x.contains("/resources/") => MergeStrategy.last
  case x if x.startsWith("META-INF/mailcap") => MergeStrategy.last
  case x if x.startsWith("META-INF/mimetypes.default") => MergeStrategy.first
  case x if x.startsWith("META-INF/maven/org.slf4j/slf4j-api/pom.") => MergeStrategy.first
  // everything else: fall back to the default strategy unless it is
  // deduplicate, which is what throws the error above
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    if (oldStrategy == MergeStrategy.deduplicate)
      MergeStrategy.first
    else
      oldStrategy(x)
}

// this jar caused issues so I just exclude it completely
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  cp filter {_.data.getName == "jetty-util-6.1.26.jar"}
}
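One more assumption worth checking, since the question only says "SBT Assembly plugin installed": the settings above only take effect if sbt-assembly is actually enabled for the build, typically via project/plugins.sbt. The version below is illustrative for sbt 0.13.x, not something stated in the question:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")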