
I'm fairly new to the Scala environment. I am receiving a deduplicate error while trying to assemble a Scala Spark job with the DataStax connector. I'd appreciate any advice as to what could resolve this issue.

My System:

  • Latest Scala (2.11.7) installed via brew
  • Latest Spark (2.10.5) installed via brew
  • Latest SBT (0.13.9) installed via brew
  • SBT Assembly plugin installed

My build.sbt:

name := "spark-test"

version := "0.0.1"

scalaVersion := "2.11.7"

// additional libraries
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" %     "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"

Console:

$ sbt assembly
...
[error] 353 errors were encountered during merge
java.lang.RuntimeException: deduplicate: different file contents found in the following:
/Users/bob/.ivy2/cache/io.netty/netty-all/jars/netty-all-4.0.29.Final.jar:META-INF/io.netty.versions.properties 
... 
  • @jbrown thanks for the pointer, it looks like using the `assemblyMergeStrategy` from [this example](https://github.com/databricks/learning-spark/blob/master/build.sbt) corrects this issue but does not resolve my overall issue – BobBrez Jan 06 '16 at 17:26
  • Well, the problem is caused by two jars in your dependency tree containing the same file, e.g. perhaps two different dependencies pull in different versions of the same library, and the conflict means sbt doesn't know how to resolve the issue automatically. That's where you decide via merge strategies. You may still see a warning (I forget), but as long as your project compiles and works you should be alright. I'll add my merge strategy in a comment. – jbrown Jan 06 '16 at 17:44
  • Unfortunately this was marked as a duplicate too eagerly. The main source of your problem is most likely that you're trying to use the Cassandra connector for Spark 1.5.0 while you're on Spark 1.6.0; there is no connector for 1.6.0 yet. Also, you mention "latest Spark (2.10.5)", but there is no such Spark version. You probably mean Spark built for Scala 2.10.5, which may also conflict with your use of Scala 2.11.7. Try `scalaVersion := "2.10.5"` and `"spark-core" % "1.5.0"` (see the sketch after these comments) and you may not have a merge issue anymore. – sgvd Jan 30 '16 at 12:38
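For reference, a minimal build.sbt sketch of the version alignment sgvd suggests. This is an assumption rather than a verified fix: whether connector 1.5.0-M3 actually works against Spark 1.5.0 should be checked against the connector's compatibility table.

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"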

1 Answer


As I put in my comment, this is due to sbt not knowing how to handle duplicate files. It might be caused by two of your dependencies depending on different versions of the same library. So you need to decide which merge strategy to use; check the sbt-assembly docs, but the options are things like "keep first", "keep last", etc.

As a reference, here's my merge strategy block for a spark project with not too many dependencies:

assemblyMergeStrategy in assembly := {
  // duplicate classes and resources: keep one copy instead of failing the merge
  case x if x.endsWith(".class") => MergeStrategy.last
  case x if x.endsWith(".properties") => MergeStrategy.last
  case x if x.contains("/resources/") => MergeStrategy.last
  case x if x.startsWith("META-INF/mailcap") => MergeStrategy.last
  case x if x.startsWith("META-INF/mimetypes.default") => MergeStrategy.first
  case x if x.startsWith("META-INF/maven/org.slf4j/slf4j-api/pom.") => MergeStrategy.first
  // everything else: fall back to the default strategy unless it is
  // deduplicate, which is what throws the error above
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    if (oldStrategy == MergeStrategy.deduplicate)
      MergeStrategy.first
    else
      oldStrategy(x)
}

// this jar caused issues so I just exclude it completely
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  cp filter {_.data.getName == "jetty-util-6.1.26.jar"}
}
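One more assumption worth checking, since the question only says "SBT Assembly plugin installed": the settings above only take effect if sbt-assembly is actually enabled for the build, typically via project/plugins.sbt. The version below is illustrative for sbt 0.13.x, not something stated in the question:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")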