
I'm developing an Apache Spark application in Scala 2.11 using SBT 1.3.10. I use an IDE on my local machine without having Spark/Hadoop/Hive installed; instead I added them as SBT dependencies (Hadoop 3.1.2, Spark 2.4.5, Hive 3.1.2). My SBT build definition is below:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.5",
  "org.apache.hadoop" % "hadoop-client" % "3.1.2",

  "com.fasterxml.jackson.core" % "jackson-core" % "2.9.10",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.10",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.9.10",


  // about these two later in the question
  "org.apache.hive" % "hive-exec" % "3.1.2",
  "org.apache.commons" % "commons-lang3" % "3.6"
)

In my application I'm reading a sample CSV file into a DataFrame with a provided schema:

        val init = spark.read
          .format("csv")
          .option("header", value = false)
          .schema(sampleCsvSchema)
          .load("src/main/resources/sample.csv")

        init.show(10, false)
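
For reference, sampleCsvSchema is a StructType along these lines; the column names and types here are hypothetical, since the question doesn't show the real ones:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema for illustration only; the real columns of sample.csv
// are not shown in the question.
val sampleCsvSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("name", StringType)
))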

At some point I had to add the org.apache.hive:hive-exec:3.1.2 dependency, and got an exception during execution:

Illegal pattern component: XXX
java.lang.IllegalArgumentException: Illegal pattern component: XXX
    at org.apache.commons.lang3.time.FastDatePrinter.parsePattern(FastDatePrinter.java:282)
    at org.apache.commons.lang3.time.FastDatePrinter.init(FastDatePrinter.java:149)
    at org.apache.commons.lang3.time.FastDatePrinter.<init>(FastDatePrinter.java:142)
    at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:369)
    at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:91)
    at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:88)
    at org.apache.commons.lang3.time.FormatCache.getInstance(FormatCache.java:82)
    at org.apache.commons.lang3.time.FastDateFormat.getInstance(FastDateFormat.java:165)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:139)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:41)
    ...

It says that org.apache.commons.lang3.time.FastDatePrinter.parsePattern() cannot parse the Spark timestamp format (org.apache.spark.sql.execution.datasources.csv.CSVOptions.timestampFormat), which is by default set to "yyyy-MM-dd'T'HH:mm:ss.SSSXXX". (Please note that my sample.csv doesn't have any timestamp data, but Spark goes through this stack of procedures anyway.)

Initially, org.apache.commons.lang3.time.FastDatePrinter was added to the project by the org.apache.commons:commons-lang3:3.6 dependency and worked fine. However, the org.apache.hive:hive-exec:3.1.2 library bundles its own implementation of the same package and class, which cannot parse "XXX" (and it cannot be excluded, as it is implemented inside the library itself).
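
A quick way to confirm which copy of the class actually wins on the runtime classpath is to ask the JVM where it was loaded from (a minimal sketch, not part of the original code):

// Minimal sketch: prints the jar that FastDatePrinter was loaded from,
// i.e. whether the commons-lang3 copy or the hive-exec copy wins.
val clazz = Class.forName("org.apache.commons.lang3.time.FastDatePrinter")
println(clazz.getProtectionDomain.getCodeSource.getLocation)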

So I have a situation where two library dependencies provide two implementations of the same package and class, and I need to choose a specific one of them during app execution. How can this be done?

P.S. I've found a workaround for this specific "java.lang.IllegalArgumentException: Illegal pattern component: XXX" issue, but I'm more interested in how to resolve such SBT dependency issues in general.

  • Remove all libraryDependencies but `"org.apache.spark" %% "spark-sql" % "2.4.5"` and start over. – Jacek Laskowski May 21 '20 at 15:41
  • @JacekLaskowski, thanks for the comment. When I start over, I eventually end up in the same situation. Yes, if I only leave `spark-sql` it will successfully read the CSV file, as it has the `commons-lang3` dependency inside. But I need `hive-exec` for my further application logic, and this `hive-exec` causes the problem. – Nementaarion May 21 '20 at 18:32
  • @JacekLaskowski and I tried to keep the question generic: how to deal with the situation when two libraries add a package and class with the same name but different implementations – Nementaarion May 21 '20 at 18:35
  • Can you describe why you need `hive-exec` dependency? You could define [spark-hive](https://search.maven.org/artifact/org.apache.spark/spark-hive_2.11) dependency instead if that's something needed for Spark SQL. – Jacek Laskowski May 22 '20 at 08:56
  • @JacekLaskowski, yes, actually I've initially started from `org.apache.spark:spark-hive_2.11:2.4.5`, which transitively adds `org.spark-project.hive:hive-exec:1.2.1.spark2`. But in this case I had an exception with unsupported Hadoop version: `java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.2 at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)`. That is the **only reason** why I overrode `hive-exec` dependency. I would be glad if you suggest how to overcome this without overriding `hive-exec` dependency. – Nementaarion May 22 '20 at 12:21
  • Why do you need this hive-exec dependency? – Jacek Laskowski May 22 '20 at 15:00
  • @JacekLaskowski, I don't really need it, it is added transitively by `com.holdenkarau:spark-testing-base:2.4.5_0.14.0` which I use for unit test. But when I run tests, "Unrecognized Hadoop major version number: 3.1.2" appears (and I found out that it goes from `org.spark-project.hive:hive-exec:1.2.1.spark2`). Basically, I don't use/need Hive and hive-exec in my application. – Nementaarion May 22 '20 at 15:21
  • OK. Makes more sense now. Do you enableHiveSupport while building a SparkSession? That'd be the only case I can think of where hive-exec would be needed. It'd be nice to have a test to reproduce the issue. Holden would love it! :) – Jacek Laskowski May 23 '20 at 10:59
  • @JacekLaskowski, in my test I don't build a SparkSession manually, but rather use predefined one from Holden library by extending _DatasetSuiteBase_ trait. I've gone through the code of _DataFrameSuiteBaseLike_ (where _spark_ comes from) and found that Holden enables Hive [by default](https://github.com/holdenk/spark-testing-base/blob/master/core/src/main/2.0/scala/com/holdenkarau/spark/testing/DataFrameSuiteBase.scala#L62). What would you recommend? Currently it seems impossible to use Holden and `org.spark-project.hive:hive-exec` with Hadoop 3 – Nementaarion May 23 '20 at 13:20
  • Could you report it as an issue in the project's repo? Since you don't need hive (yet the spark-testing-base enables it) why are you saying that _"it seems impossible to use it with Hadoop 3"_? Is this needed for tests only? Why? – Jacek Laskowski May 23 '20 at 13:32
  • @JacekLaskowski, I meant that as soon as _spark-testing-base_ enables Hive by default (even though I don't need it), it always throws "Unrecognized Hadoop major version number" when I have Hadoop 3 in the project. So it's more like "impossible to use _spark-testing-base_ with Hadoop 3". Anyway, thanks a lot for the support, I will report an issue. Returning to the question: is there a way in SBT to choose a specific version/implementation of a class when this class is added by 2 different libraries? Would appreciate any ideas. – Nementaarion May 23 '20 at 14:52

1 Answer


In situations of conflicting dependency versions, I usually:

  1. Exclude certain transitive dependencies of a dependency (see the sketch after this list). Ref: https://www.scala-sbt.org/1.x/docs/Library-Management.html#Exclude+Transitive+Dependencies
  2. For binary conflicts like the one mentioned in the question above, use dependencyOverrides and force the version I want (see the sketch after this list). Ref: https://www.scala-sbt.org/1.x/docs/Library-Management.html#Overriding+a+version
  3. Rarely, if the problem isn't solved by the two options above, rebuild my own version of the library with the compatible transitive dependency.
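
A minimal build.sbt sketch of the first two options, using the coordinates from the question (whether exclusion actually helps here depends on how the class is packaged, as discussed in the comments below):

// Option 1: exclude a transitive dependency from one module.
libraryDependencies += ("org.apache.hive" % "hive-exec" % "3.1.2")
  .exclude("org.apache.commons", "commons-lang3")

// Option 2: force one version of a conflicting library build-wide.
dependencyOverrides += "org.apache.commons" % "commons-lang3" % "3.6"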

P.S. Please look out for other flavours of hive-exec (if any), which would save you from such situations.

  • Thanks for the reply! However, these options didn't help. 1. _exclude_ doesn't help here, as there is nothing to exclude; `org.apache.commons.lang3.time.FastDatePrinter` is a direct member of `org.apache.hive:hive-exec:3.1.2`. 2. I've already tried putting `org.apache.commons:commons-lang3:3.6` into _dependencyOverrides_, but again it didn't help; the exception is still raised. And that seems right, as _dependencyOverrides_ can pin a certain version of one library, but I have one package coming from two different libraries – Nementaarion May 21 '20 at 15:08
  • Then I'd go for the 3rd option. Updated the answer – Som May 22 '20 at 01:07
  • Sorry, could you please explain in more detail what you mean by "rebuild my own version"? – Nementaarion May 22 '20 at 12:23
  • I mean that with most open-source packages such issues generally pop up. I usually clone the repo, make the transitive-dependency version changes that are needed, and build the jar. Maybe you can do this in this case :) – Som May 22 '20 at 12:29