This may be an old question, but it is still pending a solution. The entire question stems from a small detail in the development of Apache Spark, one of the largest open-source projects in history.

During the delivery and release of Spark 1.x and 2.x, a key library dependency (Apache Hive 1.x) was found to introduce too many obsolete transitive dependencies and to be prone to conflicts when deployed with YARN/HDFS. Realising that it would not have enough resources to enforce the mono-repo principle (namely, ensuring that each library in the dependency tree has exactly 1 version), the team made, compiled and published a hard fork of Apache Hive:

https://github.com/JoshRosen/hive

https://mvnrepository.com/artifact/org.spark-project.hive/hive-common/1.2.1.spark2

Its only difference from the official Apache Hive is that all source code references to the package "org.apache.hive" were replaced with "org.spark-project.hive".

This is obviously a lousy way of using another project: the fork either falls behind the development of the Apache Hive community, or requires mundane, repetitive work to stay up to date. It also opens a dangerous exploit, where an unsigned jar could be swapped in for the migrated jar (also unsigned) in an Apache Spark installation. As a result, the migrated project was discontinued after Spark 3.0: with enough resources available, the original Apache Hive 2.x was introduced, with most obsolete dependencies upgraded.

One would hope that, 5 years after the release of Apache Spark 2.0, this process would be largely automated thanks to improvements in compilation tools and plugins. In particular, 2 plugins (maven shade plugin and gradle shadow plugin) are designed specifically for relocating packages in dependencies, and could be used to generate the migrated bytecode of Apache Hive directly from the canonical Hive. But a quick experiment revealed that neither of them can accomplish such a simple task:

https://github.com/tribbloid/autoshade

This project contains 2 subprojects that exist only for repacking, one built with maven and the other with gradle.

The maven subproject uses maven shade plugin to relocate json4s into repacked.test1.org.json4s:


  <dependencies>
    <dependency>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_${vs.scalaBinaryV}</artifactId>
      <version>4.0.4</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>

            <configuration>
              <!--              <createSourcesJar>true</createSourcesJar>-->

              <createDependencyReducedPom>true</createDependencyReducedPom>
              <dependencyReducedPomLocation>${project.build.directory}/dependency-reduced-pom.xml</dependencyReducedPomLocation>
              <!--              <generateUniqueDependencyReducedPom>true</generateUniqueDependencyReducedPom>-->

              <keepDependenciesWithProvidedScope>false</keepDependenciesWithProvidedScope>
              <promoteTransitiveDependencies>false</promoteTransitiveDependencies>

              <!--              <shadedClassifierName>${spark.classifier}</shadedClassifierName>-->
              <relocations>
                <relocation>
                  <pattern>org.json4s</pattern>
                  <shadedPattern>repacked.test1.org.json4s</shadedPattern>
                </relocation>
              </relocations>

              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

The gradle project uses shadow plugin to relocate json4s into repacked.test2.org.json4s:

dependencies {

    api("org.json4s:json4s-jackson_${vs.scalaBinaryV}:4.0.4")
}

tasks {
    shadowJar {
        exclude("META-INF/*.SF")
        exclude("META-INF/*.DSA")

        relocate("org.json4s", "repacked.test2.org.json4s")
    }

}

After that, a third project (in gradle, but it doesn't matter) declares both as dependencies and uses Scala to access the relocated classes:


dependencies {
    api(project(":repack:gradle", configuration = "shadow"))

    api("com.tribbloids.autoshade:repack-maven:0.0.1")
}

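// Package clause and imports are elided here. Judging by the compiler errors
// below, `repacked` is in scope (e.g. via `import repacked._`), so that
// test1 / test2 resolve to repacked.test1 / repacked.test2.
class Json4sTest {

  classOf[test1.org.json4s.Formats]

  classOf[test2.org.json4s.Formats]
}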

Surprisingly, it cannot be compiled:

[Error] /home/peng/git-proto/autoshade/main/src/main/scala/com/tribbloids/spookystuff/Json4sTest.scala:7:11: Symbol 'term org.json4s' is missing from the classpath.
This symbol is required by ' <none>'.
Make sure that term json4s is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.
[Error] /home/peng/git-proto/autoshade/main/src/main/scala/com/tribbloids/spookystuff/Json4sTest.scala:7:28: type Formats is not a member of package repacked.test1.org.json4s
[Error] /home/peng/git-proto/autoshade/main/src/main/scala/com/tribbloids/spookystuff/Json4sTest.scala:10:11: Symbol 'term org.json4s' is missing from the classpath.
This symbol is required by ' <none>'.
Make sure that term json4s is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.
[Error] /home/peng/git-proto/autoshade/main/src/main/scala/com/tribbloids/spookystuff/Json4sTest.scala:10:28: type Formats is not a member of package repacked.test2.org.json4s

The 1st and 3rd error messages do not appear when referring to a non-existing class. It can therefore be speculated that the package migration was incomplete and inconsistent, which makes it useless compared to the manual source-code migration that the Apache Spark team did before.
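One way to probe this speculation is to check which classes in a relocated jar still carry Scala-specific metadata. The following is a minimal sketch, not part of the original project: `ScalaSigScan` and its method names are hypothetical, and it assumes Scala 2.13 on Java 9+ and a library compiled with Scala 2, where scalac marks top-level classes with the `scala.reflect.ScalaSignature` (or `ScalaLongSignature`) annotation. It only lists the classes that carry such an annotation; it does not decode the signature, so it cannot by itself prove that the old package name is still stored inside.

```scala
import java.nio.charset.StandardCharsets.ISO_8859_1
import java.util.zip.ZipFile
import scala.jdk.CollectionConverters._

// Hypothetical helper: lists classes in a jar that carry a pickled Scala signature.
object ScalaSigScan {

  // Descriptors of the annotations scalac (Scala 2) uses for pickled signatures.
  val Markers: Seq[String] = Seq(
    "Lscala/reflect/ScalaSignature;",
    "Lscala/reflect/ScalaLongSignature;"
  )

  /** True if the raw class-file bytes contain one of the annotation descriptors. */
  def hasScalaSignature(classBytes: Array[Byte]): Boolean = {
    // ISO-8859-1 maps bytes to chars 1:1, so ASCII constants can be searched as text.
    val text = new String(classBytes, ISO_8859_1)
    Markers.exists(m => text.contains(m))
  }

  /** Names of all .class entries in the jar that carry a Scala signature annotation. */
  def scan(jarPath: String): List[String] = {
    val zip = new ZipFile(jarPath)
    try {
      zip.entries().asScala
        .filter(e => !e.isDirectory && e.getName.endsWith(".class"))
        .filter { e =>
          val in = zip.getInputStream(e)
          try hasScalaSignature(in.readAllBytes()) // readAllBytes needs Java 9+
          finally in.close()
        }
        .map(_.getName)
        .toList
    } finally zip.close()
  }

  // Usage: pass the path of a shaded jar as the first argument.
  def main(args: Array[String]): Unit =
    args.headOption.foreach(path => scan(path).foreach(println))
}
```

If the classes under repacked/test1/org/json4s/ show up in the output, they still contain a signature blob that a Java-only relocator has most likely copied through unchanged, which would be consistent with the "Symbol 'term org.json4s' is missing" errors above.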

So why is it so hard for such a simple task to be automated? What extra steps are required in either maven or gradle to make it work?

tribbloid
  • It looks like the error comes from Scala-specific metadata in the class files retaining references to their original names. It is unreasonable to expect tools built for Java only to understand this metadata and safely translate it. `sbt` (the popular Scala build system) with the plugin `sbt-assembly` can shade Scala only because the underlying classfile processor is passed a set of extra rules for handling Scala metadata. I do not know if you can add these rules to the plugins you are using. E.g. https://github.com/sbt/sbt-assembly/blob/develop/README.md#scala-libraries – HTNW Sep 21 '22 at 17:09
  • @HTNW I'll add an sbt example to verify your idea shortly, but I'm skeptical, if it is that easy it should be migrated long time ago – tribbloid Sep 22 '22 at 19:43
  • @HTNW do you have an example of an sbt project that publishes the repackaged project? The sbt-plugin appears to be unfit for this purpose (see https://github.com/sbt/sbt-assembly#publishing-not-recommended) – tribbloid Sep 23 '22 at 22:04
  • That note applies to anything capable of doing shading... `sbt-assembly` doesn't *do* anything markedly different from `maven-shade-plugin`! – HTNW Sep 23 '22 at 22:22
  • If so, then both of them are broken. It seems unlikely that modern software are built on such sand castles. – tribbloid Sep 26 '22 at 16:41
  • Wow, I've tested sbt repack and it indeed turns out to be working. So it is possible in the end, maybe maven & gradle didn't keep up? Will publish my proposal after the bounty expire – tribbloid Sep 27 '22 at 15:36
  • For a preview of my solution, see https://github.com/tribbloid/autoshade/commit/7b9f79056d283e3a9ba2fa3f17e431099133e107 – tribbloid Sep 27 '22 at 18:03

1 Answer

At this moment (Oct 13 2022), the only working solution is through sbt. The following build file, used in https://github.com/tribbloid/autoshade/blob/main/repack/sbt/build.sbt, calls AssemblyPlugin to publish a shaded assembly jar:

project
  .in(file("."))
  .settings(commonSettings)
  .settings(
    scalacOptions += "-Ymacro-annotations",
    libraryDependencies ++= Seq(
      "org.json4s" %% "json4s-jackson" % "4.0.4"
    ),
    addArtifact(
      Artifact("repack-sbt", "assembly"),
      sbtassembly.AssemblyKeys.assembly
    ),
    ThisBuild / assemblyMergeStrategy := {
      case PathList("module-info.class")         => MergeStrategy.discard
      case x if x.endsWith("/module-info.class") => MergeStrategy.discard
      case x =>
        val oldStrategy = (ThisBuild / assemblyMergeStrategy).value
        oldStrategy(x)
    },
    // Attach the "assembly" classifier to the published assembly artifact
    Compile / assembly / artifact := {
      val art = (Compile / assembly / artifact).value
      art.withClassifier(Some("assembly"))
    },
    ThisBuild / assemblyJarName := {
      s"${name.value}-${scalaBinaryVersion.value}-${version.value}-assembly.jar"
    },
    ThisBuild / assemblyShadeRules := Seq(
      ShadeRule.rename("org.json4s.**" -> "repacked.test3.org.json4s.@1").inAll
    )
  )
  .enablePlugins(AssemblyPlugin)

After publishing:

sbt "clean;publishM2"
...
[success] Total time: 0 s, completed Oct. 13, 2022, 4:19:49 p.m.
[info] Wrote /home/peng/git-proto/autoshade/repack/sbt/target/scala-2.13/repack-sbt_2.13-0.0.1-SNAPSHOT.pom
[info] Strategy 'discard' was applied to 9 files (Run the task at debug level to see details)
[info] Strategy 'rename' was applied to 4 files (Run the task at debug level to see details)
[info]  published repack-sbt_2.13 to file:/home/peng/.m2/repository/com/tribbloids/autoshade/repack-sbt_2.13/0.0.1-SNAPSHOT/repack-sbt_2.13-0.0.1-SNAPSHOT-sources.jar
[info]  published repack-sbt_2.13 to file:/home/peng/.m2/repository/com/tribbloids/autoshade/repack-sbt_2.13/0.0.1-SNAPSHOT/repack-sbt_2.13-0.0.1-SNAPSHOT-javadoc.jar
[info]  published repack-sbt_2.13 to file:/home/peng/.m2/repository/com/tribbloids/autoshade/repack-sbt_2.13/0.0.1-SNAPSHOT/repack-sbt_2.13-0.0.1-SNAPSHOT.jar
[info]  published repack-sbt_2.13 to file:/home/peng/.m2/repository/com/tribbloids/autoshade/repack-sbt_2.13/0.0.1-SNAPSHOT/repack-sbt_2.13-0.0.1-SNAPSHOT.pom
[info]  published repack-sbt_2.13 to file:/home/peng/.m2/repository/com/tribbloids/autoshade/repack-sbt_2.13/0.0.1-SNAPSHOT/repack-sbt_2.13-0.0.1-SNAPSHOT-assembly.jar
[success] Total time: 3 s, completed Oct. 13, 2022, 4:19:53 p.m.
...

Any class in the assembly jar can then be referenced under the new package repacked.test3.org.json4s.
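For illustration, a downstream sbt build could consume the published assembly roughly as follows. This is a sketch under assumptions: the downstream project itself is hypothetical, and the coordinates are inferred from the publish log above (group com.tribbloids.autoshade, artifact repack-sbt, version 0.0.1-SNAPSHOT, classifier assembly, published to the local Maven repository).

```scala
// build.sbt of a hypothetical downstream project (Scala 2.13 assumed,
// matching the repack-sbt_2.13 artifact in the publish log)

// publishM2 writes to ~/.m2, so the local Maven repository must be a resolver
resolvers += Resolver.mavenLocal

// Select the shaded jar by its "assembly" classifier
libraryDependencies +=
  ("com.tribbloids.autoshade" %% "repack-sbt" % "0.0.1-SNAPSHOT").classifier("assembly")
```

With that in place, a reference such as `classOf[repacked.test3.org.json4s.Formats]` in the downstream Scala sources is the counterpart of the test1/test2 checks in the question.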

It is not yet known what the sbt plugin does differently to make this possible. Once that has been figured out, the same routine should ideally be ported to maven-shade-plugin and gradle-shadow-plugin.

tribbloid