I'm having a perplexing issue when I run a Spark application via a deployed jar (built by the Maven Shade Plugin) in non-local environments.

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.ds.PGSimpleDataSource
    at com.zaxxer.hikari.util.UtilityElf.createInstance(UtilityElf.java:96)
    at com.zaxxer.hikari.pool.PoolBase.initializeDataSource(PoolBase.java:314)
    at com.zaxxer.hikari.pool.PoolBase.<init>(PoolBase.java:108)
    at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:105)
    at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:72)
    at mypackage.SansORMProvider.get(SansORMProvider.java:42)
    at mypackage.MySansORMProvider.get(MySansORMProvider.scala:15)
    at mypackage.MyApp$.main(MyApp.scala:63)
    at mypackage.MyApp.main(MyApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:680)
Caused by: java.lang.ClassNotFoundException: org.postgresql.ds.PGSimpleDataSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at com.zaxxer.hikari.util.UtilityElf.createInstance(UtilityElf.java:83)
    ... 13 more
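
For context, line 42 of SansORMProvider is where the HikariCP pool is constructed. HikariCP instantiates the configured DataSource class reflectively inside UtilityElf.createInstance, using the classloader that loaded HikariCP itself, which is why the ClassNotFoundException surfaces there. Below is a minimal sketch of that kind of setup; the property values are illustrative assumptions, not the actual SansORMProvider code:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class DataSourceSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // HikariCP loads this class by name at pool construction time
        // (UtilityElf.createInstance in the trace above).
        config.setDataSourceClassName("org.postgresql.ds.PGSimpleDataSource");
        config.addDataSourceProperty("serverName", "localhost"); // illustrative
        config.addDataSourceProperty("databaseName", "mydb");    // illustrative
        // Constructing the pool triggers the reflective load; this is the
        // point that fails with ClassNotFoundException on the cluster.
        try (HikariDataSource ds = new HikariDataSource(config)) {
            System.out.println("pool started");
        }
    }
}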

The reason this is perplexing is that the following is in my pom.xml:

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <scope>compile</scope>
</dependency>

The shade plugin configuration doesn't reference this Postgres dependency, nor does it contain any pattern that would match it.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <goals>
                <goal>shade</goal>
            </goals>
            <phase>package</phase>
            <configuration>
                <artifactSet>
                    <excludes combine.children="append">
                        <exclude>org.apache.spark:*:*</exclude>
                        <exclude>org.apache.hadoop:*:*</exclude>
                        <exclude>org.slf4j:*</exclude>
                    </excludes>
                </artifactSet>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <relocations>
                    <relocation>
                        <pattern>com.google.common</pattern>
                        <shadedPattern>${project.groupId}.google.common</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>io.netty</pattern>
                        <shadedPattern>${project.groupId}.io.netty</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>okhttp3</pattern>
                        <shadedPattern>${project.groupId}.okhttp3</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.fasterxml.jackson</pattern>
                        <shadedPattern>${project.groupId}.fasterxml.jackson</shadedPattern>
                    </relocation>
                    ]
                </relocations>
                <shadedArtifactAttached>true</shadedArtifactAttached>
            </configuration>
        </execution>
    </executions>
</plugin>

Spark dependencies (as requested):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-tags_2.11</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>

In the output of the Maven command that builds the jar, I can see:

[INFO] Including org.postgresql:postgresql:jar:42.2.1 in the shaded jar.

And when I run jar tvf myShadedJar.jar | grep postgres, I can see the missing class.

One weird thing that may be relevant: when I extract the jar with jar xf, there's no org/postgresql folder. Yet when I extract it with unzip, it's there.

What might be the problem? How do I confirm it? And is it expected that the exploded jar is missing the org/postgresql folder?
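
One way to confirm what's actually inside the shaded jar, independent of jar xf and unzip, is to list the matching entries programmatically. A minimal sketch, assuming the jar file name from above:

import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class ListShadedEntries {
    public static void main(String[] args) throws IOException {
        // Jar name assumed; adjust the path to the deployed artifact.
        try (JarFile jar = new JarFile("myShadedJar.jar")) {
            jar.stream()
               .map(JarEntry::getName)
               .filter(name -> name.startsWith("org/postgresql/"))
               .forEach(System.out::println);
        }
    }
}

Worth noting: zip archives are not required to contain explicit directory entries, so the absence of an org/postgresql folder in one tool's output does not by itself mean the class files are missing.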

b15
  • Do you use Spring Boot or any other framework doing custom class loading? – dan1st Aug 30 '21 at 14:33
  • Are you sure you wanted to write `java tvf myShadedJar.jar | grep postgres` and not `tar tvf myShadedJar.jar | grep postgres`? – dan1st Aug 30 '21 at 14:34
  • @dan1st other modules in the project use google guice (not sure if that even does custom class loading) but not this one. – b15 Aug 30 '21 at 14:35
  • Was meant to be `jar` not `java`. Fixing – b15 Aug 30 '21 at 14:36
  • Where does "`User class threw exception:`" come from, i.e. what wraps the `ClassNotFoundException` in a `RuntimeException`? Since this is not a message of ordinary Java exceptions, which usually look like "`Exception in thread "..." ...`". Perhaps a few more stacktrace lines would give valuable information too. Please add the POM declarations of the `maven-shade-plugin` to your question. – Gerold Broser Aug 30 '21 at 14:51
  • 1
    Let's see the full pom configuration of the maven-shade-plugin please. BTW @GeroldBroser I believe the exception is wrapped by Apache Spark. See here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html – Adriaan Koster Aug 30 '21 at 14:55
  • It is a spark application, yes. @GeroldBroser I've added what you asked for. Let me know if I'm missing anything else. – b15 Aug 30 '21 at 14:57
  • Is the spurious ] in the config also in your project? – Adriaan Koster Aug 30 '21 at 15:00
  • It is. I'll remove and rebuild. Is it expected that there is no org/postgresql folder when I extract the jar? – b15 Aug 30 '21 at 15:01
  • I would expect that to be there, if the shade plugin unzips all .jar files and puts them in one big directory structure. Another question: which versions of Maven and the shade plugin are you using? – Adriaan Koster Aug 30 '21 at 15:04
  • This is the only shade configuration in your and all your parent POMs, is it? Because "[_Why Does My Second Shade Include The Results Of The First Execution?_](https://maven.apache.org/plugins/maven-shade-plugin/faq.html)". – Gerold Broser Aug 30 '21 at 15:11
  • Do your Spark declarations look like in [the page _Adriaan Koster_ linked](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html) in one of his comments? I.e. do they have `provided`? – Gerold Broser Aug 30 '21 at 15:39
  • @GeroldBroser The dependency is the one in the question above. It is not a Spark dependency and has scope `compile` – b15 Aug 30 '21 at 15:40
  • Another note: if I `unzip my.jar` I do see the class file. But `jar xf` doesn't produce it. Why is that? – b15 Aug 30 '21 at 15:42
  • According to your now updated stacktrace it begins at `org.apache.spark.deploy.yarn.ApplicationMaster:680`. Please add the `org.apache.spark` dependencies of your POM to the question, so we can see which versions you're using. – Gerold Broser Aug 30 '21 at 16:32
  • @GeroldBroser Updated with the spark dependencies. Version is 2.4.3 – b15 Aug 30 '21 at 17:43
  • 1
    [Your last update](https://stackoverflow.com/posts/68985780/revisions#rev-body-a6ae05ae-ad02-42fd-b1b2-6ca9aa5413c2) leading to "_when I actually unzip the jar with `jar xf` theres no org/postgresql folder. Yet, when i `unzip` the jar it's there._" changes things. If you supply the complete POM (with possible sensitive information anonymized) I can create a project here to see whether I can confirm this behaviour and maybe find the reason for it. I'd need the JDK and Maven version you use, too. – Gerold Broser Aug 30 '21 at 23:44
  • @GeroldBroser I'm trying to reproduce an isolated version of this so I don't have to give you the very complex, many-pom structure I have. In the meantime, I noticed the `jar xf` command is breaking on `java.io.IOException: license : could not create directory at sun.tools.jar.Main.extractFile(Main.java:1045) at sun.tools.jar.Main.extract(Main.java:981) at sun.tools.jar.Main.run(Main.java:311) at sun.tools.jar.Main.main(Main.java:1288)` – b15 Aug 31 '21 at 14:52
  • That would seem to be a permission problem on the file system. Please make sure the process you are starting has write permission on the directory you start it in – Adriaan Koster Aug 31 '21 at 14:54
  • Could it be there is already a Postgres dependency provided by the Spark ecosystem, which clashes with your own? – Adriaan Koster Aug 31 '21 at 15:03
  • @AdriaanKoster Good idea! If the one that is provided by the Spark runtime system comes earlier in the classpath _and_ is a version that doesn't have the class in it... see also [this answer](https://stackoverflow.com/a/6644467/1744774): "_Resources 'earlier' on the classpath take precedence over resources that are specified after them._". Still remaining question is the `jar x` vs. `unzip` behaviour. But if the former is the reason of the `ClassNotFoundException` issue that may be unrelated. – Gerold Broser Aug 31 '21 at 15:46
  • Another interesting tidbit. The following shows the 'missing' class is present on the line directly before the exception: `val cl = getClass.getClassLoader; val classesInPackage = ClassPath.from(cl).getTopLevelClassesRecursive("org.postgresql").toArray().mkString(", ")` – b15 Aug 31 '21 at 18:39
  • What's on the line causing the exception then? `at mypackage.SansORMProvider.get(SansORMProvider.java:42)` – Adriaan Koster Sep 01 '21 at 14:35

2 Answers


I struggled with exactly the same issue while migrating our Spark application from AWS EMR 5.24.1 to 5.33.0. After several weeks of regular attempts to find a way out, I eventually realized that the HikariCP package in our uber-jar was not being used during execution. This became evident when I excluded HikariCP from the fat jar and the error didn't change, even though I expected a complaint that HikariCP could not be found.

It turned out that EMR 5.33.0 ships HikariCP-java7-2.4.12.jar in multiple lib folders, and this package is used at runtime instead of the one in the uber-jar. I simply removed all occurrences of this package from both the master and core nodes, and that fixed the problem.

I hope this helps somebody who is struggling with a similar issue.
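
A quick way to check whether a cluster-provided jar is shadowing your bundled one is to print where HikariCP classes are actually loaded from at runtime. A minimal diagnostic sketch (the choice of HikariConfig as the probe class is arbitrary):

import java.net.URL;
import java.security.CodeSource;

public class WhereIsHikari {
    public static void main(String[] args) {
        Class<?> probe = com.zaxxer.hikari.HikariConfig.class;
        CodeSource source = probe.getProtectionDomain().getCodeSource();
        URL location = (source != null) ? source.getLocation() : null;
        // On an affected cluster this prints a cluster lib path (e.g. a
        // HikariCP-java7 jar) instead of the deployed uber-jar.
        System.out.println(probe.getName() + " loaded by " + probe.getClassLoader()
                + " from " + location);
    }
}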

newfeya

EDIT: Based on new info from the OP, this is likely not the answer.

It looks like you have run into this issue: https://github.com/jeremylong/DependencyCheck/issues/2324

Unfortunately - shaded jars represent a challenge. In the cases when the dependent library that is "shaded" contains a pom.xml in the META-INF directory (i.e. was built by maven or uses the maven plugin for gradle) - then we can extract information and detect the dependency. However, in the case of commons-fileupload they do not have a pom.xml in the META-INF (not entirely sure what build system they use). As such, dependency-check will not be able to identify the dependency. The situation is unfortunately rather bleak for shaded or uber jars (actually - it's worse for uber jars). Even several of the commercial products have difficulty with this.

You could copy the file explicitly from your project into the shaded jar (e.g. by putting a copy of the jar file in your src/main/resources directory).

Or you could create a submodule in your Maven project which contains this jar (and others which might suffer from the same problem). Then add the submodule as a dependency and let the shade plugin include it.
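
To check whether the quoted limitation even applies here, you can look for an embedded Maven pom inside the dependency jar. A minimal sketch, assuming the jar sits in the default local repository:

import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class CheckEmbeddedPom {
    public static void main(String[] args) throws IOException {
        // Path assumed; point it at your copy of the Postgres driver jar.
        String jarPath = System.getProperty("user.home")
                + "/.m2/repository/org/postgresql/postgresql/42.2.1/postgresql-42.2.1.jar";
        try (JarFile jar = new JarFile(jarPath)) {
            jar.stream()
               .map(JarEntry::getName)
               .filter(name -> name.startsWith("META-INF/maven/") && name.endsWith("pom.xml"))
               .forEach(System.out::println);
        }
    }
}

If a META-INF/maven/.../pom.xml entry prints, the jar carries Maven metadata and this limitation is not the cause (which is what the comments below conclude).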

Adriaan Koster
  • The [Spark page](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html) you linked in one of your comments to the question declares all Spark dependencies with `provided`. It's very likely that `provided` artifacts do not have a `pom.xml` in them. I recommend to use [`install:install-file`](https://maven.apache.org/plugins/maven-install-plugin/install-file-mojo.html) or [`deploy:deploy-file`](https://maven.apache.org/plugins/maven-deploy-plugin/deploy-file-mojo.html) prior to building, i.e. creating the Fat JAR. – Gerold Broser Aug 30 '21 at 15:15
  • cont'd: 1) That's the cleanest and the _Maven way_. 2) There's no need for explicit copying then. 3) There's no need for creating an extra project then. – Gerold Broser Aug 30 '21 at 15:22
  • Sorry can you explain how this relates to my problem? Your quote is from a project called DependencyCheck. Are you saying spark uses the same mechanism and it can't see it? – b15 Aug 30 '21 at 15:35
  • I've tried the submodule method previously and it did not work. – b15 Aug 30 '21 at 15:37
  • @GeroldBroser I'm including the dependency with the compile scope. It is not a spark dependency. – b15 Aug 30 '21 at 15:39
  • @b15 You wrote in a comment to the Q: "_It is a spark application, yes._". Don't you have `org.apache.spark` dependencies in your POM then? – Gerold Broser Aug 30 '21 at 15:41
  • @GeroldBroser I do have spark dependencies in my pom and they do have the provided scope but I guess I'm confused because the dependency in question is not one of those. – b15 Aug 30 '21 at 15:42
  • @b15 Please add the complete stacktrace to your Q, so we can see where it all begins. – Gerold Broser Aug 30 '21 at 15:47
  • To determine if the problem I describe in my answer is actually there in your case, check if the dependency 'org.postgresql:postgresql:jar:42.2.1' has a pom.xml in it. If it does, then there is a different issue causing your problem. – Adriaan Koster Aug 31 '21 at 14:49
  • When I explode the org.postgresql:postgresql:jar:42.2.1 jar I do see a pom, yes. – b15 Aug 31 '21 at 18:47