I'm trying to run a Java project that uses Apache Spark. The project is cloned from GitHub: https://github.com/ONSdigital/address-index-data. I am new to both Spark and Java, which isn't helping me. I can't quite get to the solution using answers to similar questions, e.g. here.

If I run the code, as is, from IntelliJ (with the correct local Elasticsearch settings in application.conf), then everything works fine - IntelliJ seems to download the required JAR files and link them at run time. However, I need to configure the project so that I can run it from the command line. This seems to be a known issue listed in the GitHub project, with no solution offered.

If I run

sbt clean assembly

as in the instructions, it successfully makes a complete JAR file. However, then using

java -Dconfig.file=application.conf -jar batch/target/scala-2.11/ons-ai-batch-assembly-version.jar

this happens:

20/06/16 17:06:41 WARN Utils: Your hostname, MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.163 instead (on interface en0)
20/06/16 17:06:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/16 17:06:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/16 17:06:44 WARN Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
org.datanucleus.exceptions.NucleusUserException: ClassLoaderResolver for class "" gave error on creation : {1}
        at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1087)
        at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
        at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)

From previous posts, I think this is because sbt is merging the JAR files and information is lost in the process. However, I cannot see how to either

  1. Merge correctly, or
  2. Collate all the JAR files necessary (including Scala libraries) with a build script that builds the classpath and executes the JAR file with a java command.

How can I proceed? Please keep instructions explicit, as I am really unsure about XML configs etc. And thanks!

1 Answer


So after a long time hitting my head against a wall, I finally managed to solve this one. The answer is mostly in two other Stack Overflow answers (here and here) (huge thanks to those authors!), but I'll add more detail as I still needed more pointers.

As Oscar Korz said, the problem is that "the DataNucleus core tries to load modules as OSGi bundles, even when it is not running in an OSGi container. This works fine as long as the jars are not merged" - and merging them is exactly what I need to do. So, when running "sbt clean assembly", the assembly wrongly merges the DataNucleus plugin.xml files and does not add the additional OSGi entries to MANIFEST.MF.

I will give explicit details (and some tips) as to how I fixed the "fat jar".

  1. To get the bulk of the "fat jar", I ran

sbt clean assembly

but I made sure that I had also added a plugin.xml case within assemblyMergeStrategy in build.sbt (using first or last, so that a plugin.xml is kept):

assemblyMergeStrategy in assembly := {
    ...
    case "plugin.xml" => MergeStrategy.first
    ...
  }

This gives a "fat jar" (that still won't work) in the batch/target/scala-XXX folder, where XXX is the Scala version used.
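
For anyone unsure what the surrounding block should look like, here is a minimal sketch of the shape, using the same "assemblyMergeStrategy in assembly" syntax as the snippet above (newer sbt-assembly versions write it as "assembly / assemblyMergeStrategy"). Whatever other cases your build.sbt already has stand where the "..." is; the only line that matters for this problem is the plugin.xml one:

assemblyMergeStrategy in assembly := {
    // keep one plugin.xml in the fat jar; we replace it with a hand-merged version later
    case "plugin.xml" => MergeStrategy.first
    // fall back to whatever strategy the build already defines for everything else
    case x =>
      val oldStrategy = (assemblyMergeStrategy in assembly).value
      oldStrategy(x)
  }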

  2. Copy the resulting JAR file into a separate directory and then unpack it using:

jar xvf your-jar-assembly-0.1.jar

  3. Within the unpacked folder, edit the META-INF/MANIFEST.MF file by adding the following to the end:

    Bundle-SymbolicName: org.datanucleus;singleton:=true

    Premain-Class: org.datanucleus.enhancer.DataNucleusClassFileTransformer
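
(Side note: I edited the manifest by hand as above, but it may also be possible to have sbt-assembly write these attributes for you from build.sbt via packageOptions. I haven't verified this on this project, so treat the following as a sketch only:)

packageOptions in assembly += Package.ManifestAttributes(
    "Bundle-SymbolicName" -> "org.datanucleus;singleton:=true",
    "Premain-Class" -> "org.datanucleus.enhancer.DataNucleusClassFileTransformer"
  )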

  4. Now we need to fix the plugin.xml by merging the plugin.xml files from the three datanucleus jars. Find and unpack the original datanucleus jar files (as above) and pull out each plugin.xml (they are all different). Anebril's answer in the Stack Overflow solution gives a good start to merging these three files, but I will add a tip to help:

Pipe the contents of the three datanucleus plugin.xml files through this command; it will tell you which extension points appear in more than one file and therefore need merging:

cat plugin_core.xml plugin_rdbms.xml plugin_api.xml | grep -h "extension point" | tr -d "[:blank:]"| sort | uniq -d

You will still need to manually manage the merge of the elements highlighted as duplicates.
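
If you would rather do that duplicate check in Scala than with the shell pipeline above, a rough equivalent is below. It assumes the scala-xml library is available and that the three extracted files are named plugin_core.xml, plugin_rdbms.xml and plugin_api.xml as in the grep command (adjust the names to whatever you extracted):

import scala.xml.XML

object FindDuplicateExtensionPoints {
  def main(args: Array[String]): Unit = {
    // the three plugin.xml files pulled out of the datanucleus jars
    val files = Seq("plugin_core.xml", "plugin_rdbms.xml", "plugin_api.xml")
    // group every <extension> element by the extension point it targets
    val byPoint = files.map(f => XML.loadFile(f))
      .flatMap(_ \ "extension")
      .groupBy(e => (e \ "@point").text)
    // any point used by more than one of the files must be merged by hand in the combined plugin.xml
    byPoint.collect { case (point, uses) if uses.size > 1 => point }
      .toSeq.sorted
      .foreach(p => println(s"needs manual merge: $p"))
  }
}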

  5. Within the unpacked your-jar-assembly-0.1.jar folder, replace the original plugin.xml with your newly merged plugin.xml.
  6. Repackage the JAR (making sure to include the manifest!)

jar cmvf META-INF/MANIFEST.MF your-jar-assembly-0.1.jar *

  7. Copy this JAR file back into the batch/target/scala-XXX folder (replacing the original).

You can then use

java -Dconfig.file=application.conf -jar batch/target/scala-XXX/your-jar-assembly-0.1.jar

to run the fat jar. Voila!

  • I followed these steps and got the following error: "Identifier Factory with name "datanucleus1" is not registered! Please check your CLASSPATH for presence of the plugin containing this factory, and your PMF settings for identifier factory". Any idea how this can be resolved? I have looked all over but with no luck! – sowmyaa guptaa Jan 28 '21 at 16:11
  • I've also done a quick search and can't find much of use. Perhaps check your CLASSPATH and see which you have registered. Maybe this page might help: https://www.datanucleus.org/products/accessplatform/jdo/persistence.html. – Lisa Clark Jan 29 '21 at 19:19