I'm trying to use an external native library (.so file) when running a Spark job. First of all, I'm submitting the file using the --files argument.
To load the library, I'm calling System.load(SparkFiles.get(libname)) after creating the SparkContext (to make sure SparkFiles are populated).
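For reference, this is roughly what the driver side looks like (libnative.so, MyApp and the paths are placeholders for my actual files):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object MyApp {
  def main(args: Array[String]): Unit = {
    // submitted with: spark-submit --files /local/path/to/libnative.so --class MyApp my-app.jar
    val sc = new SparkContext(new SparkConf().setAppName("MyApp"))
    // SparkFiles.get resolves the local copy of the file shipped via --files,
    // but this System.load call only runs in the driver JVM
    System.load(SparkFiles.get("libnative.so"))
    // ... transformations whose tasks call the native methods
  }
}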
The problem is that the library is only loaded on the driver node, and when tasks try to access the native methods I get
WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 13.0.0.206, executor 0): java.lang.UnsatisfiedLinkError
The only thing that worked for me was copying the .so file to all the workers before running the Spark app and creating a Scala object that loads the library before each task (this can be optimized with mapPartitions).
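Roughly, that workaround looks like this (the library path and the native call below are placeholders):

import org.apache.spark.rdd.RDD

// singleton whose lazy val runs System.load at most once per executor JVM
object NativeLoader {
  lazy val handle: Unit = System.load("/local/path/to/libnative.so") // .so copied to every worker beforehand
  def ensureLoaded(): Unit = handle
}

object Workaround {
  def process(input: RDD[String]): RDD[Int] =
    input.mapPartitions { iter =>
      NativeLoader.ensureLoaded()   // load once per partition, not once per record
      iter.map(_.length)            // placeholder for the actual call into the native library
    }
}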
I tried using
--conf "spark.executor.extraLibraryPath=/local/path/to/so" \
--conf "spark.driver.extraLibraryPath=/local/path/to/so"
to avoid that, but without success.
Since I'm running Spark jobs on EMR rather than on a persistent cluster, I would like to avoid copying files to all the nodes before running the job.
Any suggestions?