
I'm trying to work with an external native library (a .so file) when running a Spark job. First of all, I'm submitting the file using the --files argument.

To load the library, I'm using System.load(SparkFiles.get(libname)) after creating the SparkContext (to make sure SparkFiles is populated). The problem is that the library only gets loaded on the driver node, and when tasks try to access the native methods I get

WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 13.0.0.206, executor 0): java.lang.UnsatisfiedLinkError

The only thing that worked for me was copying the .so file to all the workers before running the Spark app and creating a Scala object that loads the library before each task (this can be optimized with mapPartitions).
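That workaround can be sketched roughly like this (assuming the .so was already copied to the same local path on every worker; the path, library name, and job logic here are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object NativeLib {
  // Repeated System.load calls for the same file in the same JVM are
  // no-ops, so it is safe to call this once per partition.
  def ensureLoaded(): Unit = System.load("/local/path/to/libname.so")
}

val spark = SparkSession.builder.appName("native-lib-demo").getOrCreate()

val result = spark.sparkContext
  .parallelize(1 to 100)
  .mapPartitions { iter =>
    NativeLib.ensureLoaded() // once per partition instead of once per record
    iter.map(x => x)         // replace with calls into the native methods
  }
  .collect()
```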

I tried using

--conf "spark.executor.extraLibraryPath=/local/path/to/so" \
--conf "spark.driver.extraLibraryPath=/local/path/to/so"

to avoid that, but without success.
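For what it's worth, extraLibraryPath on its own can't fix this: it only prepends the given directory to the native-library search path of each process; it neither distributes the file nor loads it. If the .so were already present at that path on every node, the matching call would be System.loadLibrary (which searches java.library.path by bare name), not System.load (which needs an absolute path). A minimal sketch, assuming a file named libname.so already exists in that directory on every executor:

```scala
// Assumes the job was submitted with:
//   --conf "spark.executor.extraLibraryPath=/local/path/to/so"
// and that libname.so exists in that directory on every node.
object LibraryLoaderByName {
  // loadLibrary takes the bare name: "name" maps to libname.so on Linux.
  lazy val load = System.loadLibrary("name")
}
```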

Now, since I'm using EMR to run Spark jobs on transient clusters rather than a persistent one, I would like to avoid copying files to all the nodes before running the job.

Any suggestions?

ilcord

1 Answer


The solution was simpler than I thought: the library just needs to be loaded once per JVM.

So basically, what I need is to add the library file using --files and create a loader object:

object LibraryLoader {
    // SparkFiles.get resolves a file distributed with --files (pass the
    // exact file name as submitted, including its .so extension) to its
    // absolute local path, which is what System.load requires.
    lazy val load = System.load(SparkFiles.get("libname"))
}

and use it at the start of each task (map, filter, etc.); for example:

rdd.map { x =>
    LibraryLoader.load
    // do some stuff with x
}

The laziness ensures that the object is initialized only after SparkFiles has been populated, and that the load happens exactly once per JVM.
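The same trick carries over to DataFrame code: a UDF body runs in the executor JVMs, so referencing the lazy loader there works too. A hedged sketch (Native.nativeScore is a hypothetical JNI wrapper, and "libname" is a placeholder as above):

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.functions.udf

object LibraryLoader {
  lazy val load = System.load(SparkFiles.get("libname"))
}

// Hypothetical JNI wrapper around a method exported by the library.
object Native {
  @native def nativeScore(x: Double): Double
}

val score = udf { x: Double =>
  LibraryLoader.load      // lazy: System.load runs once per executor JVM
  Native.nativeScore(x)
}

// usage: df.withColumn("score", score(df("value")))
```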

ilcord
  • I was trying to apply your solution in my project. I am using DataFrames/UDFs, and a UDF is calling a native function that is defined in an .so file. How can I load these libraries without using a map method? – Vitrion Mar 13 '19 at 00:33
  • @Vitrion, I would load it directly from the UDF definition. – ilcord Mar 18 '19 at 15:07
  • In `System.load(SparkFiles.get("libname"))`, should you reference the library w/ the ".so" extension and should you prepend w/ the path to it? – juanchito Sep 14 '20 at 19:00