
I am trying to import a POJO model into Sparkling Water. Currently I import the model by compiling it with:

javac -cp /opt/bitnami/commons/pojo.jar -J-Xmx2g -J-XX:MaxPermSize=256m /opt/bitnami/commons/GBM_model_python_1642760589977_1.java

After that, I load it using hex.genmodel.GenModel, something like this:

import java.io.File
import java.net.URLClassLoader
import hex.genmodel.GenModel

val classLocation = new File("/opt/bitnami/commons/").toURI.toURL
val location = Array[java.net.URL](classLocation)
val classLoader = new URLClassLoader(location, classOf[GenModel].getClassLoader)
val cls = Class.forName("GBM_model_python_1642760589977_1", true, classLoader)
val model: GenModel = cls.newInstance().asInstanceOf[GenModel]

The problem is that when making predictions I run into an issue with URLClassLoader:

import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

val easyModel = new EasyPredictModelWrapper(model)
classLoader.close()
val header = model.getNames
val outputType = easyModel.getModelCategory

val predictionRdd = testData.rdd.map(row => {
  val r = new RowData
  header.indices.foreach(idx => r.put(header(idx), row.getDouble(idx).asInstanceOf[AnyRef]))
  easyModel.predictMultinomial(r)
})

This throws the exception:

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: java.net.URLClassLoader
Serialization stack:

I don't know why, since I think the URLClassLoader isn't actually in use at that point. I tried calling classLoader.close() to fix it, but that didn't work.

My questions are:

1. Is there an easier way to import POJO models into Sparkling Water?
2. If this is the recommended way: right now I am compiling the model locally, but I need to save the models in S3. Is there a way to load the model without compiling it locally, for example by keeping it in memory?
3. How can I fix the serialization issue?

1 Answer


You are compiling the POJO at runtime and then distributing it to the executors. Your code is also, probably unintentionally, distributing the URLClassLoader: the loader has nothing to do with the POJO (the POJO itself is Serializable), but the task closure most likely captures a reference to it, and URLClassLoader is not serializable.
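
A minimal sketch of how a closure can accidentally capture a non-serializable value, assuming the code runs in a scope where vals become fields of an enclosing object (for example a class, or a line in the Spark shell); all names here are illustrative:

import org.apache.spark.rdd.RDD

class Pipeline {
  // Not serializable, and held as a field of the enclosing object
  val classLoader =
    new java.net.URLClassLoader(Array.empty[java.net.URL], getClass.getClassLoader)
  val header: Array[String] = Array("f1", "f2")

  def run(rdd: RDD[Double]): RDD[Int] =
    // Reading the field `header` makes the lambda capture `this`,
    // which pulls `classLoader` into the serialized closure as well
    rdd.map(_ => header.length)
}

Copying the values the lambda actually needs into local vals first (e.g. val h = header) keeps `this`, and with it the class loader, out of the closure.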

I think this approach cannot work in general anyway: if you compile the class on the driver and serialize an instance of it, only the instance's fields are serialized, not the class definition itself, so the executors have no bytecode to deserialize it against and will fail with a ClassNotFoundException.

A better approach would be to put the compiled class on the classpath, alongside the other Spark jars, when the job is submitted.
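
For example, a sketch (the jar name, paths, and application class are illustrative, and pojo.jar is assumed to be the h2o-genmodel jar from your compile step): package the compiled class into a jar and pass both jars to spark-submit, which puts them on the driver and executor classpaths:

jar cf /opt/bitnami/commons/gbm_pojo.jar -C /opt/bitnami/commons GBM_model_python_1642760589977_1.class

spark-submit \
  --jars /opt/bitnami/commons/gbm_pojo.jar,/opt/bitnami/commons/pojo.jar \
  --class com.example.ScoringJob \
  scoring-job.jar

With the class already on the application classpath, no URLClassLoader is needed and nothing non-serializable is captured:

// The default class loader now finds the POJO class on every node
val cls = Class.forName("GBM_model_python_1642760589977_1")
val model: GenModel = cls.newInstance().asInstanceOf[GenModel]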

Michal Kurka