
I have a PMML model that was exported from Python, and I'm using it in Spark for downstream processing. Since the JPMML Evaluator isn't serializable, I'm using it inside mapPartitions. This works, but it takes a while to complete, because mapPartitions has to materialize the whole iterator and collect/build the new RDD. I'm wondering if there's a more efficient way to run the Evaluator.

I've noticed that while Spark is executing this RDD, my CPU is under-utilized (it drops to ~30%). Also, in the Spark UI, the Task Time (GC Time) is red at 53s/15s.

JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(r -> {
  // initialize JPMML evaluator
  List<ClassifiedPojo> list = new ArrayList<>();

  while (r.hasNext()) {
    r.next();
    // classify
    list.add(new ClassifiedPojo());
  }

  return list.iterator();
});
webber
  • "Since the jpmml Evaluator isn't serializable" - are you sure? – user1808924 Jan 11 '18 at 18:59
  • Looks like the Evaluator uses something that's not serializable. "Caused by: java.io.NotSerializableException: org.xml.sax.helpers.LocatorImpl" – webber Jan 11 '18 at 20:50
  • This is SAX Locator information - it can either be 1) removed or 2) converted to a serializable form, as explained in the README (and many other places). Better yet, search SO or Google for the above exception message! – user1808924 Jan 11 '18 at 21:12

1 Answer

Finally! I had to do two things.

First, I had to fix the SAX Locator issue by running this (LocatorNullifier is a visitor from the JPMML-Model library that strips the non-serializable SAX Locator information from the PMML object):

LocatorNullifier locatorNullifier = new LocatorNullifier();
locatorNullifier.applyTo(pmml);

Second, I refactored my mapPartitions to use Streams, details here.
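The Streams refactor can be sketched roughly like this. The class and the classify step below are placeholders standing in for the question's JPMML Evaluator call; the point is that wrapping the partition's Iterator in a Stream lets Spark consume results lazily instead of materializing the whole List up front:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyPartitionDemo {

    // Stand-in for the per-record classification step (the real code
    // would invoke the JPMML Evaluator here).
    static String classify(Integer record) {
        return "class-" + (record % 2);
    }

    // The mapPartitions body: wrap the partition's Iterator in a Stream
    // so each record is classified on demand as the returned iterator is
    // consumed, rather than building a full List first.
    static Iterator<String> classifyPartition(Iterator<Integer> r) {
        Stream<String> classified = StreamSupport
            .stream(Spliterators.spliteratorUnknownSize(r, Spliterator.ORDERED), false)
            .map(LazyPartitionDemo::classify);
        return classified.iterator();
    }

    public static void main(String[] args) {
        Iterator<Integer> partition = Arrays.asList(1, 2, 3).iterator();
        Iterator<String> out = classifyPartition(partition);
        while (out.hasNext()) {
            System.out.println(out.next());
        }
    }
}
```

In the Spark code, the lambda passed to mapPartitions would return classifyPartition-style output directly (`return classified.iterator();`), so no intermediate List is allocated per partition.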

This gave me a big boost. Hope it helps.
