
General Description: I have two projects, A and B. Project A must use version v1 of the L library/API, and project B must use version v2 of the L library/API. Project A has a dependency on project B (in project A, I need to call a method contained in B).

Concrete description: Project A is actually a machine learning project with a collection of algorithms that use an older version of spark-mllib. I want to integrate the XGBoost-Spark algorithm into project A.

The problem is that the XGBoost API, specifically the ml.dmlc.xgboost4j.scala.spark.XGBoost.train() method, expects an RDD&lt;org.apache.spark.ml.feature.LabeledPoint&gt;. But org.apache.spark.ml.feature.LabeledPoint is only available in the newer version of spark-mllib, and from project A (which uses the older version of spark-mllib) I only have access to an org.apache.spark.mllib.regression.LabeledPoint. So I cannot directly integrate XGBoost in project A without upgrading project A's spark-mllib version.

Fortunately, the newer version of spark-mllib has a method for converting the old LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint) to the new LabeledPoint (org.apache.spark.ml.feature.LabeledPoint). The method is org.apache.spark.mllib.regression.LabeledPoint.asML().
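For illustration, the conversion itself is a one-liner once the newer spark-mllib is on the classpath (a minimal sketch; the class name and example values are just placeholders):

import org.apache.spark.ml.feature.LabeledPoint;   // the "new" LabeledPoint (spark-mllib 2.x)
import org.apache.spark.mllib.linalg.Vectors;

public class LabeledPointConversionExample {
    public static void main(String[] args) {
        // The "old" LabeledPoint that project A works with
        org.apache.spark.mllib.regression.LabeledPoint oldPoint =
                new org.apache.spark.mllib.regression.LabeledPoint(1.0, Vectors.dense(0.5, 1.5));

        // asML() only exists when the newer spark-mllib (2.x) is the resolved dependency
        LabeledPoint newPoint = oldPoint.asML();
        System.out.println(newPoint);
    }
}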

So, the question is: is there any clever way of using that .asML() method, which is available only in the newer version of spark-mllib, so that I can convert the LabeledPoint and pass it to the XGBoost API?

I am not familiar with how dependencies are handled by Maven, but I thought of something like this:

Create a project B that uses the newer version of spark-mllib and the XGBoost API, and that contains a class with a method which receives the parameters from project A, converts the old LabeledPoint to the new LabeledPoint, calls the XGBoost.train() method to generate a model, and then passes the model back to project A. We import that class in project A (from project B), call its method, get the model, and continue with our business as usual.
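A minimal sketch of what such a bridge class in project B could look like (the class and method names are hypothetical; it assumes project B is compiled against spark-mllib 2.x and xgboost4j-spark):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;

// Hypothetical bridge class living in project B, compiled against the newer spark-mllib.
public class XGBoostBridge {

    // Receives the "old" LabeledPoints coming from project A and converts them to the
    // "new" LabeledPoint type that the XGBoost-Spark API expects.
    public static RDD<org.apache.spark.ml.feature.LabeledPoint> toMlPoints(
            JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> oldPoints) {
        return oldPoints.map(p -> p.asML()).rdd();
    }

    // The real bridge would then call ml.dmlc.xgboost4j.scala.spark.XGBoost.train(...)
    // with the converted RDD (the exact parameters depend on the xgboost4j-spark version)
    // and hand the resulting model back to project A.
}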

Of course, I tried to do that, but it doesn't work. I think that is because we can only have one version of spark-mllib in the whole dependency tree. Since the class from project B throws java.lang.NoSuchMethodError: org.apache.spark.mllib.regression.LabeledPoint.asML()Lorg/apache/spark/ml/feature/LabeledPoint;, it seems that the whole dependency tree actually resolves to the older version of spark-mllib (which happens because the older version is closer to the root of the dependency tree, and Maven picks the nearest definition), even though project B declares the newer version of spark-mllib, which does have the asML() method.

So, the actual question is: is there any clever way of making this work without upgrading the spark-mllib version in project A? Upgrading is not a viable option; project A is big, and if I upgrade that version I screw up just about everything.

[Update] I even tried to use a ClassLoader (URLClassLoader) in order to load the class directly from spark-mllib_2.11-2.3.0.jar and print all the available methods. Code here:

import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

URLClassLoader clsLoader = URLClassLoader.newInstance(new URL[] {
        new URL("file:///home/myhome/spark-mllib_2.11-2.3.0.jar")
});

// Load LabeledPoint through the URLClassLoader and list its declared methods
Class<?> cls = clsLoader.loadClass("org.apache.spark.mllib.regression.LabeledPoint");

Method[] methods = cls.getDeclaredMethods();
for (Method m : methods)
    System.out.println(m.toString());

In the pom.xml of this project, if I add a dependency on:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.3.0</version>
</dependency>

The method public org.apache.spark.ml.feature.LabeledPoint org.apache.spark.mllib.regression.LabeledPoint.asML() is present in the results when I use the 2.3.0 version.

But when I use version 1.6.2 of spark-mllib, it isn't there anymore, even though the asML() method is inside the spark-mllib 2.3.0 jar the class loader points to. Which is kind of weird.
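A likely explanation (an assumption on my part, not something verified here) is that URLClassLoader.newInstance() delegates to the application class loader first, so LabeledPoint is actually resolved from whatever spark-mllib version Maven puts on the classpath, not from the jar handed to the loader. One way to check where the class really came from (same example jar path as above):

import java.net.URL;
import java.net.URLClassLoader;

URLClassLoader clsLoader = URLClassLoader.newInstance(new URL[] {
        new URL("file:///home/myhome/spark-mllib_2.11-2.3.0.jar")
});
Class<?> cls = clsLoader.loadClass("org.apache.spark.mllib.regression.LabeledPoint");

// If this does not print the 2.3.0 jar above, the parent (application) class loader
// resolved LabeledPoint from the regular Maven classpath instead of from the jar.
System.out.println(cls.getProtectionDomain().getCodeSource().getLocation());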

1 Answer


You can achieve this by creating a shaded dependency of Project B and using it in Project A. Refer to this answer for understanding Maven shading and how to use it.
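For illustration, the relocation discussed in the comments below could be configured in project B's pom roughly like this (a sketch of a maven-shade-plugin setup; the shaded package prefix is only an example):

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <relocations>
                            <relocation>
                                <pattern>org.apache.spark</pattern>
                                <shadedPattern>org.shaded.apache.spark</shadedPattern>
                            </relocation>
                        </relocations>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>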

  • To be more precise, please use relocation of the package names of the common jar in Project B. – KrazyGautam Apr 25 '18 at 12:15
  • Thank you for your answers! Well, I tried to relocate the packages of spark-mllib. For example, for org.apache.spark.ml.feature.LabeledPoint we now have org.shaded.apache.spark.ml.feature.LabeledPoint. But the problem is that now the asML() method returns an org.shaded.apache.spark.ml.feature.LabeledPoint (since relocating rewrites the bytecode), while the XGBoost API needs an org.apache.spark.ml.feature.LabeledPoint. Everything is screwed up. Not to mention other problems regarding dependencies. I think there's no possible way to do this... – crs12decoder May 02 '18 at 15:00