General description: I have two projects, A and B. Project A must use version v1 of library/API L. Project B must use version v2 of L. Project A depends on project B (in project A, I need to call a method contained in B).
Concrete description: Project A is a machine learner with a collection of algorithms that use an older version of spark-mllib. I want to integrate the XGBoost-Spark algorithm into project A.
The problem is that the XGBoost API, specifically the ml.dmlc.xgboost4j.scala.spark.XGBoost.train() method, expects an RDD<org.apache.spark.ml.feature.LabeledPoint>. But org.apache.spark.ml.feature.LabeledPoint is only available in the newer version of spark-mllib, and from project A (which uses the older version) I only have access to org.apache.spark.mllib.regression.LabeledPoint. So I cannot directly integrate XGBoost into project A without upgrading project A's spark-mllib version.
Fortunately, the newer version of spark-mllib has a method for converting the old LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint) to the new LabeledPoint (org.apache.spark.ml.feature.LabeledPoint). The method is org.apache.spark.mllib.regression.LabeledPoint.asML().
So, the question is: is there any clever way of using that .asML() method, which is available only in the newer version of Spark, so that I can convert the LabeledPoint and pass it to the XGBoost API?
I am not familiar with how Maven handles dependencies, but I thought of something like this:
Create a project B that uses the newer version of spark-mllib and the XGBoost API. It contains a class with a method that receives the parameters from project A, converts the old LabeledPoint to the new LabeledPoint, calls XGBoost.train() to generate a model, and passes the model back to project A. Project A imports that class from project B, calls its method, gets the model, and continues with business as usual.
Of course, I tried to do that, but it doesn't work. The class from project B throws:
java.lang.NoSuchMethodError: org.apache.spark.mllib.regression.LabeledPoint.asML()Lorg/apache/spark/ml/feature/LabeledPoint;
I think that's because we can only have one version of spark-mllib in the whole dependency tree. It seems that the older version is the one actually used (presumably because it is closer to the root of the dependency tree), even though project B declares the newer version of spark-mllib, which has the asML() method.
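To check which spark-mllib jar actually wins at runtime, a small diagnostic along these lines might help. This is only a minimal sketch: ClassOrigin, locationOf and hasPublicMethod are names made up for illustration and are not part of Spark or XGBoost. It uses plain JDK reflection to report where a class was loaded from and whether it exposes a method with a given name.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper (illustration only, not a Spark or XGBoost class).
final class ClassOrigin {

    // Returns the jar or directory a class was loaded from,
    // or "<bootstrap>" for JDK core classes that carry no code source.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "<bootstrap>";
        }
        return source.getLocation().toString();
    }

    // True if the class declares or inherits a public method with this name.
    static boolean hasPublicMethod(Class<?> cls, String methodName) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass the class to inspect as the first argument, e.g.
        // org.apache.spark.mllib.regression.LabeledPoint
        String className = args.length > 0 ? args[0] : "java.lang.String";
        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        System.out.println("has asML(): " + hasPublicMethod(cls, "asML"));
    }
}
```

Run from project A's runtime classpath with org.apache.spark.mllib.regression.LabeledPoint as the argument, this should show whether the class comes from the 1.6.2 jar or from the 2.3.0 jar.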
So, the actual question is: is there any clever way of making this work without upgrading the spark-mllib version of project A? Upgrading is not a viable option. Project A is big, and if I upgrade that version, I screw up just about everything.
[Update] I even tried to use a ClassLoader (URLClassLoader) in order to load the class directly from spark-mllib_2.11-2.3.0.jar and print all the available methods. Code here:
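import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// The statements below run inside a method that declares "throws Exception".
// Load LabeledPoint through a loader pointing at the 2.3.0 jar and list its declared methods.
URLClassLoader clsLoader = URLClassLoader.newInstance(new URL[] {
    new URL("file:///home/myhome/spark-mllib_2.11-2.3.0.jar")
});
Class<?> cls = clsLoader.loadClass("org.apache.spark.mllib.regression.LabeledPoint");
for (Method m : cls.getDeclaredMethods()) {
    System.out.println(m);
}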
If I add the following dependency to this project's pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
then the method public org.apache.spark.ml.feature.LabeledPoint org.apache.spark.mllib.regression.LabeledPoint.asML() is present in the results.
But when I use version 1.6.2 of spark-mllib in the pom, the method isn't listed anymore, even though the URLClassLoader still points at spark-mllib_2.11-2.3.0.jar, which does contain asML(). Which is kind of weird.
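One possible explanation, which I have not verified against this project, is the parent-first delegation of java.net.URLClassLoader: a loader created with URLClassLoader.newInstance(URL[]) asks its parent (the application class loader) first and only looks in its own URLs if the parent cannot find the class. If 1.6.2 is on the application classpath, the parent would then return the 1.6.2 LabeledPoint. The sketch below shows just that delegation with JDK classes only; ParentFirstDemo and loadThroughChild are made-up names, and the child loader is deliberately given no URLs, so the only way it can return the class is by delegating to its parent.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo class (illustration only).
final class ParentFirstDemo {

    // Loads the class named like `known` through a fresh child URLClassLoader
    // whose parent is the loader that already defined `known`.
    static Class<?> loadThroughChild(Class<?> known) {
        ClassLoader parent = known.getClassLoader();
        try (URLClassLoader child = new URLClassLoader(new URL[0], parent)) {
            // Delegation: the child asks `parent` before searching its own URLs.
            return child.loadClass(known.getName());
        } catch (ClassNotFoundException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> viaChild = loadThroughChild(ParentFirstDemo.class);
        // The child hands back the parent's class object instead of defining its own copy.
        System.out.println("same class object: " + (viaChild == ParentFirstDemo.class));
    }
}
```

This should print "same class object: true". Under the same assumption, passing null as the parent, as in new URLClassLoader(urls, null), would stop the loader from seeing the application classpath, but then the URL array would need every jar that spark-mllib itself depends on, not just spark-mllib_2.11-2.3.0.jar.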