I'm writing a piece of Scala code that constructs a Spark ML pipeline from a config file. I want to be able to instantiate objects that extend the Params class (i.e. any PipelineStage) as well as the Pipeline itself. The config looks like this:
pipeline {
  class = "org.apache.spark.ml.Pipeline"
  stages = ["pca", "vectorAssembler"]

  vectorAssembler {
    class = "org.apache.spark.ml.feature.VectorAssembler"
    inputCols = ["pcacol", "col1", "col2", "col3"]
    outputCol = "features"
  }

  pca {
    ....
  }
}
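(For context, the instantiation step itself is simple reflection on the "class" key. A minimal sketch, with java.util.ArrayList standing in for a real PipelineStage so the snippet runs without Spark on the classpath:)

```scala
// Hedged sketch: instantiate a class from its fully-qualified name, the way
// the parser does with the "class" key of each config block.
// java.util.ArrayList is a stand-in, not actual Spark code.
val className = "java.util.ArrayList"
val instance: Any = Class.forName(className).getDeclaredConstructor().newInstance()
```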
At the moment I instantiate the class and set parameters by calling the Params#set method. I want the parser to be as general as possible: I want to be able to set parameters of any type, including primitives, arrays of primitives and arrays of objects (e.g. Pipeline#stages). The problem is that I can't distinguish the type parameter of the setter. I look at the type of the required parameter and cast the config's value to that type:
param match {
  case p: DoubleArrayParam =>
    ...
  case p: IntArrayParam =>
    ...
  case p: StringArrayParam =>
    ...
  case p: Param[Array[Params]] =>
    ...
  case p =>
    ...
}
At runtime Param[String], Param[Array[String]] and Param[Array[Params]] all erase to the same type. For string arrays and arrays of primitives there are dedicated classes (DoubleArrayParam, IntArrayParam, StringArrayParam), but I can't find a way to tell an array of Params from a simple String, since a Param[Any] matches the penultimate case in the code above.
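The erasure problem can be reproduced without Spark at all. In this sketch a plain generic class stands in for Param[T]:

```scala
// Holder[T] stands in for Param[T]. The type argument is erased at runtime,
// so the first case matches *every* Holder, regardless of its actual T
// (the compiler emits an "unchecked" warning for exactly this reason).
class Holder[T](val value: T)

def describe(h: Holder[_]): String = h match {
  case _: Holder[Array[Double]] => "array param"
  case _                        => "other"
}

val a = describe(new Holder(Array(1.0, 2.0))) // "array param"
val b = describe(new Holder("a string"))      // also "array param"
```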
The only solution I've come up with is to parse the Pipeline config separately, but that means I might run into other special cases in the future.
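(For illustration, one direction that sidesteps erasure: method signatures are not erased, so the declared parameter class of a stage's typed setter can be recovered reflectively. This is just a sketch of the idea, not code I have; FakeStage is a hypothetical stand-in for a real PipelineStage with a setStages-style setter.)

```scala
// Hypothetical sketch: read the declared parameter type of a typed setter
// via Java reflection. Unlike Param[T]'s type argument, the setter's
// parameter class survives erasure, so arrays and element types are visible.
class FakeStage {
  def setStages(value: Array[String]): this.type = this
}

val setter     = classOf[FakeStage].getMethods.find(_.getName == "setStages").get
val paramClass = setter.getParameterTypes.head
val isArray    = paramClass.isArray           // true for Array[String]
val elemType   = paramClass.getComponentType  // classOf[String]
```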