Spark 2.0.0+
At first glance all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching to a newer version.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example LogisticRegressionModel has it, so you can save a fitted model to the desired path.
Spark < 1.6
Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-like API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply a must-have in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal, and relatively well-tested solution.
I believe that, at the end of the day, what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline:
- use ML to load, clean and transform the data,
- extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
- add custom cross-validation / evaluation,
- save the MLlib model using a method of your choice (Spark model or PMML).
There is also a temporary solution provided in the corresponding JIRA ticket.