
I want to use some of the classifiers provided by MLlib (random forests, etc.), but I want to use them without connecting to a Spark cluster.

If I need to somehow run some Spark stuff in-process so that I have a Spark context to use, that's fine. But I haven't been able to find any information or an example for such a use case.

So my two questions are:

  • Is there a way to use the MLlib classifiers without a Spark context at all?
  • Otherwise, can I use them by starting a Spark context in-process, without needing any kind of actual Spark installation?

1 Answer


org.apache.spark.mllib models:

  • Cannot be trained without a Spark cluster.
  • Can usually be used for predictions without a cluster, with the exception of distributed models such as ALS (see the sketch after this list).
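To illustrate the second point, here is a minimal Scala sketch of scoring a RandomForestModel in a plain JVM process with no SparkContext. It assumes the model object was serialized elsewhere with ordinary Java serialization (the mllib model classes are Serializable); the file name and feature values are hypothetical, and this is not the official `save`/`load` API (which does need a SparkContext):

```scala
import java.io.{FileInputStream, ObjectInputStream}

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

// Hypothetical: a RandomForestModel trained elsewhere and written out with
// ObjectOutputStream. The model classes are Serializable, so this works
// without any Spark runtime in this process.
val in = new ObjectInputStream(new FileInputStream("rf-model.bin"))
val model = in.readObject().asInstanceOf[RandomForestModel]
in.close()

// predict(Vector) is a purely local computation -- no SparkContext involved.
// The vector length must match the number of features the model was trained on.
val prediction = model.predict(Vectors.dense(0.1, 2.5, 3.0))
println(s"predicted class: $prediction")
```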

org.apache.spark.ml models:

There are a number of third-party tools designed to export Spark ML models to a form that can be used in a Spark-agnostic environment (jpmml-spark and modeldb, to name a couple, with no particular preference).
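As a rough sketch of that export route, here is what it looks like with the jpmml-sparkml converter (the `PMMLBuilder` API below reflects recent versions of that library; treat the exact calls as an assumption and check its documentation). The pipeline and toy data are purely illustrative:

```scala
import java.io.File

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.jpmml.sparkml.PMMLBuilder

val spark = SparkSession.builder()
  .master("local[*]")          // runs in-process, no cluster needed
  .appName("pmml-export")
  .getOrCreate()
import spark.implicits._

// Toy training data; a real dataset would be loaded from storage.
val df = Seq(
  (1.0, 0.1, 2.3),
  (0.0, 1.5, 0.7),
  (1.0, 0.3, 2.1),
  (0.0, 1.2, 0.9)
).toDF("label", "f1", "f2")

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val rf = new RandomForestClassifier()

val pipelineModel = new Pipeline().setStages(Array(assembler, rf)).fit(df)

// Write the fitted pipeline as PMML; the resulting file can be scored by
// any PMML engine with no Spark dependency at prediction time.
new PMMLBuilder(df.schema, pipelineModel).buildFile(new File("rf-pipeline.pmml"))

spark.stop()
```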

Spark mllib models have limited PMML support as well.
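Limited meaning only a handful of model types (linear models, SVM, KMeans, binary logistic regression) mix in PMMLExportable; random forests do not. A small sketch, with placeholder weights standing in for a trained model:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Placeholder weights/intercept; a real model would come from training.
val model = new LogisticRegressionModel(Vectors.dense(0.5, -1.2, 0.3), 0.1)

val pmmlXml: String = model.toPMML()   // PMML document as a String
model.toPMML("/tmp/logreg.pmml")       // or write it straight to a local path
```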

Commercial vendors usually provide their own tools for productionizing Spark models.

You can of course use a local "cluster", but it is probably still a bit too heavy for most possible applications. Starting a full context takes at least a few seconds and has a significant memory footprint.
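If you do go the in-process route, a local master is all it takes; a rough sketch, assuming the Spark jars are on the classpath and a LIBSVM-format training file exists at the hypothetical path shown:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// "local[*]" runs the whole context inside the current JVM: no Spark
// installation or cluster is needed, only the Spark jars on the classpath.
val sc = new SparkContext(
  new SparkConf().setAppName("local-mllib").setMaster("local[*]"))

// Hypothetical LIBSVM-format training file.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 10,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)

sc.stop()
```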

