I am testing the following workflow:
- Building a model from a huge set of data. (Python)
- Using the model to perform estimates in a production server. (Scala)
I am using a Pipeline with a VectorIndexer followed by a GBTRegressor. I have 5 input columns (for now; eventually we'd like to add more). I could work with just the GBTRegressor, or even switch to another model if it makes a difference.
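For reference, here is roughly what the pipeline looks like. This is a Scala sketch of the training code (which actually runs in Python); the column names, the maxCategories value, and the save path are illustrative, not my real settings:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.GBTRegressor

// Index categorical features so the trees can treat them as categories.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // illustrative threshold

// Gradient-boosted trees on the indexed features.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

val pipeline = new Pipeline().setStages(Array(indexer, gbt))
// val fitted = pipeline.fit(trainingData)          // trainingData: the large dataset
// fitted.write.overwrite().save("/path/to/model")  // hypothetical path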
Step 1 takes about 15 minutes on a cluster of 8 machines, which is fine. Step 2 takes about 100ms to estimate a single value. We'd like to return this as part of an API call, so 100ms is too long.
I am aware that Spark is designed for large datasets, and that this slowness is probably overhead from that design, but building a model from a large dataset and then running it against individual values seems like a common use-case. I could use something designed for small datasets, but then I would have trouble building my model from the large one.
Is there some sort of workaround for this? I'd like to stick with Spark, but is there any way to perform the second operation substantially faster? Am I missing something?
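To make the question concrete, this is the kind of call path I'm hoping exists: scoring a bare feature vector without building a DataFrame per request. A sketch under two assumptions I haven't verified: that PredictionModel.predict is publicly callable (it is in Spark 3.x), and that my VectorIndexer stage is effectively a pass-through, so skipping it doesn't change the result:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.GBTRegressionModel

val model = PipelineModel.load(r.getPath)
// Pull the fitted tree ensemble out of the pipeline.
val gbt = model.stages.last.asInstanceOf[GBTRegressionModel]

// Score one vector directly: no DataFrame, no query planning per request.
// CAVEAT: this skips the VectorIndexer stage, so it only matches
// model.transform() if the indexer leaves these features unchanged.
def fastPredict(features: Vector): Double = gbt.predict(features)

val estimate = fastPredict(Vectors.dense(x1, x2, x3, x4, x5))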
Here are some excerpts from the slow part of my code:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Rendition Size Estimate")
  .config("spark.ui.enabled", false) // the UI is unnecessary overhead here
  .getOrCreate()
val model = PipelineModel.load(r.getPath)
....
val input = RenditionSizeEstimator.spark.createDataFrame(Seq(
  (0.0, Vectors.dense(x1, x2, x3, x4, x5)) // dummy label; only the features matter for prediction
)).toDF("label", "features")
val t = model.transform(input)
return t.select("prediction").head().getDouble(0) // read the prediction column by name rather than by position
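For what it's worth, this is roughly how I'm measuring the ~100ms (a minimal sketch; the predict wrapper and the inputs are illustrative):

// Hypothetical wrapper around the slow path above.
def predict(x1: Double, x2: Double, x3: Double, x4: Double, x5: Double): Double = {
  val input = RenditionSizeEstimator.spark.createDataFrame(Seq(
    (0.0, Vectors.dense(x1, x2, x3, x4, x5))
  )).toDF("label", "features")
  model.transform(input).select("prediction").head().getDouble(0)
}

predict(1.0, 1.0, 1.0, 1.0, 1.0) // warm-up call: the first transform pays extra JIT/planning cost

val start = System.nanoTime()
val estimate = predict(1.0, 2.0, 3.0, 4.0, 5.0) // illustrative inputs
println(s"single prediction took ${(System.nanoTime() - start) / 1e6} ms")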
Related Qs:
- Apache Spark's performance tuning
- Speedup Spark classifier on small datasets
- Spark cluster does not scale to small data
- How to serve a Spark MLlib model?
UPDATE: that last one asks how to serve predictions at all. I already know one way to do that; my concern here is performance.