I am testing the following workflow:
- Building a model from a huge set of data. (Python)
- Using the model to perform estimates in a production server. (Scala)
I am using a Pipeline with a VectorIndexer followed by a GBTRegressor. I have 5 input columns (for now; eventually we'd like to add more). I could work with just the GBTRegressor, or even switch to another model if it makes a difference.
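For reference, here is roughly what the pipeline looks like. This is a Scala sketch of the training code (which actually runs in Python); the column names, the maxCategories value, and the save path are illustrative, not my real settings:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.GBTRegressor

// Index categorical features so the trees can treat them as categories.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // illustrative threshold

// Gradient-boosted trees on the indexed features.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

val pipeline = new Pipeline().setStages(Array(indexer, gbt))
// val fitted = pipeline.fit(trainingData)          // trainingData: the large dataset
// fitted.write.overwrite().save("/path/to/model")  // hypothetical path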
Step 1 takes about 15 minutes on a cluster of 8 machines, which is fine. Step 2 takes about 100ms to estimate a single value. We'd like to return this as part of an API call, so 100ms is too long.
I am aware that Spark is designed for large datasets, and that this slowness is probably overhead from that design, but building a model from a large dataset and then running it against individual values seems like a common use-case. I could use something designed for small datasets, but then I would have trouble building my model from the large one.
Is there some sort of workaround for this? I'd like to stick with Spark, but is there any way to perform the second operation substantially faster? Am I missing something?
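To make the question concrete, this is the kind of call path I'm hoping exists: scoring a bare feature vector without building a DataFrame per request. A sketch under two assumptions I haven't verified: that PredictionModel.predict is publicly callable (it is in Spark 3.x), and that my VectorIndexer stage is effectively a pass-through, so skipping it doesn't change the result:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.GBTRegressionModel

val model = PipelineModel.load(r.getPath)
// Pull the fitted tree ensemble out of the pipeline.
val gbt = model.stages.last.asInstanceOf[GBTRegressionModel]

// Score one vector directly: no DataFrame, no query planning per request.
// CAVEAT: this skips the VectorIndexer stage, so it only matches
// model.transform() if the indexer leaves these features unchanged.
def fastPredict(features: Vector): Double = gbt.predict(features)

val estimate = fastPredict(Vectors.dense(x1, x2, x3, x4, x5))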
Here are some excerpts from the slow part of my code:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Rendition Size Estimate")
  .config("spark.ui.enabled", false) // the UI is unnecessary overhead here
  .getOrCreate()
val model = PipelineModel.load(r.getPath)
....
val input = RenditionSizeEstimator.spark.createDataFrame(Seq(
  (0.0, Vectors.dense(x1, x2, x3, x4, x5)) // dummy label; only the features matter for prediction
)).toDF("label", "features")
val t = model.transform(input)
return t.select("prediction").head().getDouble(0) // read the prediction column by name rather than by position
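For what it's worth, this is roughly how I'm measuring the ~100ms (a minimal sketch; the predict wrapper and the inputs are illustrative):

// Hypothetical wrapper around the slow path above.
def predict(x1: Double, x2: Double, x3: Double, x4: Double, x5: Double): Double = {
  val input = RenditionSizeEstimator.spark.createDataFrame(Seq(
    (0.0, Vectors.dense(x1, x2, x3, x4, x5))
  )).toDF("label", "features")
  model.transform(input).select("prediction").head().getDouble(0)
}

predict(1.0, 1.0, 1.0, 1.0, 1.0) // warm-up call: the first transform pays extra JIT/planning cost

val start = System.nanoTime()
val estimate = predict(1.0, 2.0, 3.0, 4.0, 5.0) // illustrative inputs
println(s"single prediction took ${(System.nanoTime() - start) / 1e6} ms")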
Related Qs:
- Apache Spark's performance tuning
- Speedup Spark classifier on small datasets
- Spark cluster does not scale to small data
- How to serve a Spark MLlib model?
UPDATE: that last one asks how to serve predictions at all. I already know one way to do that; my concern here is performance.