
I'm looking for a way to use a Spark ML pipeline to predict single tuples/rows. I fit the pipeline in a previous job and exported the model via save. The pipeline contains a random forest classification model and some preprocessing (string indexer and vector indexer).

Now I'd like to use the pipeline in an event-driven setting, where creating data sets for prediction is just not feasible. I tried to extract the random forest and do the predictions directly using model.predict(vector). However, that doesn't work, because the input has to go through the preprocessing stages first.

I looked for a similar single-row/vector function for a complete pipeline model, but couldn't find one. It is possible to create a data frame from a single row, but that is understandably very inefficient (see code below).

  • Question 1: Is there another way of predicting single data items using the pipeline model?
  • Question 2: If not, is there a way of creating data frames more efficiently?

Thanks in advance!

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.Row

// `spark` is an existing SparkSession.
val pipelineModel = PipelineModel.load("target/pipeline.model")
val data = spark.read().format("libsvm").load("/opt/spark-2.3.2/data/mllib/sample_libsvm_data.txt")
val collected = data.collect() as Array<Row>

val schema = data.schema()
val mutableListOfOneRow = mutableListOf<Row>()
var errorCount = 0
var counter = 0

// Score each row on its own by wrapping it in a one-row DataFrame.
collected.forEach {
    mutableListOfOneRow.add(it)
    val label = it[0] as Double

    val df = spark.createDataFrame(mutableListOfOneRow, schema)
    val result = pipelineModel.transform(df).collect() as Array<Row>
    val firstRow = result[0]
    println("label $label vs prediction ${firstRow[7]}")

    // Compare string forms so the check works whether column 7 is a String or a Double.
    if (label.toString() != firstRow[7].toString()) {
        errorCount++
    }
    counter++
    mutableListOfOneRow.clear()
}
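
For Question 2, one direction I can imagine is micro-batching: buffer a few events and build one data frame for all of them, so the cost of createDataFrame and transform is shared. This is only a sketch, and it assumes a small delay per event is acceptable. `MicroBatcher`, `maxBatchSize` and `scoreBatch` are hypothetical names of my own, not part of Spark; `scoreBatch` stands in for the createDataFrame/transform/collect sequence from the loop above.

```kotlin
// Hypothetical helper (not a Spark API): buffers incoming items and passes them
// to `scoreBatch` in groups, so one DataFrame can serve several events.
// Not thread-safe; callers on multiple threads would need to synchronize.
class MicroBatcher<T, R>(
    private val maxBatchSize: Int,
    private val scoreBatch: (List<T>) -> List<R>
) {
    init {
        require(maxBatchSize > 0) { "maxBatchSize must be positive" }
    }

    private val buffer = mutableListOf<T>()

    // Adds one item. Returns the scored batch once it is full, otherwise an empty list.
    fun submit(item: T): List<R> {
        buffer.add(item)
        return if (buffer.size >= maxBatchSize) flush() else emptyList()
    }

    // Scores whatever is currently buffered, e.g. from a timer to bound latency.
    fun flush(): List<R> {
        if (buffer.isEmpty()) return emptyList()
        val batch = buffer.toList() // copy, so clearing the buffer does not affect the batch
        buffer.clear()
        return scoreBatch(batch)
    }
}
```

With the pipeline from above, `T` would be `Row` and `scoreBatch` would wrap `spark.createDataFrame(rows, schema)` followed by `pipelineModel.transform(df).collect()`. Whether this helps depends on how much latency the event-driven setting can tolerate.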
zero323