Match Dataframe Categorical Variables in vector Spark Scala

Question

I have been trying to follow the Stack Overflow example on creating DataFrames for the Spark ML machine learning library in Scala:

How to create correct data frame for classification in Spark ML

However, I cannot get the matching UDF to work.

The compiler reports: "kinds of the type arguments (Vector,Int,Int,String,String) do not conform to the expected kinds of the type parameters (type RT,type A1,type A2,type A3,type A4). Vector's type parameters do not match type RT's expected parameters: type Vector has one type parameter, but type RT has none"

I need to create a DataFrame to feed into the logistic regression library. The source data looks like this:

Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0

I suppose my desired output is:

+-------------------+-----+
|           features|label|
+-------------------+-----+
|[1.0,9120.50,999]  |  0.0|
|[1.0,3897.25,999]  |  0.0|
|[2.0,-523.00,999]  |  0.0|
|[0.0,-8723.15,999] |  0.0|
+-------------------+-----+

So far I have:

val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")

val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) => 
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c) 
}

val encodeLabel = udf[Double, Int](_match{case "0" => 0.0 case "1" => 1.0})

val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")

You are defining your udf with 3 input column parameters but giving it 4. (comment by eliasah, Jun 14 '16 at 15:29)
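
The wording of the compiler message ("type Vector has one type parameter") suggests that Vector is resolving to Scala's built-in collection Vector[A] rather than to Spark's vector type, which would happen if the Spark Vector is not imported. Below is a minimal sketch of a corrected UDF under some assumptions: Spark 2.x with the org.apache.spark.ml.linalg package (on Spark 1.x the corresponding classes live in org.apache.spark.mllib.linalg), amount stored as a double, and account and fraud stored as integers. The helper name encodeSource, the fallback code -1.0, and the local SparkSession with hand-built rows are illustrative and not part of the original question.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for the sketch; in spark-shell an existing session can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("toVec3-sketch").getOrCreate()

// Sample rows from the question, standing in for the prediction_test table.
val df = spark.createDataFrame(Seq(
  ("CACC1", 9120.50, 999, 0),
  ("CACC2", 3897.25, 999, 0),
  ("AMXCC1", -523.0, 999, 0),
  ("MASCC2", -8723.15, 999, 0)
)).toDF("source", "amount", "account", "fraud")

// Hypothetical helper: maps a source name to the numeric code used in the question.
def encodeSource(source: String): Double = source match {
  case "MASCC2" => 0.0
  case "CACC1"  => 1.0
  case "AMXCC1" => 2.0
  case _        => -1.0 // hypothetical fallback; the question does not define other codes
}

// One return type (Spark's Vector) followed by exactly three input types,
// matching the three columns passed in below.
val toVec3 = udf[Vector, String, Double, Int] { (source, amount, account) =>
  Vectors.dense(encodeSource(source), amount, account.toDouble)
}

// A cast is enough to turn the integer fraud flag into a double label.
val df_3 = df
  .withColumn("features", toVec3(col("source"), col("amount"), col("account")))
  .withColumn("label", col("fraud").cast("double"))
  .select("features", "label")
```

Matching on the source string inside a plain function keeps the UDF body short, and the default case avoids a MatchError at runtime for sources the match does not list.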

score 0 · Answer 1 · answered Aug 20 '18 at 04:39

With Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. To include a classifier in the Pipeline, add it at the point marked in the code. ClassificationPipeline returns a PipelineModel. Once you transform your DataFrame with this model, you get classification-ready columns named features and label.

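import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Handles categorical features: indexes a string column and returns
// the one-stage pipeline together with the name of the indexed column
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val pipeline = new Pipeline().setStages(Array(indexer))
  (pipeline, inputCol + "_indexed")
}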

// Classification Pipeline Function
def ClassificationPipeline(df:DataFrame): PipelineModel = {

  // Preprocessing categorical features
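  // Capitalized names cannot be bound in a tuple pattern, so unpack the pair explicitly
  val sourceStage = stringIndexerPipeline("Source")
  val SourcePipeline = sourceStage._1
  val Source_indexed = sourceStage._2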

  // Use StringIndexer output as input for OneHotEncoderEstimator
  val oneHotEncoder = new OneHotEncoderEstimator()
    //.setDropLast(true)
    //.setHandleInvalid("skip")
    .setInputCols(Array("Source_indexed"))
    .setOutputCols(Array("Source_indexedVec"))


  // Gather the features that will be passed through the pipeline
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount","Account")

  // Put all inputs in a column as a vector
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector")

  // Scale vector column
  val standartScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
    .setWithStd(true)
    .setWithMean(false)

  // Create stringindexer for label col
  val labelIndexer = new StringIndexer().
    setHandleInvalid("skip").
    setInputCol("Fraud").
    setOutputCol("label")

  // create classification object in here 
  // val classificationObject = new ....


  // Create a pipeline
  val pipeline = new Pipeline().setStages(
    Array(SourcePipeline, oneHotEncoder, vectorAssembler, standartScaler, labelIndexer/*, classificationObject*/))
  pipeline.fit(df)
}

val pipelineModel = ClassificationPipeline(df)

val transformedDF = pipelineModel.transform(df)
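
To illustrate the classifier placeholder in the pipeline above, here is a minimal sketch that assumes LogisticRegression as the classifier, since the question mentions logistic regression. The helper name withClassifier and the maxIter value are hypothetical; only the features and label column names come from the answer.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression

// The classifier reads the "features" and "label" columns produced by the earlier stages.
val classificationObject = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setMaxIter(10) // hypothetical setting, tune for real data

// Hypothetical helper: appends the classifier to an array of preprocessing stages.
def withClassifier(preprocessing: Array[PipelineStage]): Pipeline =
  new Pipeline().setStages(preprocessing :+ classificationObject)
```

With the classifier as the last stage, transforming a DataFrame with the fitted PipelineModel also adds a prediction column next to features and label.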
