I have been trying to follow the stack overflow example about creating dataframes for machine learning ml library in spark scala.
How to create correct data frame for classification in Spark ML
However, I cannot get the matching udf to work.
Syntax: "kinds of the type arguments (Vector,Int,Int,String,String) do not conform to the expected kinds of the type parameters (type RT,type A1,type A2,type A3,type A4). Vector's type parameters do not match type RT's expected parameters: type Vector has one type parameter, but type RT has none"
I need to create a dataframe to input into the logistic regression library. Source sample data example has:
Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0
I suppose my desired output is:
+-------------------+-----+
| features|label|
+-------------------+-----+
|[1.0,9120.50,999] | 0.0|
|[1.0,3897.25,999] | 0.0|
|[2.0,-523.00,999] | 0.0|
|[0.0,-8723.15,999] | 0.0|
+-------------------+-----+
So far I have:
val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")
val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) =>
val e3 = c match {
case "MASCC2" => 0
case "CACC1" => 1
case "AMXCC1" => 2
}
Vectors.dense(e1, b, c)
}
val encodeLabel = udf[Double, Int](_match{case "0" => 0.0 case "1" => 1.0})
val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")
How to create correct data frame for classification in Spark ML