
I am reading a table from a MySQL database in a Spark project written in Scala. It's my first week on it, so I am really not up to speed yet. When I try to run

  val clusters = KMeans.train(parsedData, numClusters, numIterations)

I am getting an error for parsedData that says:

  type mismatch; found : org.apache.spark.rdd.RDD[Map[String,Any]] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]

My parsed data is created above like this:

 val parsedData = dataframe_mysql.map(_.getValuesMap[Any](List("name", "event","execution","info"))).collect().foreach(println)

where dataframe_mysql is whatever is returned from the sqlcontext.read.format("jdbc").option(....) call.

How am I supposed to convert my Unit value into something that fits the requirements of the train function?

According to the documentation I am supposed to use something like this:

data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

Am I supposed to transform my values to Double? When I try to run the command above, my project crashes.

thank you!

Kratos

1 Answer


Remove the trailing .collect().foreach(println). After calling collect, you no longer have an RDD; it becomes a plain local collection.

Subsequently, when you call foreach, it returns Unit: foreach is meant for side effects, such as printing each element of a collection.
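The same pitfall can be reproduced with plain Scala collections, without Spark at all, since foreach on any collection returns Unit; only map keeps a value you can pass along. A minimal sketch (the sample data here is made up):

```scala
// map returns a new collection; foreach runs a side effect and returns Unit.
val data = List("1.0 2.0", "3.0 4.0")

// map keeps the parsed values: a List[Array[Double]]
val parsed: List[Array[Double]] = data.map(_.split(' ').map(_.toDouble))

// foreach prints each element but evaluates to Unit, so anything
// chained after it has nothing to work with
val printed: Unit = parsed.foreach(arr => println(arr.mkString(" ")))
```

In the question's code, the `.collect().foreach(println)` suffix has exactly this effect: parsedData ends up bound to the Unit result of foreach instead of an RDD.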

Pranav Shukla
  • yes! I did this, but even without it the types still don't match: type mismatch; found : org.apache.spark.rdd.RDD[Map[String,Any]] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] – Kratos May 30 '16 at 12:06
  • For KMeans, you need to turn all your features into Doubles and create a Vector out of them. The example in the MLlib guide splits by ' ' because the input is separated by spaces and the fields are numeric values, which are converted using map(_.toDouble). – Pranav Shukla May 30 '16 at 12:16
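A sketch of the conversion the last comment describes, using plain Scala on a single row (the map contents and the toDouble helper here are hypothetical, not taken from the question). Each value pulled out of the getValuesMap result has static type Any, so it must be narrowed to a Double before a feature array can be built:

```scala
// Hypothetical row, standing in for one element of the
// RDD[Map[String, Any]] produced by getValuesMap.
val row: Map[String, Any] = Map("event" -> 7, "execution" -> "3", "info" -> 2.5)

// Narrow an Any to Double; throws if the value is not numeric.
def toDouble(v: Any): Double = v match {
  case d: Double => d
  case i: Int    => i.toDouble
  case l: Long   => l.toDouble
  case s: String => s.toDouble
}

// Pick the numeric columns in a fixed order; a text column such as
// "name" cannot go into a KMeans feature vector at all.
val features: Array[Double] = Array("event", "execution", "info").map(k => toDouble(row(k)))
```

On the real data this conversion would run inside the RDD's map (wrapping the result in Vectors.dense to get an RDD[Vector] for KMeans.train), with no trailing collect.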