
I have the following DataFrame with an array of doubles that needs to be converted to Vectors so it can be passed to an ML algorithm. Can anyone help me with this?

fList: org.apache.spark.sql.DataFrame = [features: array<double>]
+--------------------------------------------------------------------------------+
|features                                                                        |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]          |
+--------------------------------------------------------------------------------+

Expected Output: Should look something like this.

fList: org.apache.spark.sql.DataFrame = [features: vector]
  • Take a look at this other question, maybe it will help: https://stackoverflow.com/questions/42138482/pyspark-how-do-i-convert-an-array-i-e-list-column-to-vector – xmorera Nov 29 '17 at 01:50

1 Answer


I would suggest you write a udf function:

import scala.collection.mutable  // needed for mutable.WrappedArray
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.Vectors

// .toArray is required: Vectors.dense expects Array[Double], not WrappedArray
def convertArrayToVector = udf((features: mutable.WrappedArray[Double]) => Vectors.dense(features.toArray))

and call that function with the withColumn API:

scala> df.withColumn("features", convertArrayToVector($"features"))
res1: org.apache.spark.sql.DataFrame = [features: vector]

I hope the answer is helpful.

Ramesh Maharjan
  • getting the error `:62: error: not found: value mutable` for `def convertArrayToVector = udf((features: mutable.WrappedArray[Double]) => Vectors.dense(features))` – Naveen Nov 29 '17 at 06:42
  • you have to `import scala.collection._` – Ramesh Maharjan Nov 29 '17 at 07:00
  • Ramesh... I added the import statement, which resolved that issue, but I'm facing a new one. Any idea what this error is about? After `import scala.collection._` I get: `:55: error: overloaded method value dense with alternatives: (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and> (firstValue: Double, otherValues: Double*)org.apache.spark.mllib.linalg.Vector cannot be applied to (scala.collection.mutable.WrappedArray[Double])` for `def convertArrayToVector = udf((features: mutable.WrappedArray[Double]) => Vectors.dense(features))` – Naveen Nov 29 '17 at 18:16
  • You have to convert the WrappedArray to an Array. See in my answer that I wrote `Vectors.dense(features.toArray)`; `.toArray` is necessary. – Ramesh Maharjan Nov 29 '17 at 23:48
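
Putting the answer and the comment fixes together, a complete sketch might look like this (assuming Spark 2.x with the older `mllib` Vectors, as used in the answer; for `spark.ml` pipelines you would import `org.apache.spark.ml.linalg.Vectors` instead — `df` stands for the questioner's DataFrame):

```scala
import scala.collection.mutable  // provides mutable.WrappedArray
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.Vectors

// Spark hands an array<double> column to a udf as a WrappedArray[Double];
// Vectors.dense needs a plain Array[Double], hence the .toArray call.
val convertArrayToVector = udf(
  (features: mutable.WrappedArray[Double]) => Vectors.dense(features.toArray)
)

// Replace the array<double> column with a vector column of the same name.
val fList = df.withColumn("features", convertArrayToVector($"features"))
// fList: org.apache.spark.sql.DataFrame = [features: vector]
```

The alternative linked in the comments (the `VectorAssembler`-based approach from the PySpark question) works too, but a udf keeps the column name and schema change in one step.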