I have a Spark DataFrame df with the following schema:
root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)
I would like to create a new DataFrame where each row is a Vector of Doubles, and I expect to get the following schema:
root
 |-- features: vector (nullable = true)
So far I have the following code (influenced by this post: "Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala"), but I fear something is wrong with it, because it takes a very long time to process even a moderate number of rows. Also, if there are too many rows, the application crashes with a heap space exception.
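import scala.collection.mutable
import org.apache.spark.ml.linalg.{Vector, Vectors} // assuming Spark 2.x+; on 1.x this would be org.apache.spark.mllib.linalg
// toDF requires the session's implicits to be in scope

val clustSet = df.rdd.map(r => {
  // Read the array column and copy it into a dense Vector
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF("features") // name the column to match the expected schema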
I suspect that the call to arr.toArray is not good Spark practice in this case. Any clarification would be very helpful.
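For comparison, here is a minimal sketch of the same array-to-Vector conversion done with a UDF, so the data stays in the DataFrame API instead of going through df.rdd. It assumes Spark 2.x or later with org.apache.spark.ml.linalg. The names ArrayToVectorSketch, toVector and convert, the local SparkSession, and the sample rows are all illustrative and not part of the code above. Note that it still calls toArray on each row, and whether it changes the run time or the heap errors is not something this sketch establishes.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ArrayToVectorSketch {
  // Hypothetical helper: copies each array<double> value into a dense ml Vector.
  val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

  // Replaces the "features" array column with a "features" vector column.
  def convert(df: DataFrame): DataFrame =
    df.select(toVector(col("features")).as("features"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-to-vector-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the df in the question: a single array<double> column.
    val df = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0)).toDF("features")

    val clustSet = convert(df)
    clustSet.printSchema()

    spark.stop()
  }
}
```

With a real df, only the convert call would be needed; the rest is scaffolding to make the sketch run on its own.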
Thank you!