Following on from my earlier question, "Convert a Spark Vector of features into an array", I've made progress:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.{DenseVector => SDV} // alias from my earlier question
import spark.implicits._ // for the $"..." column syntax (assuming a SparkSession named spark)

def extractUdf = udf((v: SDV) => v.toArray)
val temp: DataFrame = dataWithFeatures.withColumn("extracted_features", extractUdf($"features"))
temp.printSchema()

// these three lines do seem to pick up the first, second and third feature
// values, but I don't see why (see question below)
val featuresArray1: Array[Double] = temp.rdd.map(r => r.getAs[Double](0)).collect
val featuresArray2: Array[Double] = temp.rdd.map(r => r.getAs[Double](1)).collect
val featuresArray3: Array[Double] = temp.rdd.map(r => r.getAs[Double](2)).collect

val allfeatures: Array[Array[Double]] = Array(featuresArray1, featuresArray2, featuresArray3)
val flatfeatures: Array[Double] = allfeatures.flatten
This seems to give the result I want. The extractUdf function turns the features: Vector column into an extracted_features array column, as the schema shows:
|-- features: vector (nullable = true)
|-- extracted_features: array (nullable = true)
| |-- element: double (containsNull = false)
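For reference, a quick way to eyeball the conversion (using the same column names as above):

temp.select("features", "extracted_features").show(3, truncate = false)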
However, I don't understand two things. First, why do the next three lines (the ones building featuresArray1, featuresArray2 and featuresArray3) pick up extracted_features rather than any other column in temp (such as features)? Second, how can I obtain the array indices (0, 1, 2) in a way that is derived from the number of features rather than hard-coded? Thanks for your help!
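For context, here is roughly the shape I imagine the non-hard-coded version taking (just a sketch: it assumes every row's extracted_features array has the same length, and reads that length off the first row):

// derive the number of features from the data instead of hard-coding 0, 1, 2
val numFeatures: Int =
  temp.select("extracted_features").head.getAs[Seq[Double]](0).length
val allfeatures: Array[Array[Double]] =
  (0 until numFeatures).map { i =>
    temp.rdd.map(r => r.getAs[Seq[Double]]("extracted_features")(i)).collect
  }.toArray
val flatfeatures: Array[Double] = allfeatures.flatten

Is something along these lines the right approach, or is there a more idiomatic way?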