How can I get a specific value out of a Spark DenseVector that is stored in a DataFrame column into a new column in the same DataFrame, without using a Python user-defined function (UDF)?
More generally, how can I perform operations on vectors stored in a DataFrame column and put the results in a new column in the same DataFrame?
The following should be reproducible.
import pyspark.sql
import pyspark.sql.types as T
from pyspark.mllib.linalg import DenseVector
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# One column of dense vectors, three rows
testdf = spark.createDataFrame([
    (DenseVector([2, 3]),),
    (DenseVector([4, 5]),),
    (DenseVector([6, 7]),)],
    ['DenseVectors'])
These work for extracting a single value after collecting the rows to the driver.
testdf.collect()[0][0][1]
3.0
testdf.collect()[0][0].dot(DenseVector([0, 1]))
3.0
But I cannot get those to work to create a new column.
testdf.withColumn('test', testdf.DenseVectors[0][0][1])
> AnalysisException: u"Can't extract value from DenseVectors#211: need struct type but got vector;"
testdf.withColumn('test', testdf.DenseVectors.dot(DenseVector([0, 1])))
> TypeError: 'Column' object is not callable