I am trying to store a DenseVector into a DataFrame in a new column.
I tried the following code, but got an AttributeError
saying 'numpy.ndarray' object has no attribute '_get_object_id'
.
from pyspark.sql import functions
from pyspark.mllib.linalg import Vectors
df = spark.createDataFrame([{'name': 'Alice', 'age': 1},
{'name': 'Bob', 'age': 2}])
vec = Vectors.dense([1.0, 3.0, 2.9])
df.withColumn('vector', functions.lit(vec))
I'm hoping to store a vector per row for computation purpose. Any help is appreciated.
[Python 3.7.3, Spark version 2.4.3, via Jupyter All-Spark-Notebook]
EDIT
I tried to follow the answer here as suggested by Florian, but I could not adapt the udf to take in a custom pre-constructed vector.
conv = functions.udf(lambda x: DenseVector(x), VectorUDT())
# Same with
# conv = functions.udf(lambda x: x, VectorUDT())
df.withColumn('vector', conv(vec)).show()
I get this error :
TypeError: Invalid argument, not a string or column: [1.0,3.0,2.9] of type <class 'pyspark.mllib.linalg.DenseVector'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.