When converting an RDD of NumPy arrays to a Spark DataFrame using DenseVector, the following piece of code works fine:
import numpy as np
from pyspark.ml.linalg import DenseVector

rdd = spark.sparkContext.parallelize([
    np.array([u'5.0', u'0.0', u'0.0', u'0.0', u'1.0']),
    np.array([u'6.0', u'0.0', u'0.0', u'0.0', u'1.0'])
])

# Map each array to a (scalar, DenseVector) tuple before converting.
(rdd.map(lambda x: (x[0].tolist(), DenseVector(x)))
 .toDF()
 .show(2, False))
# +---+---------------------+
# |_1 |_2                   |
# +---+---------------------+
# |5.0|[5.0,0.0,0.0,0.0,1.0]|
# |6.0|[6.0,0.0,0.0,0.0,1.0]|
# +---+---------------------+
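Presumably the unwanted column could also be dropped after the fact; a minimal sketch, assuming the imports and rdd from the snippet above (DataFrame.drop is standard PySpark):

# Build the two-column DataFrame first, then drop the scalar column.
df = rdd.map(lambda x: (x[0].tolist(), DenseVector(x))).toDF()
df.drop('_1').show(2, False)  # leaves only the vector column

But ideally I want to construct the single-column DataFrame directly from the RDD.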
However, I do not want the first column from above, i.e. my target output is:
# +---------------------+
# |_1                   |
# +---------------------+
# |[5.0,0.0,0.0,0.0,1.0]|
# |[6.0,0.0,0.0,0.0,1.0]|
# +---------------------+
I tried the following operations, and they all resulted in a TypeError: not supported type: <type 'numpy.ndarray'>. How would one get the desired result above? Any help is greatly appreciated!
rdd.map(lambda x: DenseVector(x[0:])).toDF()    # only a DenseVector
rdd.map(lambda x: (DenseVector(x[0:]))).toDF()  # with parentheses
rdd.map(lambda x: DenseVector(x[1:])).toDF()    # only from element 1
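One thing I notice: the working snippet maps to a tuple, while the failing attempts map to a bare DenseVector, so schema inference may be choking on the vector's underlying numpy array. A minimal sketch of what I suspect might work (untested on my side), assuming the same rdd as above:

# Wrap each DenseVector in a one-element tuple so toDF() can infer a schema.
rdd.map(lambda x: (DenseVector(x),)).toDF().show(2, False)

Note that (DenseVector(x)) is just a parenthesized expression, not a tuple; the trailing comma is what makes it a one-element tuple. If this is indeed the fix, an explanation of why toDF() rejects a bare DenseVector would be welcome.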