When converting an RDD of NumPy arrays to a Spark DataFrame using DenseVector, the following piece of code works fine:
import numpy as np
from pyspark.ml.linalg import DenseVector

rdd = spark.sparkContext.parallelize([
    np.array([u'5.0', u'0.0', u'0.0', u'0.0', u'1.0']),
    np.array([u'6.0', u'0.0', u'0.0', u'0.0', u'1.0'])
])

# Map each array to a (scalar, DenseVector) tuple before converting.
(rdd.map(lambda x: (x[0].tolist(), DenseVector(x)))
 .toDF()
 .show(2, False))
# +---+---------------------+
# |_1 |_2                   |
# +---+---------------------+
# |5.0|[5.0,0.0,0.0,0.0,1.0]|
# |6.0|[6.0,0.0,0.0,0.0,1.0]|
# +---+---------------------+
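Presumably the unwanted column could also be dropped after the fact; a minimal sketch, assuming the imports and rdd from the snippet above (DataFrame.drop is standard PySpark):

# Build the two-column DataFrame first, then drop the scalar column.
df = rdd.map(lambda x: (x[0].tolist(), DenseVector(x))).toDF()
df.drop('_1').show(2, False)  # leaves only the vector column

But ideally I want to construct the single-column DataFrame directly from the RDD.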
However, I do not want the first column from above, i.e. my target output is:
# +---------------------+
# |_1                   |
# +---------------------+
# |[5.0,0.0,0.0,0.0,1.0]|
# |[6.0,0.0,0.0,0.0,1.0]|
# +---------------------+
I tried the following operations, and they all resulted in a TypeError: not supported type: <type 'numpy.ndarray'>. How would one get the desired result above? Any help is greatly appreciated!
rdd.map(lambda x: DenseVector(x[0:])).toDF()    # only a DenseVector
rdd.map(lambda x: (DenseVector(x[0:]))).toDF()  # with parentheses
rdd.map(lambda x: DenseVector(x[1:])).toDF()    # only from element 1
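One thing I notice: the working snippet maps to a tuple, while the failing attempts map to a bare DenseVector, so schema inference may be choking on the vector's underlying numpy array. A minimal sketch of what I suspect might work (untested on my side), assuming the same rdd as above:

# Wrap each DenseVector in a one-element tuple so toDF() can infer a schema.
rdd.map(lambda x: (DenseVector(x),)).toDF().show(2, False)

Note that (DenseVector(x)) is just a parenthesized expression, not a tuple; the trailing comma is what makes it a one-element tuple. If this is indeed the fix, an explanation of why toDF() rejects a bare DenseVector would be welcome.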