I have a DataFrame, created with a Pipeline object, that looks like this:
df.show()
+--------------------+-----+
| features|label|
+--------------------+-----+
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
|[-0.0775219322931...| 0|
+--------------------+-----+
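For context, the pipeline that produces this DataFrame is roughly of the following shape. The stage and column names here are illustrative placeholders rather than my real ones; the point is only that the output ends up as a two-column DataFrame with a features vector and a numeric label.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Illustrative stages only -- the real pipeline differs, but its output
# has the same shape: a 'features' vector column and a 'label' column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
indexer = StringIndexer(inputCol="category", outputCol="label")
pipeline = Pipeline(stages=[assembler, indexer])
df = pipeline.fit(raw_df).transform(raw_df).select("features", "label")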
I can successfully extract the feature vectors into their own columns like this (cols is my list of feature column names):
df_table = df.rdd.map(lambda x: [float(y) for y in x['features']]).toDF(cols)
The problem with the above is that it does not retain the label column. As a workaround, I used a join to bring the label column back (roughly as sketched below), but I find that approach too convoluted.
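My workaround is along these lines (simplified; the actual code differs, but the idea is to add an index to both sides and join on it):

# Index both sides with zipWithIndex, then join the label back onto
# the exploded feature columns by that index.
feats = (df.rdd.map(lambda x: [float(y) for y in x['features']])
           .zipWithIndex()
           .map(lambda r: r[0] + [r[1]])
           .toDF(cols + ['idx']))
labels = df.rdd.map(lambda x: x['label']).zipWithIndex().toDF(['label', 'idx'])
df_table = feats.join(labels, on='idx').drop('idx')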
How can I use a one-liner like the one above to extract the features vector into a Spark DataFrame and, at the same time, keep the label column attached?
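To make the goal concrete, the result I am after is a single DataFrame with one double column per feature plus the original label, i.e. a schema along these lines (the feature column names come from cols and are just placeholders here):

df_table.printSchema()
root
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- ...
 |-- label: long (nullable = true)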