My Spark DataFrame holds single-element vector values such as `[0.2]`, and I want to convert them to plain floats.

printSchema() shows that each column is of the type vector.

I tried to get the values out of the brackets using the code below (for a single column, col1):

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

firstelement=udf(lambda v:float(v[0]),FloatType())
df.select(firstelement('col1')).show()

How can I apply it to all columns of df?

Mohamed Ali JAMAOUI
Fluxy
  • Well... you are trying to give `DenseVector` objects to the model, I imagine from the error message. You should give pure numpy arrays. Where are you getting your input data from? – Daniel Möller Feb 18 '20 at 18:54
  • @DanielMöller: Thanks. As I mentioned, `x_train` and `x_test` are pandas data frames. I cannot find any reference to DenseVector... – Fluxy Feb 18 '20 at 18:57
  • They should be numpy arrays, not dataframes. `x_train_numpy = x_train.values`. – Daniel Möller Feb 18 '20 at 18:58
  • @DanielMöller: I changed them to numpy as you showed. Again the same error. – Fluxy Feb 18 '20 at 19:00
  • Then the dataframe is probably not in a good format. You should probably look into it and prepare the data yourself. – Daniel Möller Feb 18 '20 at 19:01
  • Can you print the type and shape of the numpy arrays? – Daniel Möller Feb 18 '20 at 19:11
  • @DanielMöller: You are right! All values inside pandas DataFrames are of the type `vector`, e.g. `[1]`, `[202]`. I should get them out of brackets... – Fluxy Feb 18 '20 at 19:14
  • @DanielMöller: If you know how to fix it, I'd appreciate and accept the answer. Thanks. – Fluxy Feb 18 '20 at 19:15
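
The fix the comments converge on can be sketched on the pandas side as well: unwrap each single-element list before calling `.values`, so the model receives plain numbers instead of list-valued cells. The frame below is a hypothetical stand-in for `x_train`, mimicking the bracketed values (`[1]`, `[202]`) described in the comments:

```python
import pandas as pd

# Hypothetical frame mimicking x_train, where every cell is a
# single-element list such as [1] or [202].
x_train = pd.DataFrame({"col1": [[1], [202]], "col2": [[0.2], [0.85]]})

# Unwrap each single-element list into a plain float, column by column.
x_train_flat = x_train.apply(lambda s: s.map(lambda v: float(v[0])))

# Now .to_numpy() yields a numeric array the model can consume.
x_train_numpy = x_train_flat.to_numpy()
```

After this, `x_train_numpy` is a plain 2-D float array with the brackets gone.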

2 Answers


1. Extract first element of a single vector column:

To get the first element of a vector column, you can use the approach from this SO discussion: Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

Here's a reproducible example:

>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType
>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
                                {"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
|  col1|  col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+

>>> firstelement=f.udf(lambda v:float(v[0]),FloatType())
>>> df.withColumn("col1", firstelement("col1")).show()
+----+------+
|col1|  col2|
+----+------+
| 0.2|[0.25]|
|0.45|[0.85]|
+----+------+

2. Extract first element of multiple vector columns:

To generalize the above solution to all columns, apply the UDF in a list comprehension over df.columns. Here's an example:

>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType

>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
                                {"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
|  col1|  col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+

>>> firstelement=f.udf(lambda v:float(v[0]),FloatType())
>>> df = df.select([firstelement(c).alias(c) for c in df.columns])
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 0.2|0.25|
|0.45|0.85|
+----+----+
Mohamed Ali JAMAOUI
As I understand your problem, you do not need a UDF to unwrap the values into a plain type. You can use the predefined PySpark function concat_ws for it:

>>> from pyspark.sql.functions import *
>>> df.show()
+------+
|   num|
+------+
| [211]|
|[3412]|
| [121]|
| [121]|
|  [34]|
|[1441]|
+------+

>>> df.printSchema()
root
 |-- num: array (nullable = true)
 |    |-- element: string (containsNull = true)

>>> df.withColumn("num", concat_ws("", col("num"))).show()
+----+
| num|
+----+
| 211|
|3412|
| 121|
| 121|
|  34|
|1441|
+----+
Nikhil Suthar