id texts vector
0 [a, b, c] (3,[0,1,2],[1.0,1.0,1.0])
1 [a, b, c] (3,[0,1,2],[2.0,2.0,1.0])

The above is my Spark DataFrame. I want to convert it to something like this:

id texts list_2
0 a 1.0
0 b 1.0
0 c 1.0
1 a 2.0
1 b 2.0
1 c 1.0

1 Answer

from pyspark.sql.types import ArrayType, DoubleType, FloatType, StructField, StructType
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import col, explode, udf


def to_array_(v):
    # Works for both dense and sparse vectors.
    return v.toArray().tolist()


def to_vector_(v):
    return Vectors.dense(v)


to_array = udf(lambda z: to_array_(z), ArrayType(DoubleType()))  # watch your return type
to_vector = udf(lambda z: to_vector_(z), VectorUDT())  # helper to build an example for your question
# Only valid when element 2 is itself a vector (e.g. a vector of vectors); I'm too lazy to contrive an example of that, so it is not used below.
getFeatureVector = udf(lambda v: v[2], VectorUDT())
# Works for this example: returns element 2 as a float, and gives you the general idea of how to access vector elements.
getFeatureVectorExample = udf(lambda v: float(v[2]), FloatType())

schema = ["id","texts","vector"]
data = [
(0,['a', 'b', 'c'],[1.0,1.0,1.0]), #small cheat
(1,['a', 'b', 'c'],[2.0,2.0,1.0]),
]
df = spark.createDataFrame( data, schema )


df = df.withColumn("vector", to_vector(df.vector) ) #convert the array to a vector so I can prove this works
#DataFrame[id: bigint, texts: array<string>, vector: vector]

This may make you ask: how do I access an element of the vector in order to turn it into an array? We use another UDF that does the translation for us.

df.select(col('*'), getFeatureVectorExample( df.vector ) ).show()
+---+---------+-------------+----------------+
| id|    texts|       vector|<lambda>(vector)|
+---+---------+-------------+----------------+
|  0|[a, b, c]|[1.0,1.0,1.0]|             1.0|
|  1|[a, b, c]|[2.0,2.0,1.0]|             1.0|
+---+---------+-------------+----------------+

OK, so now we know how to get the element we're interested in. The rest of this example shows how to convert a vector into an array and then explode it.

(
    df.withColumn('text', explode(df.texts))  # I use withColumn as I'm lazy
    .withColumn('feature', explode(to_array(df.vector)))  # you can't have two explodes in one select, so don't try
    .drop('texts', 'vector')  # bookkeeping: drop the columns you don't want
    .show()
)
+---+----+-------+
| id|text|feature|
+---+----+-------+
|  0|   a|    1.0|
|  0|   a|    1.0|
|  0|   a|    1.0|
|  0|   b|    1.0|
|  0|   b|    1.0|
|  0|   b|    1.0|
|  0|   c|    1.0|
|  0|   c|    1.0|
|  0|   c|    1.0|
|  1|   a|    2.0|
|  1|   a|    2.0|
|  1|   a|    1.0|
|  1|   b|    2.0|
|  1|   b|    2.0|
|  1|   b|    1.0|
|  1|   c|    2.0|
|  1|   c|    2.0|
|  1|   c|    1.0|
+---+----+-------+
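
Note that chaining two explodes gives every text/feature combination (nine rows per id), not the position-by-position pairing shown in the question. One way to get the pairing, as a rough sketch: zip the two columns inside a single UDF and explode once. The names zip_texts_with_vector and explode_pairs are made up for this illustration, and the pyspark imports sit inside explode_pairs only so the plain-Python helper can be tried without Spark.

```python
def zip_texts_with_vector(texts, vector):
    """Pair each text with the vector value at the same position."""
    # toArray() exists on both DenseVector and SparseVector; plain lists pass through.
    values = vector.toArray().tolist() if hasattr(vector, "toArray") else list(vector)
    return [(t, float(x)) for t, x in zip(texts, values)]


def explode_pairs(df):
    """One row per (id, text, value); assumes columns id, texts and vector."""
    # Imported here so the helper above can be tried without Spark installed.
    from pyspark.sql.functions import col, explode, udf
    from pyspark.sql.types import (
        ArrayType, DoubleType, StringType, StructField, StructType,
    )

    pair_type = ArrayType(StructType([
        StructField("text", StringType()),
        StructField("value", DoubleType()),
    ]))
    zip_udf = udf(zip_texts_with_vector, pair_type)
    return (
        df.withColumn("pair", explode(zip_udf(col("texts"), col("vector"))))
        .select("id", col("pair.text").alias("texts"), col("pair.value").alias("list_2"))
    )
```

With the df built earlier, explode_pairs(df).show() should give three rows per id, one for each position in the vector.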

To further clarify, if you wish to access elements of a vector you can define a UDF for it.

This one pulls out the last element (index 2) and declares a VectorUDT return type, so it only fits when that element is itself a vector, but it gives a hint of how to access other elements:

getFeatureVector = udf(lambda v: v[2], VectorUDT())

If the elements are a different type, you will need extra logic to handle them and a matching return type. Here's an example that accesses the first element (index 0) of a vector and returns it as a FloatType:

getFeatureVectorExample = udf(lambda v: float(v[0]), FloatType())
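
If you need more than one index, a small factory saves writing a UDF per element. This is only a sketch: element_at and element_udf are names invented here, and DoubleType is my choice of return type.

```python
def element_at(v, index):
    """Return one element as a plain Python float (not a NumPy scalar)."""
    return float(v[index])


def element_udf(index):
    """Build a UDF that extracts the element at `index` from a vector column."""
    # Imported here so element_at can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    return udf(lambda v: element_at(v, index), DoubleType())
```

Usage would look like df.select('*', element_udf(0)(df.vector).alias('first')).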

You can of course combine these elements and return a more complex structure that may suit your needs. I suggest returning them as a struct, because you can then use 'column_name.*' to expand all of the struct's fields into columns, or struct_column.field_name to pull out a single field as a column. The example that follows shows how to build out the return type.

Further example using multiple elements in a struct and turning them into columns


def structExample(v):
    # Both fields read element 0 here; each one could be any element you need.
    return (
        float(v[0]),
        float(v[0]),
    )


getstructExample = udf(structExample, StructType([StructField("flt", FloatType(), False), StructField("array", FloatType())]))

df.select(col('*'), getstructExample( df.vector ).alias("struct") ).select(col("struct.*")).show()
+---+-----+
|flt|array|
+---+-----+
|1.0|  1.0|
|2.0|  2.0|
+---+-----+
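
The same struct trick can split a sparse vector such as (3,[0,1,2],[1.0,1.0,1.0]) into its three parts. A minimal sketch, assuming the column really holds pyspark.ml.linalg.SparseVector values (a DenseVector has no indices attribute); sparse_parts and with_sparse_parts are hypothetical names used only for this illustration.

```python
def sparse_parts(v):
    """Split a sparse vector into plain Python (size, indices, values)."""
    # Assumes v exposes .size, .indices and .values, as SparseVector does.
    return (int(v.size), [int(i) for i in v.indices], [float(x) for x in v.values])


def with_sparse_parts(df, vector_col="vector"):
    """Add size, indices and values columns taken from a sparse vector column."""
    # Imported here so sparse_parts can be tried without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import (
        ArrayType, DoubleType, IntegerType, StructField, StructType,
    )

    parts_type = StructType([
        StructField("size", IntegerType()),
        StructField("indices", ArrayType(IntegerType())),
        StructField("values", ArrayType(DoubleType())),
    ])
    parts_udf = udf(sparse_parts, parts_type)
    # Expand the struct into three columns, then drop the temporary struct.
    return (
        df.withColumn("parts", parts_udf(col(vector_col)))
        .select("*", "parts.*")
        .drop("parts")
    )
```

The casts to int and float matter because the vector stores NumPy values, which a Python UDF should not hand back to Spark as-is.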
Matt Andruff
  • Hi Matt, Thank you for the help. While I understood how you are exploding an array to multiple rows, Can you please guide how to extract this value in a column -> (3,[0,1,2],[1.0,1.0,1.0]) into three columns with values -> 3, [0, 1, 2] and [1.0, 1.0, 1.0] ? – Devansh Popat Jul 13 '22 at 17:00
  • I added to my answer to help explain in more detail if I get time tonight I'll write up a better example. – Matt Andruff Jul 13 '22 at 17:43
  • Added a little more detailed example, this should give you everything you need. – Matt Andruff Jul 13 '22 at 18:01