I have a spark dataframe and one of its fields is an array of Row structures. I need to expand it into their own columns. One of the problems is in the array, sometimes a field is missing.
The following is an example:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as udf
spark = SparkSession.builder.getOrCreate()
# data
rows = [{'status':'active','member_since':1990,'info':[Row(tag='name',value='John'),Row(tag='age',value='50'),Row(tag='phone',value='1234567')]},
{'status':'inactive','member_since':2000,'info':[Row(tag='name',value='Tom'),Row(tag='phone',value='1234567')]},
{'status':'active','member_since':2015,'info':[Row(tag='name',value='Steve'),Row(tag='age',value='28')]}]
# create dataframe
df = spark.createDataFrame(rows)
# transform info to dict
to_dict = udf.UserDefinedFunction(lambda s:dict(s),MapType(StringType(),StringType()))
df = df.withColumn("info_dict",to_dict("info"))
# extract name, NA if not exists
extract_name = udf.UserDefinedFunction(lambda s:s.get("name","NA"))
df = df.withColumn("name",extract_name("info_dict"))
# extract age, NA if not exists
extract_age = udf.UserDefinedFunction(lambda s:s.get("age","NA"))
df = df.withColumn("age",extract_age("info_dict"))
# extract phone, NA if not exists
extract_phone = udf.UserDefinedFunction(lambda s:s.get("phone","NA"))
df = df.withColumn("phone",extract_phone("info_dict"))
df.show()
You can see for 'Tom', 'age' is missing; for 'Steve', 'phone' is missing. Like the above code snippet, my current solution is to first transform the array into dict and then parse each individual field into their column. The result is like this:
+--------------------+------------+--------+--------------------+-----+---+-------+
| info|member_since| status| info_dict| name|age| phone|
+--------------------+------------+--------+--------------------+-----+---+-------+
|[[name, John], [a...| 1990| active|[name -> John, ph...| John| 50|1234567|
|[[name, Tom], [ph...| 2000|inactive|[name -> Tom, pho...| Tom| NA|1234567|
|[[name, Steve], [...| 2015| active|[name -> Steve, a...|Steve| 28| NA|
+--------------------+------------+--------+--------------------+-----+---+-------+
I really just want the columns 'status','member_since','name', 'age' and 'phone'. This solution works but rather slow because of the UDF. Is there any faster alternatives? Thanks