
This is similar to the question Pyspark: cast array with nested struct to string.

However, the accepted answer there does not work for my case, so I am asking here.

 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)

Sample JSON

{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}

This gives the result in a single column:

import pyspark.sql.functions as F

df.selectExpr("EXPLODE(Col2) AS structCol") \
  .select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")) \
  .show()

+-------------+
|Col2_concated|
+-------------+
|      foo,bar|
+-------------+

But how do I get a result (a DataFrame) like this?

+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo,bar        |
+-------+---------------+

EDIT: The solution suggested in the comments gives the wrong result, because EXPLODE produces one output row per array element, duplicating Col1:

df.selectExpr("Col1", "EXPLODE(Col2) AS structCol") \
  .select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")) \
  .show()

+------+-------------+
|  Col1|Col2_concated|
+------+-------------+
|abc123|          foo|
|abc123|          bar|
+------+-------------+
Watt
  • You can select "Col1", though it is not an expression: df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show() – Ranga Vure Mar 15 '19 at 09:57
  • Sorry, this doesn't work; I added the reason in the question – Watt Mar 15 '19 at 15:43

1 Answer


Just avoid the explode and you are already there. All you need is the concat_ws function, which concatenates multiple string columns with a given separator. See the example below:

from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))

# printSchema tells us the column names we can use with concat_ws
df.printSchema()

Output:

root
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)

The column Col2 is an array of structs, each containing Col2Sub, and we can use dotted field access on the array to get the desired result:

bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))

bla.show()
+------+-------+                                                                
|  Col1|   Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+
cronoik