You can explode_outer
the list and then access the elements using regex_extract
.
explode_outer
instead of explode
as dataset contains empty lists and the corresponding row has to be preserved.
Input
+----------------+------------------------------------+-----------------------------------------------------------------------------+------------+
|email |id |maplist |name |
+----------------+------------------------------------+-----------------------------------------------------------------------------+------------+
|imasma@gmail.com|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|[{share=N,product=Banana},{share=N,product=Books}] |Asma Khan |
|imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|[{share=N,product=Banana},{share=N,product=Books},{share=Y,product=Washroom}]|Ravi kumar |
|srk@gmail.com |cd1750dd-a459-4229-a698-4a4f4e66c666|[] |Sharukh Khan|
+----------------+------------------------------------+-----------------------------------------------------------------------------+------------+
Input Schema
root
|-- email: string (nullable = true)
|-- id: string (nullable = true)
|-- maplist: string (nullable = true)
|-- name: string (nullable = true)
Working Example
import pyspark.sql.functions as F
data = [
{
"name": "Asma Khan",
"id": "060f1d2d-08a5-4c18-b0b3-f57cc95eaa96",
"maplist": "[{share=N,product=Banana},{share=N,product=Books}]",
"email": "imasma@gmail.com"
},
{
"name": "Ravi kumar",
"id": "065fa4f9-2853-4fb3-84b0-1608e7f57c7b",
"maplist": "[{share=N,product=Banana},{share=N,product=Books},{share=Y,product=Washroom}]",
"email": "imravi@gmail.com"
},
{
"name": "Sharukh Khan",
"id": "cd1750dd-a459-4229-a698-4a4f4e66c666",
"maplist": "[]",
"email": "srk@gmail.com"
}
]
df = spark.createDataFrame(data=data)
df.withColumn("split_maplist", F.split(F.col("maplist"), "},")).withColumn("exploded_maplist", F.explode_outer(F.col("split_maplist"))).withColumn("share", F.regexp_extract(F.col("exploded_maplist"), r"share=([a-zA-Z]),", 1)).withColumn("product", F.regexp_extract(F.col("exploded_maplist"), r"product=([a-zA-Z]+)", 1)).select("name", "email", "id", "share", "product").show(200, False)
Output
+------------+----------------+------------------------------------+-----+--------+
|name |email |id |share|product |
+------------+----------------+------------------------------------+-----+--------+
|Asma Khan |imasma@gmail.com|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|N |Banana |
|Asma Khan |imasma@gmail.com|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|N |Books |
|Ravi kumar |imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|N |Banana |
|Ravi kumar |imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|N |Books |
|Ravi kumar |imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|Y |Washroom|
|Sharukh Khan|srk@gmail.com |cd1750dd-a459-4229-a698-4a4f4e66c666| | |
+------------+----------------+------------------------------------+-----+--------+