0

I have csv file and loaded into a dataframe. there is a column maplist contain PFA image

raw_dataset I tried to put into html table but unable to do that why i have put into ; seprated format. this actual datasets format

|Asma Khan|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|[{share=N, product= Banana}, {share=N, product= Books}]|imasma@gmail.com|
|Ravi kumar|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|[{share=N, product= Banana}, {share=N, product= Books}, {share=Y, product= Washroom}]|imravi@gmail.com|
|Sharukh Khan|cd1750dd-a459-4229-a698-4a4f4e66c666|[]|srk@gmail.com|

I want to split the column below into 2 columns and create rows as per the products.

expected output

Nithish
  • 3,062
  • 2
  • 8
  • 16
Ravi Kumar
  • 66
  • 7
  • Always use the following format https://stackoverflow.com/a/48427186/7989581 to structure questions. Thanks! – Nithish Nov 12 '21 at 11:18

1 Answers1

0

You can explode_outer the list and then access the elements using regex_extract.

explode_outer instead of explode as dataset contains empty lists and the corresponding row has to be preserved.

Input

+----------------+------------------------------------+-----------------------------------------------------------------------------+------------+
|email           |id                                  |maplist                                                                      |name        |
+----------------+------------------------------------+-----------------------------------------------------------------------------+------------+
|imasma@gmail.com|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|[{share=N,product=Banana},{share=N,product=Books}]                           |Asma Khan   |
|imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|[{share=N,product=Banana},{share=N,product=Books},{share=Y,product=Washroom}]|Ravi kumar  |
|srk@gmail.com   |cd1750dd-a459-4229-a698-4a4f4e66c666|[]                                                                           |Sharukh Khan|
+----------------+------------------------------------+-----------------------------------------------------------------------------+------------+

Input Schema

root
 |-- email: string (nullable = true)
 |-- id: string (nullable = true)
 |-- maplist: string (nullable = true)
 |-- name: string (nullable = true)

Working Example

import pyspark.sql.functions as F

data = [
  {
    "name": "Asma Khan",
    "id": "060f1d2d-08a5-4c18-b0b3-f57cc95eaa96",
    "maplist": "[{share=N,product=Banana},{share=N,product=Books}]",
    "email": "imasma@gmail.com"
  },
  {
    "name": "Ravi kumar",
    "id": "065fa4f9-2853-4fb3-84b0-1608e7f57c7b",
    "maplist": "[{share=N,product=Banana},{share=N,product=Books},{share=Y,product=Washroom}]",
    "email": "imravi@gmail.com"
  },
  {
    "name": "Sharukh Khan",
    "id": "cd1750dd-a459-4229-a698-4a4f4e66c666",
    "maplist": "[]",
    "email": "srk@gmail.com"
  }
]

df = spark.createDataFrame(data=data)

df.withColumn("split_maplist", F.split(F.col("maplist"), "},")).withColumn("exploded_maplist", F.explode_outer(F.col("split_maplist"))).withColumn("share", F.regexp_extract(F.col("exploded_maplist"), r"share=([a-zA-Z]),", 1)).withColumn("product", F.regexp_extract(F.col("exploded_maplist"), r"product=([a-zA-Z]+)", 1)).select("name", "email", "id", "share", "product").show(200, False)

Output

+------------+----------------+------------------------------------+-----+--------+
|name        |email           |id                                  |share|product |
+------------+----------------+------------------------------------+-----+--------+
|Asma Khan   |imasma@gmail.com|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|N    |Banana  |
|Asma Khan   |imasma@gmail.com|060f1d2d-08a5-4c18-b0b3-f57cc95eaa96|N    |Books   |
|Ravi kumar  |imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|N    |Banana  |
|Ravi kumar  |imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|N    |Books   |
|Ravi kumar  |imravi@gmail.com|065fa4f9-2853-4fb3-84b0-1608e7f57c7b|Y    |Washroom|
|Sharukh Khan|srk@gmail.com   |cd1750dd-a459-4229-a698-4a4f4e66c666|     |        |
+------------+----------------+------------------------------------+-----+--------+
Nithish
  • 3,062
  • 2
  • 8
  • 16
  • Output is different from the expected output. Please check the attachment of input and output – Ravi Kumar Nov 13 '21 at 07:08
  • @RaviKumar The logic works regardless of the input given, but to reduce ambiguity for you, I have modified the input based on the image you had attached, if you want other columns included, paste the input as text as is the guideline, I dont want to parse an image to generate input. – Nithish Nov 13 '21 at 08:09
  • I have paste raw dataset in question. please try to split into expected output format. – Ravi Kumar Nov 13 '21 at 13:34
  • @RaviKumar modified the code to include your dataset – Nithish Nov 13 '21 at 14:59
  • it is giving error, can we connect on https://www.linkedin.com/in/ravi-kumar-84800873/. – Ravi Kumar Nov 13 '21 at 15:58
  • @RaviKumar if the answer help. do upvote and accept it so people having similar problems know what works. Thanks! – Nithish Nov 13 '21 at 18:59