Exploding and Splitting columns causing data mismatch

Question

I have a dataset that provides me what people have watched on TV, the duration they have watched and on which network they have watched. We have following columns:

TV ID - string
show_nw - Array which has TV show concatenated with Network
All_nws - Array which has network concatenated with Duration
All_shows - Array which has show concatenated with duration

Sample dataset:

 TV_ID : a1001 , a1002, a1003
 show_nw:  ["TheFactsofLife#1001","Bewitched#1001","Survivor#1000","SEALTeam#1000","WhenWhalesWalkedJourneysinDeepTime#1002","PaidProgramming#1006"], ["AllEliteWrestlingDynamite#1003","TheAdjustmentBureau#1004","Charmed#1003"], ["TMJ4Now#1005"]
 all_nws  : ["1000#7062","1001#602","1002#40","1006#47"], ["1003#7328","1004#46"], ["1005#1543"]
 all_shows : ["Bewitched#563","Survivor#6988","SEALTeam#74","WhenWhalesWalkedJourneysinDeepTime#40","PaidProgramming#47","TheFactsofLife#39"], ["Charmed#462","AllEliteWrestlingDynamite#6866","TheAdjustmentBureau#46"], ["TMJ4Now#1543"]

Now when I am exploding the dataset back from array

test_df = df.select("tv_id", "all_shows", "all_nws").withColumn("all_shows", explode("all_shows")).withColumn("all_nws", explode("all_nws")).withColumn("show",split(col("all_shows"),"#").getItem(0)).withColumn("network",split(col("all_nws"),"#").getItem(0))

My output looks like:

tv_id    all_shows            all_nws          show           network
a1001    Bewitched#563        1000#7062        Bewitched      1000
a1001    Bewitched#563        1001#602         Bewitched      1001
a1001    Bewitched#563        1002#40          Bewitched      1002
a1001    Bewitched#563        1006#47          Bewitched      1006
a1001    Survivor#6988        1000#7062        Survivor       1000
a1001    Survivor#6988        1001#602         Survivor       1001
a1001    Survivor#6988        1002#40          Survivor       1002
a1001    Survivor#6988        1006#47          Survivor       1006

So basically in the parent dataset, Bewitched and Survivor were watched only on network 1000 but when exploded, we found both of them were associated with all the networks that TV_ID had. How do I get the correct dataset after the explosion in this case?

score 2 · Accepted Answer · answered May 19 '20 at 16:42

I think you need to do zip arrays before exploding.

From Spark-2.4 use arrays_zip function.
For Spark < 2.4 need to use udf.

Example:(Spark-2.4)

#sample dataframe.
df.show()
#+-----+--------------------+--------------------+--------------------+
#|tv_id|             show_nw|             all_nws|           all_shows|
#+-----+--------------------+--------------------+--------------------+
#|a1001|[TheFactsofLife#1...|[1000#7062, 1001#...|[Bewitched#563, S...|
#|a1002|[AllEliteWrestlin...|[1003#7328, 1004#46]|[Charmed#462, All...|
#|a1003|      [TMJ4Now#1005]|         [1005#1543]|      [TMJ4Now#1543]|
#+-----+--------------------+--------------------+--------------------+

df.select("tv_id", "all_shows", "all_nws").\
withColumn("all_shows", explode(arrays_zip("all_shows","all_nws"))).\
select("tv_id","all_shows.*").\
withColumn("show",split(col("all_shows"),"#").getItem(0)).\
withColumn("network",split(col("all_nws"),"#").getItem(0)).\
show()
#+-----+--------------------+---------+--------------------+-------+
#|tv_id|           all_shows|  all_nws|                show|network|
#+-----+--------------------+---------+--------------------+-------+
#|a1001|       Bewitched#563|1000#7062|           Bewitched|   1000|
#|a1001|       Survivor#6988| 1001#602|            Survivor|   1001|
#|a1001|         SEALTeam#74|  1002#40|            SEALTeam|   1002|
#|a1001|WhenWhalesWalkedJ...|  1006#47|WhenWhalesWalkedJ...|   1006|
#|a1001|  PaidProgramming#47|     null|     PaidProgramming|   null|
#|a1001|   TheFactsofLife#39|     null|      TheFactsofLife|   null|
#|a1002|         Charmed#462|1003#7328|             Charmed|   1003|
#|a1002|AllEliteWrestling...|  1004#46|AllEliteWrestling...|   1004|
#|a1002|TheAdjustmentBure...|     null| TheAdjustmentBureau|   null|
#|a1003|        TMJ4Now#1543|1005#1543|             TMJ4Now|   1005|
#+-----+--------------------+---------+--------------------+-------+

Thats great. But one thing i am seeing in ur output is that some are coming as null in network — Bitanshu Das, May 20 '20 at 05:52
@BitanshuDas, We are zipping two arrays and `all_shows` column has 6 elements in array and `all_nws ` have 4 elements in array so last ** two elements** doesn't have zip in `all_nws` column this results null in network! — notNull, May 20 '20 at 06:20

Exploding and Splitting columns causing data mismatch

1 Answers1