I have a dataset that provides me what people have watched on TV, the duration they have watched and on which network they have watched. We have following columns:
TV ID - string
show_nw - Array which has TV show concatenated with Network
All_nws - Array which has network concatenated with Duration
All_shows - Array which has show concatenated with duration
Sample dataset:
TV_ID : a1001 , a1002, a1003
show_nw: ["TheFactsofLife#1001","Bewitched#1001","Survivor#1000","SEALTeam#1000","WhenWhalesWalkedJourneysinDeepTime#1002","PaidProgramming#1006"], ["AllEliteWrestlingDynamite#1003","TheAdjustmentBureau#1004","Charmed#1003"], ["TMJ4Now#1005"]
all_nws : ["1000#7062","1001#602","1002#40","1006#47"], ["1003#7328","1004#46"], ["1005#1543"]
all_shows : ["Bewitched#563","Survivor#6988","SEALTeam#74","WhenWhalesWalkedJourneysinDeepTime#40","PaidProgramming#47","TheFactsofLife#39"], ["Charmed#462","AllEliteWrestlingDynamite#6866","TheAdjustmentBureau#46"], ["TMJ4Now#1543"]
Now when I am exploding the dataset back from array
test_df = df.select("tv_id", "all_shows", "all_nws").withColumn("all_shows", explode("all_shows")).withColumn("all_nws", explode("all_nws")).withColumn("show",split(col("all_shows"),"#").getItem(0)).withColumn("network",split(col("all_nws"),"#").getItem(0))
My output looks like:
tv_id all_shows all_nws show network
a1001 Bewitched#563 1000#7062 Bewitched 1000
a1001 Bewitched#563 1001#602 Bewitched 1001
a1001 Bewitched#563 1002#40 Bewitched 1002
a1001 Bewitched#563 1006#47 Bewitched 1006
a1001 Survivor#6988 1000#7062 Survivor 1000
a1001 Survivor#6988 1001#602 Survivor 1001
a1001 Survivor#6988 1002#40 Survivor 1002
a1001 Survivor#6988 1006#47 Survivor 1006
So basically in the parent dataset, Bewitched and Survivor were watched only on network 1000 but when exploded, we found both of them were associated with all the networks that TV_ID had. How do I get the correct dataset after the explosion in this case?