My dataframe
df_a = spark.createDataFrame( [
(0, ["B","C","D","E"] , [1,2,3,4] ),
(1,["E","A","C"] , [1,2,3] ),
(2, ["F","A","E","B"] , [1,2,3,4]),
(3,["E","G","A"] , [1,2,3 ]),
(4,["A","C","E","B","D"] , [1,2,3,4,5])] , ["id","items",'rank'])
and i want my output as :
+---+----+----+
| id|item|rank|
+---+----+----+
| 0| B| 1|
| 0| C| 2|
| 0| D| 3|
| 0| E| 4|
| 1| E| 1|
| 1| A| 2|
| 1| C| 3|
| 2| F| 1|
| 2| A| 2|
| 2| E| 3|
| 2| B| 4|
| 3| E| 1|
| 3| G| 2|
| 3| A| 3|
| 4| A| 1|
| 4| C| 2|
| 4| E| 3|
| 4| B| 4|
| 4| D| 5|
+---+----+----+
my dataframe has 8 million rows and when i try to zip and explode as below, its extremely slow and job runs forever using 15executors and 25GB memory
zip_udf2 = F.udf(
lambda x, y: list(zip(x, y)),
ArrayType(StructType([
StructField("item", StringType()),
StructField("rank", IntegerType())
]))
)
(df_a
.withColumn('tmp', zip_udf2("items", "rank"))
.withColumn("tmp", F.explode('tmp'))
.select("id", F.col("tmp.item"), F.col("tmp.rank"))
.show())
Any alternate methods? i tried rdd.flatMap still didn't make a dent on the performance. number of elements in the arrays in each row varies.