I have a table as shown:
I want to transform it into the following table using Spark Java or Spark Scala
Make sure you have unique column names; then you can do:
import org.apache.spark.sql.functions._

table
  .select(col("id"), col("movie"),
    explode(array("cast1", "cast2", "cast3", "cast4")).as("cast"))
  .where(col("cast").isNotNull)
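For reference, a self-contained sketch of this explode-based approach; the local SparkSession setup and the sample rows below are assumptions, mirroring the example in the union answer further down:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// hypothetical sample data matching the wide table in the question
val table = List(
  (101, "ABC", "A", "B", "C", "D"),
  (102, "XZY", "G", "J", null, null))
  .toDF("ID", "Movie", "Cast1", "Cast2", "Cast3", "Cast4")

val result = table
  .select(col("ID"), col("Movie"),
    explode(array("Cast1", "Cast2", "Cast3", "Cast4")).as("Cast"))
  .where(col("Cast").isNotNull) // drop the padding nulls

result.show(false)
```

Note that `explode` keeps null array elements as rows, which is why the `isNotNull` filter is needed afterwards.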
With "union":
import spark.implicits._ // needed for toDF and the $ column syntax

val table = List(
(101, "ABC", "A", "B", "C", "D"),
(102, "XZY", "G", "J", null, null))
.toDF("ID", "Movie", "Cast1", "Cast2", "Cast3", "Cast4")
val columnsToUnion = List("Cast1", "Cast2", "Cast3", "Cast4")
val result = columnsToUnion.map(name => table.select($"ID", $"Movie", col(name).alias("Cast")).where(col(name).isNotNull))
.reduce(_ union _)
result.show(false)
Output:
+---+-----+----+
|ID |Movie|Cast|
+---+-----+----+
|101|ABC |A |
|102|XZY |G |
|101|ABC |B |
|102|XZY |J |
|101|ABC |C |
|101|ABC |D |
+---+-----+----+
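As a side note (not part of the answers above), the same unpivot can also be written with Spark SQL's built-in `stack` generator, which avoids both the intermediate array and the union of four DataFrames; a sketch reusing the same `table` as above:

```scala
import org.apache.spark.sql.functions.col

// stack(4, ...) turns each input row into 4 rows, one Cast value each
val stacked = table
  .selectExpr("ID", "Movie", "stack(4, Cast1, Cast2, Cast3, Cast4) as Cast")
  .where(col("Cast").isNotNull)

stacked.show(false)
```

Unlike the union approach, `stack` scans the input only once, and the row order groups the casts of each movie together.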
NOTE: A table cannot have several columns with the same name; the snippet below assumes the column names follow the pattern "Cast[i]":
table.groupBy("ID", "Movie")
  .agg(flatten(collect_list(array("Cast1", "Cast2", "Cast3", "Cast4"))).as("cast"))
  .withColumn("cast", explode(col("cast")))
// a side note: you should always avoid duplicate column names in the same DataFrame