I have a table as shown:
I want to transform it into the following table using Spark Java or Spark Scala
Make sure you have unique column names; then you can do:
import org.apache.spark.sql.functions._

table
  .select(col("id"), col("movie"),
    explode(array("cast1", "cast2", "cast3", "cast4")).as("cast"))
  .where(col("cast").isNotNull)
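For reference, a self-contained sketch of this explode-based approach; the local SparkSession setup and the sample rows below are assumptions, mirroring the example in the union answer further down:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// hypothetical sample data matching the wide table in the question
val table = List(
  (101, "ABC", "A", "B", "C", "D"),
  (102, "XZY", "G", "J", null, null))
  .toDF("ID", "Movie", "Cast1", "Cast2", "Cast3", "Cast4")

val result = table
  .select(col("ID"), col("Movie"),
    explode(array("Cast1", "Cast2", "Cast3", "Cast4")).as("Cast"))
  .where(col("Cast").isNotNull) // drop the padding nulls

result.show(false)
```

Note that `explode` keeps null array elements as rows, which is why the `isNotNull` filter is needed afterwards.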
With "union":
import spark.implicits._ // needed for toDF and the $ column syntax

val table = List(
(101, "ABC", "A", "B", "C", "D"),
(102, "XZY", "G", "J", null, null))
.toDF("ID", "Movie", "Cast1", "Cast2", "Cast3", "Cast4")
val columnsToUnion = List("Cast1", "Cast2", "Cast3", "Cast4")
val result = columnsToUnion.map(name => table.select($"ID", $"Movie", col(name).alias("Cast")).where(col(name).isNotNull))
.reduce(_ union _)
result.show(false)
Output:
+---+-----+----+
|ID |Movie|Cast|
+---+-----+----+
|101|ABC |A |
|102|XZY |G |
|101|ABC |B |
|102|XZY |J |
|101|ABC |C |
|101|ABC |D |
+---+-----+----+
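As a side note (not part of the answers above), the same unpivot can also be written with Spark SQL's built-in `stack` generator, which avoids both the intermediate array and the union of four DataFrames; a sketch reusing the same `table` as above:

```scala
import org.apache.spark.sql.functions.col

// stack(4, ...) turns each input row into 4 rows, one Cast value each
val stacked = table
  .selectExpr("ID", "Movie", "stack(4, Cast1, Cast2, Cast3, Cast4) as Cast")
  .where(col("Cast").isNotNull)

stacked.show(false)
```

Unlike the union approach, `stack` scans the input only once, and the row order groups the casts of each movie together.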
NOTE: A table cannot have several columns with the same name; the snippet below assumes the column names follow the pattern "Cast[i]":
table.groupBy("ID", "Movie")
  .agg(flatten(collect_list(array("Cast1", "Cast2", "Cast3", "Cast4"))).as("cast"))
  .withColumn("cast", explode(col("cast")))
// a side note: you should always avoid duplicate column names in the same DataFrame