+-----------+----------+-----+
|   M       |     Index|c1   |
+-----------+----------+-----+
|M1         |         0|  224|
|M1         |         1|  748|
|M1         |         3|   56|
+-----------+----------+-----+

I have a DataFrame like the one above. If I use pivot, df.groupBy("M").pivot("Index").agg(first("c1")), I get something like the table below. But this means I am missing '2' in the series. This may be silly but tricky: is there any way to fill in the missing column of the series while doing the pivot?

+-----------+----+---+---+
|   M       |   0|  1|  3|
+-----------+----+---+---+
|M1         |224 |748| 56|
+-----------+----+---+---+

Expected result:

+-----------+----+---+---+---+
|   M       |   0|  1|  2|  3|
+-----------+----+---+---+---+
|M1         |224 |748|  0| 56|
+-----------+----+---+---+---+

1 Answer


Welcome to SO @abc_spark,

Assuming you don't have too many index values in your table, you can try the following approach: first I compute the max index value across the Dataset, then for each index from 0 to maxValue I create a column with a default value of 0 if it isn't already there. Note I'm also filling the null values produced by the pivot with zeros.

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  ("M1", 0, 224),
  ("M1", 1, 748),
  ("M1", 3, 56),
  ("M2", 3, 213)
).toDF("M", "Index", "c1")

// Pivot on Index and replace the nulls produced by the pivot with 0
val pivoted = df.groupBy("M").pivot("Index").agg(first("c1")).na.fill(0)

// Highest index present in the data
val maxValue = df.select(max($"Index")).collect.head.getAs[Int](0)

// Add a zero-filled column for every index from 0 to maxValue
// that the pivot didn't produce
val withAllCols = (0 to maxValue).foldLeft(pivoted) { case (acc, idx) =>
  if (acc.columns contains idx.toString) acc
  else acc.withColumn(idx.toString, lit(0))
}

withAllCols.show(false)

+---+---+---+---+---+
|M  |0  |1  |3  |2  |
+---+---+---+---+---+
|M2 |0  |0  |213|0  |
|M1 |224|748|56 |0  |
+---+---+---+---+---+

Edit: with sorted columns:

withAllCols
      .select("M", withAllCols.columns.filterNot(_ == "M").sortBy(_.toInt):_*)
      .show(false)

+---+---+---+---+---+
|M  |0  |1  |2  |3  |
+---+---+---+---+---+
|M2 |0  |0  |0  |213|
|M1 |224|748|0  |56 |
+---+---+---+---+---+
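
Note that pivot can also take the list of pivot values explicitly; it then creates one column per given value, even for values absent from the data (those come out as null and are zeroed by na.fill). A minimal sketch of that variant, reusing the same df and maxValue:

// Passing the values explicitly yields exactly the columns 0..maxValue,
// already in sorted order; missing indexes become null, then 0
val allCols = df
  .groupBy("M")
  .pivot("Index", 0 to maxValue)
  .agg(first("c1"))
  .na.fill(0)

allCols.show(false)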
  • Is there any way to get the columns in sorted format (integers only), like M,0,1,2,3? – abc_spark Dec 10 '19 at 09:08
  • Definitely, you can either apply a select on the resulting dataframe, based on withAllCols.columns.filterNot(_ == "M").sorted:_*, or just handle it in the foldLeft function (may slow your program; see the sketch after these comments). I'll edit my answer to include this – baitmbarek Dec 10 '19 at 09:13
  • But if the column number is more than 10, 10 comes after 1, which means something like this - Array(0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9) – abc_spark Dec 10 '19 at 11:25
  • Right you'd have to sort Ints instead of strings :) but the logic remains the same – baitmbarek Dec 10 '19 at 11:27
  • I just edited the answer to use a sortBy instead of sorted :) – baitmbarek Dec 10 '19 at 12:21
  • Hi @baitmbarek Would you be able to advise on the question - https://stackoverflow.com/questions/62211108/finding-percentile-in-spark-scala-in-dynamic-way – abc_spark Jun 05 '20 at 10:21
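
A minimal sketch of the in-foldLeft ordering variant mentioned in the comments above (hypothetical, and likely slower, since it reselects the columns on every iteration):

// Hypothetical variant: add each missing column, then re-sort the
// columns numerically inside the fold itself
val sortedInFold = (0 to maxValue).foldLeft(pivoted) { case (acc, idx) =>
  val next =
    if (acc.columns contains idx.toString) acc
    else acc.withColumn(idx.toString, lit(0))
  next.select("M", next.columns.filterNot(_ == "M").sortBy(_.toInt): _*)
}

sortedInFold.show(false)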