Let's say I have the following DataFrame:

val df = spark.sparkContext.parallelize(Seq(
        ("A", "12", 50),
        ("A", "13", 100),
        ("A", "14", 30),
        ("B", "15", 40),
        ("C", "16", 60),
        ("C", "17", 70)
      )).toDF("Name", "Time", "Value")

and I pivot by 'Time':

val pivoted = df.groupBy($"Name").
    pivot("Time").
    agg(coalesce(sum($"Value"),lit(0)))

pivoted.show()

which results in:

+----+----+----+----+----+----+----+
|Name|  12|  13|  14|  15|  16|  17|
+----+----+----+----+----+----+----+
|   B|null|null|null|  40|null|null|
|   C|null|null|null|null|  60|  70|
|   A|  50| 100|  30|null|null|null|
+----+----+----+----+----+----+----+

So far, everything is fine. What I want is to add a column after column '17' containing the sum of each row, so the expected output would be:

+----+----+----+----+----+----+----+---+
|Name|  12|  13|  14|  15|  16|  17|sum|
+----+----+----+----+----+----+----+---+
|   B|null|null|null|  40|null|null| 40|
|   C|null|null|null|null|  60|  70|130|
|   A|  50| 100|  30|null|null|null|180|
+----+----+----+----+----+----+----+---+

Naively, I've tried adding a withColumn, but it fails (after the pivot there is no Value column left to sum over, and sum is an aggregate function anyway):

val pivotedWithSummation = df.groupBy($"Name").
    pivot("Time").
    agg(coalesce(sum($"Value"),lit(0))).
    withColumn("summation", sum($"Value"))

I came across this answer, but I couldn't apply it.

I'm using Scala 2.11.8 and Spark 2.3.1.

Thanks in advance!

1 Answer

Get the sum of the values from the original input DataFrame and join it with your pivoted DataFrame:

scala> val pivoted = df.groupBy($"Name").pivot("Time").agg(coalesce(sum($"Value"),lit(0)))
pivoted: org.apache.spark.sql.DataFrame = [Name: string, 12: bigint ... 5 more fields]

scala> pivoted.show
+----+----+----+----+----+----+----+
|Name|  12|  13|  14|  15|  16|  17|
+----+----+----+----+----+----+----+
|   B|null|null|null|  40|null|null|
|   C|null|null|null|null|  60|  70|
|   A|  50| 100|  30|null|null|null|
+----+----+----+----+----+----+----+


scala> val sumOfValuesDF = df.groupBy($"Name").sum("value")
sumOfValuesDF: org.apache.spark.sql.DataFrame = [Name: string, sum(value): bigint]

scala> sumOfValuesDF.show
+----+----------+
|Name|sum(value)|
+----+----------+
|   B|        40|
|   C|       130|
|   A|       180|
+----+----------+


scala> val pivotedWithSummation = pivoted.join(sumOfValuesDF, "Name")
pivotedWithSummation: org.apache.spark.sql.DataFrame = [Name: string, 12: bigint ... 6 more fields]

scala> pivotedWithSummation.show
+----+----+----+----+----+----+----+----------+
|Name|  12|  13|  14|  15|  16|  17|sum(value)|
+----+----+----+----+----+----+----+----------+
|   B|null|null|null|  40|null|null|        40|
|   C|null|null|null|null|  60|  70|       130|
|   A|  50| 100|  30|null|null|null|       180|
+----+----+----+----+----+----+----+----------+
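
If you want the final column to be named "sum" as in the expected output, rename it after the join; sum(value) is just Spark's default name for the aggregate, and "result" below is an arbitrary name for the renamed DataFrame:

val result = pivotedWithSummation.withColumnRenamed("sum(value)", "sum")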
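
Alternatively, here is a minimal sketch that avoids the second aggregation and the join by summing the pivoted columns row-wise. It assumes the pivoted DataFrame defined above; coalesce turns each null into 0 so a single null doesn't null out the whole row sum:

import org.apache.spark.sql.functions.{coalesce, col, lit}

// Add all pivoted Time columns together, substituting 0 for null
// so the addition isn't nulled out by missing cells.
val rowSum = pivoted.columns
  .filter(_ != "Name")                  // keep only the pivoted columns
  .map(c => coalesce(col(c), lit(0)))
  .reduce(_ + _)

val pivotedWithSum = pivoted.withColumn("sum", rowSum)

This stays in a single pass over the pivoted result, at the cost of building the sum expression from the column list at runtime.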