
I have existing dataframe as:

Timestamp,         ID,   Value
29/08/2017 4:51:23, ID-1, 1.1
29/08/2017 4:52:14, ID-2, 2.1
29/08/2017 4:52:14, ID-3, 3.1
29/08/2017 4:55:23, ID-1, 1.2
29/08/2017 4:55:23, ID-3, 3.2
29/08/2017 4:57:42, ID-2, 2.2

I would like to create a dataframe from existing dataframe such that 'IDs' will become column names and 'Values' will become the data of respective columns as:

Timestamp,            ID-1,    ID-2,   ID-3
29/08/2017 4:51:23,    1.1,     null,   null
29/08/2017 4:52:14,    null,    2.1,    3.1
29/08/2017 4:55:23,    1.2,     null,   3.2
29/08/2017 4:57:42,    null,    2.2,    null

I am not able to figure out a feasible solution for this in Scala. Help is appreciated. Thanks in advance.

philantrovert
Swati

1 Answer


You can simply use groupBy() together with pivot(). Using your example dataframe:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = ...

df.show()
+------------------+----+-----+
|         Timestamp|  ID|Value|
+------------------+----+-----+
|29/08/2017 4:51:23|ID-1|  1.1|
|29/08/2017 4:52:14|ID-2|  2.1|
|29/08/2017 4:52:14|ID-3|  3.1|
|29/08/2017 4:55:23|ID-1|  1.2|
|29/08/2017 4:55:23|ID-3|  3.2|
|29/08/2017 4:57:42|ID-2|  2.2|
+------------------+----+-----+

val newDF = df.groupBy("Timestamp")
  .pivot("ID")
  .agg(sum($"Value"))

newDF.show()
+------------------+----+----+----+
|         Timestamp|ID-1|ID-2|ID-3|
+------------------+----+----+----+
|29/08/2017 4:57:42|null| 2.2|null|
|29/08/2017 4:55:23| 1.2|null| 3.2|
|29/08/2017 4:51:23| 1.1|null|null|
|29/08/2017 4:52:14|null| 2.1| 3.1|
+------------------+----+----+----+

If two or more rows share the same Timestamp and ID, their values are summed together. If you want different behavior, simply change the agg() method; e.g. for the average use agg(avg($"Value")).
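As a self-contained sketch (recreating the example data from the question), you can also pass the expected ID values to pivot() explicitly. This skips the extra job Spark otherwise runs to collect the distinct values and fixes the column order:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Recreate the example dataframe from the question
val df = Seq(
  ("29/08/2017 4:51:23", "ID-1", 1.1),
  ("29/08/2017 4:52:14", "ID-2", 2.1),
  ("29/08/2017 4:52:14", "ID-3", 3.1),
  ("29/08/2017 4:55:23", "ID-1", 1.2),
  ("29/08/2017 4:55:23", "ID-3", 3.2),
  ("29/08/2017 4:57:42", "ID-2", 2.2)
).toDF("Timestamp", "ID", "Value")

// Listing the pivot values explicitly avoids a distinct-value scan
// and guarantees the output columns ID-1, ID-2, ID-3 in that order.
val pivoted = df.groupBy("Timestamp")
  .pivot("ID", Seq("ID-1", "ID-2", "ID-3"))
  .agg(avg($"Value"))
```

Since each (Timestamp, ID) pair appears at most once in this data, avg and sum produce the same result here.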

Hope it helps!

Shaido