Assuming we have the following DF
scala> df.show
+---+----+----+----+-------------------+---+
| id|name| cnt| amt| dt|scn|
+---+----+----+----+-------------------+---+
| 1|null| 1|1.12|2000-01-02 00:11:11|112|
| 1| aaa| 1|1.11|2000-01-01 00:00:00|111|
| 2| bbb|null|2.22|2000-01-03 12:12:12|201|
| 2|null| 2|1.13| null|200|
| 2|null|null|2.33| null|202|
| 3| ccc| 3|3.34| null|302|
| 3|null|null|3.33| null|301|
| 3|null|null| 0.0|2000-12-31 23:59:59|300|
+---+----+----+----+-------------------+---+
I want to get the following DF: grouped by id, sorted by scn, taking the last non-null value of every column (except id and scn).
It can be done like this:
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.orderBy("scn")
.groupBy("id")
.agg(last("name", true) as "name",
last("cnt", true) as "cnt",
last("amt", true) as "amt",
last("dt", true) as "dt")
.show
// Exiting paste mode, now interpreting.
+---+----+---+----+-------------------+
| id|name|cnt| amt| dt|
+---+----+---+----+-------------------+
| 1| aaa| 1|1.12|2000-01-02 00:11:11|
| 3| ccc| 3|3.34|2000-12-31 23:59:59|
| 2| bbb| 2|2.33|2000-01-03 12:12:12|
+---+----+---+----+-------------------+
In real life I need to process different DFs with a large number of columns.
My question is: how can I specify all columns (except id and scn) in the .agg(last(col_name, true)) call programmatically?
Code for generating a source DF:
case class C(id: Integer, name: String, cnt: Integer, amt: Double, dt: String, scn: Integer)
val cc = Seq(
C(1, null, 1, 1.12, "2000-01-02 00:11:11", 112),
C(1, "aaa", 1, 1.11, "2000-01-01 00:00:00", 111),
C(2, "bbb", null, 2.22, "2000-01-03 12:12:12", 201),
C(2, null, 2, 1.13, null, 200),
C(2, null, null, 2.33, null, 202),
C(3, "ccc", 3, 3.34, null, 302),
C(3, null, null, 3.33, "20001-01-01 00:33:33", 301),
C(3, null, null, 0.00, "2000-12-31 23:59:59", 300)
)
val t = sc.parallelize(cc, 4).toDF()
val df = t.withColumn("dt", $"dt".cast("timestamp"))
val cols = df.columns.filterNot(Set("id", "scn"))
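A sketch of the programmatic version, reusing the df built above. It assumes Spark 2.x, where last(columnName: String, ignoreNulls: Boolean) is available in org.apache.spark.sql.functions; note also that relying on orderBy before groupBy to fix which value last sees is not strictly guaranteed after a shuffle.

```scala
import org.apache.spark.sql.functions.last

// Every column except the grouping key and the ordering key.
val aggCols = df.columns.filterNot(Set("id", "scn"))

// One last(..., ignoreNulls = true) expression per remaining column,
// keeping the original column name via .as(c).
val aggExprs = aggCols.map(c => last(c, ignoreNulls = true).as(c))

// agg's signature is agg(expr: Column, exprs: Column*),
// so split the array into head plus varargs tail.
df.orderBy("scn")
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show
```

The head/tail split is needed because agg does not accept an empty expression list; if aggCols could be empty, guard for that case first.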