
I have the following DataFrame, but I cannot work out how to extract all the columns for the first row of a group.

+--------------------+------------+--------+
|           timestamp|nanos       |file_idx|
+--------------------+------------+--------+
|2018-09-07 05:00:...|    64044267|      1 |
|2018-09-07 05:00:...|    64044267|      2 |
|2018-09-07 05:00:...|    58789223|      3 |
+--------------------+------------+--------+

How do I extract the row with the biggest file_idx for the same timestamp and nanos? I've tried using groupBy, but it only returns the columns in my group-by clause, whereas this table actually contains 160 columns.

The desired outcome in the above example would be

+--------------------+------------+--------+
|           timestamp|nanos       |file_idx|
+--------------------+------------+--------+
|2018-09-07 05:00:...|    64044267|      2 |
|2018-09-07 05:00:...|    58789223|      3 |
+--------------------+------------+--------+

2 Answers


Use the window function row_number() with partitionBy("timestamp","nanos") and orderBy("file_idx") descending, then filter to keep only the highest-file_idx row from each window.

Example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

//sample data
df.show()

//+----------------+--------+--------+
//|       timestamp|   nanos|file_idx|
//+----------------+--------+--------+
//|2018-09-07 05:00|64044267|       1|
//|2018-09-07 05:00|64044267|       2|
//|2018-09-07 05:00|58789223|       3|
//+----------------+--------+--------+

val windowSpec = Window.partitionBy("timestamp","nanos").orderBy(desc("file_idx"))

df.withColumn("new_idx",row_number().over(windowSpec)).
filter(col("new_idx") ===1).
drop("new_idx").
show()

//+----------------+--------+--------+
//|       timestamp|   nanos|file_idx|
//+----------------+--------+--------+
//|2018-09-07 05:00|64044267|       2|
//|2018-09-07 05:00|58789223|       3|
//+----------------+--------+--------+
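The selection this window performs can be illustrated on plain Scala collections, with no Spark required (the Row case class and sample values here are just stand-ins for the DataFrame above): group by the (timestamp, nanos) key and keep the row with the maximum file_idx, which is exactly what filtering row_number() === 1 under a descending sort does.

```scala
// Illustration only -- not Spark code. Mimics:
//   row_number().over(Window.partitionBy(key).orderBy(desc("file_idx"))) === 1
case class Row(timestamp: String, nanos: Long, fileIdx: Int)

val rows = Seq(
  Row("2018-09-07 05:00", 64044267L, 1),
  Row("2018-09-07 05:00", 64044267L, 2),
  Row("2018-09-07 05:00", 58789223L, 3)
)

// Group by the partition key, then keep the row with the biggest file_idx.
val deduped = rows
  .groupBy(r => (r.timestamp, r.nanos))
  .values
  .map(_.maxBy(_.fileIdx))
  .toSeq
```

This keeps one row per (timestamp, nanos) pair, matching the desired output in the question.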

When you group on a column, you can logically only select the grouping columns themselves or aggregations over the other columns.

If you want to keep the remaining 160 columns of your dataset after grouping, note that each of them holds a set of values per group; if any value from that set will do, use the first() function on each of them:

select timestamp, first(col1), first(col2), first(col160) from Table group by timestamp;
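A caveat worth seeing concretely: first() just hands back one value from each group, so it does not guarantee the row with the biggest file_idx. Here is a plain-Scala illustration (illustrative types, not Spark) of that semantics — grouping by timestamp and taking the first row encountered per group:

```scala
// Illustration only -- not Spark code. first() per group means
// "some value from the group"; here we model it as the first row seen,
// which is not necessarily the one with the biggest file_idx.
case class Rec(timestamp: String, nanos: Long, fileIdx: Int)

val recs = Seq(
  Rec("2018-09-07 05:00", 64044267L, 1),
  Rec("2018-09-07 05:00", 64044267L, 2),
  Rec("2018-09-07 05:00", 58789223L, 3)
)

// Group by timestamp only (as in the SQL) and take the first row per group.
// Scala's groupBy preserves element order within each group, so head is
// the first occurrence.
val firstPerTimestamp = recs
  .groupBy(_.timestamp)
  .values
  .map(_.head)
  .toSeq
```

With the sample data all three rows share one timestamp, so this returns the file_idx 1 row — the window approach in the other answer is the one to use when you specifically need the highest file_idx.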
