
I have the following DataFrame, but I cannot work out how to extract all the columns for the first row of a group.

+--------------------+------------+--------+
|           timestamp|nanos       |file_idx|
+--------------------+------------+--------+
|2018-09-07 05:00:...|    64044267|      1 |
|2018-09-07 05:00:...|    64044267|      2 |
|2018-09-07 05:00:...|    58789223|      3 |
+--------------------+------------+--------+

How do I extract the row with the biggest file_idx for the same timestamp and nanos? I've tried using groupBy, but it only returns the columns in my group-by clause, whereas this table actually contains 160 columns.

The desired outcome in the above example would be

+--------------------+------------+--------+
|           timestamp|nanos       |file_idx|
+--------------------+------------+--------+
|2018-09-07 05:00:...|    64044267|      2 |
|2018-09-07 05:00:...|    58789223|      3 |
+--------------------+------------+--------+

2 Answers


Use the window function row_number() with partitionBy("timestamp","nanos") and orderBy("file_idx") descending, then filter to keep only the highest-file_idx row from each window.

Example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

//sample data
df.show()

//+----------------+--------+--------+
//|       timestamp|   nanos|file_idx|
//+----------------+--------+--------+
//|2018-09-07 05:00|64044267|       1|
//|2018-09-07 05:00|64044267|       2|
//|2018-09-07 05:00|58789223|       3|
//+----------------+--------+--------+

val windowSpec = Window.partitionBy("timestamp","nanos").orderBy(desc("file_idx"))

df.withColumn("new_idx",row_number().over(windowSpec)).
filter(col("new_idx") ===1).
drop("new_idx").
show()

//+----------------+--------+--------+
//|       timestamp|   nanos|file_idx|
//+----------------+--------+--------+
//|2018-09-07 05:00|64044267|       2|
//|2018-09-07 05:00|58789223|       3|
//+----------------+--------+--------+
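The selection this window performs can be illustrated on plain Scala collections, with no Spark required (the Row case class and sample values here are just stand-ins for the DataFrame above): group by the (timestamp, nanos) key and keep the row with the maximum file_idx, which is exactly what filtering row_number() === 1 under a descending sort does.

```scala
// Illustration only -- not Spark code. Mimics:
//   row_number().over(Window.partitionBy(key).orderBy(desc("file_idx"))) === 1
case class Row(timestamp: String, nanos: Long, fileIdx: Int)

val rows = Seq(
  Row("2018-09-07 05:00", 64044267L, 1),
  Row("2018-09-07 05:00", 64044267L, 2),
  Row("2018-09-07 05:00", 58789223L, 3)
)

// Group by the partition key, then keep the row with the biggest file_idx.
val deduped = rows
  .groupBy(r => (r.timestamp, r.nanos))
  .values
  .map(_.maxBy(_.fileIdx))
  .toSeq
```

This keeps one row per (timestamp, nanos) pair, matching the desired output in the question.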

When you group on a column, you can logically only select the grouping columns themselves or aggregations over the other columns.

If you want to keep the remaining 160 columns of your dataset after grouping, note that each of them holds a set of values per group; if any value from that set will do, use the first() function on each of them:

select timestamp, first(col1), first(col2), first(col160) from Table group by timestamp;
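A caveat worth seeing concretely: first() just hands back one value from each group, so it does not guarantee the row with the biggest file_idx. Here is a plain-Scala illustration (illustrative types, not Spark) of that semantics — grouping by timestamp and taking the first row encountered per group:

```scala
// Illustration only -- not Spark code. first() per group means
// "some value from the group"; here we model it as the first row seen,
// which is not necessarily the one with the biggest file_idx.
case class Rec(timestamp: String, nanos: Long, fileIdx: Int)

val recs = Seq(
  Rec("2018-09-07 05:00", 64044267L, 1),
  Rec("2018-09-07 05:00", 64044267L, 2),
  Rec("2018-09-07 05:00", 58789223L, 3)
)

// Group by timestamp only (as in the SQL) and take the first row per group.
// Scala's groupBy preserves element order within each group, so head is
// the first occurrence.
val firstPerTimestamp = recs
  .groupBy(_.timestamp)
  .values
  .map(_.head)
  .toSeq
```

With the sample data all three rows share one timestamp, so this returns the file_idx 1 row — the window approach in the other answer is the one to use when you specifically need the highest file_idx.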
