
If I understand correctly, using groupBy().agg(collect_list(column)) gives me a column of lists. How do I get the first and last elements of each list to create new columns (in a Spark Dataset, in Java)?

For the first item, I can do something like this:

.withColumn("firstItem", functions.col("list").getItem(0))

But how do I handle an empty list?

For the last item, I was thinking about size()-1, but in Java, -1 isn't supported in a Spark Dataset. I tried:

.withColumn("lastItem", functions.col("list").getItem(functions.size(functions.col("list")).minus(1)))

but it complains with an unsupported type error.

Alper t. Turker
Alex
    Using `groupBy` and `collect_list` will change the order of the items. It would be better to look into the window functions where you can use `orderBy` and the `first` and `last` methods. – Shaido Feb 08 '18 at 06:55

2 Answers

To answer your questions:

But how do I handle an empty list?

You don't need to handle it specially. Accessing a nonexistent index returns NULL (undefined), so there is no problem here.

If you want a default value instead, use org.apache.spark.sql.functions.coalesce together with org.apache.spark.sql.functions.lit.
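As a rough illustration of the coalesce/lit suggestion, here is a minimal sketch. The class name, method name, and placeholder argument are hypothetical, and it assumes the "list" column holds strings. It also relies on the out-of-range lookup returning NULL, which is the non-ANSI behavior; with ANSI mode enabled the lookup may raise an error instead.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

// Hypothetical helper class, for illustration only.
public class FirstItemWithDefault {

    // Adds "firstItem": element 0 of the array column "list", or the given
    // placeholder when that lookup yields NULL (for example, an empty list).
    // Assumes "list" is an array of strings.
    public static Dataset<Row> addFirstItem(Dataset<Row> df, String placeholder) {
        return df.withColumn(
                "firstItem",
                coalesce(col("list").getItem(0), lit(placeholder)));
    }
}
```

Note that coalesce also replaces a genuine NULL stored at index 0, so this cannot distinguish an empty list from a list whose first element is NULL.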

For the last item, I was thinking about size()-1, but in Java, -1 isn't supported

Use apply, not getItem:

import static org.apache.spark.sql.functions.*;

col("list").apply(size(col("list")).minus(lit(1)));

In practice:

Just use the min and max aggregate functions directly. Don't replicate groupByKey in SQL.
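A minimal sketch of that advice, assuming "first" and "last" mean the smallest and largest value in each group. The class name and the column names "key" and "value" are hypothetical placeholders, not taken from the question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

// Hypothetical helper class, for illustration only.
public class MinMaxPerGroup {

    // Aggregates each group straight to its smallest and largest "value",
    // without materializing an intermediate list via collect_list.
    // "key" and "value" are assumed column names.
    public static Dataset<Row> minMax(Dataset<Row> df) {
        return df.groupBy(col("key"))
                .agg(min(col("value")).alias("firstItem"),
                     max(col("value")).alias("lastItem"));
    }
}
```

If "first" and "last" refer to some other ordering (a timestamp, say), min and max on the value column alone will not give the intended rows.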

Related:

How to select the first row of each group?

Alper t. Turker

An empty list simply returns null rather than raising an error. For the last item, reverse the list and take the element at index 0:

import org.apache.spark.sql.functions._
withColumn("lastItem", reverse(col("list")).getItem(0))
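Since the question asks about Java, the same idea might look like the sketch below. The class and method names are hypothetical. It assumes a Spark version in which reverse accepts array columns and element_at is available (2.4 or later, as far as I know). The element_at variant is an alternative not mentioned above: a negative index counts from the end of the array.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.element_at;
import static org.apache.spark.sql.functions.reverse;

// Hypothetical helper class, for illustration only.
public class LastItem {

    // Reverses the array column "list" and takes element 0 of the result.
    public static Dataset<Row> lastViaReverse(Dataset<Row> df) {
        return df.withColumn("lastItem", reverse(col("list")).getItem(0));
    }

    // Alternative: element_at uses 1-based indexing, and -1 addresses the
    // last element, so the array does not need to be reversed.
    public static Dataset<Row> lastViaElementAt(Dataset<Row> df) {
        return df.withColumn("lastItem", element_at(col("list"), -1));
    }
}
```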
Pavindu