
When I group my DataFrame on a column and then try to find the minimum for each group with `groupbyDatafram.min('timestampCol')`, it appears I cannot do it on non-numeric columns. How can I properly get the minimum (earliest) date for each group?

I am streaming the DataFrame from a PostgreSQL S3 instance, so the data is already configured.
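
For context, here is a minimal sketch of the failing approach, using a toy DataFrame with hypothetical columns id and ts and assuming the SQLContext implicits are already in scope:

// Hypothetical reproduction: the numeric-only min helper on grouped data
// rejects a timestamp column.
val events = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

events.groupBy($"id").min("ts")
// fails with an AnalysisException along the lines of:
// "ts" is not a numeric column. Aggregation function can only be applied on a numeric column.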


1 Answer


Just perform the aggregation directly instead of using the min helper:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.min

val sqlContext: SQLContext = ???  // your existing SQLContext

import sqlContext.implicits._

// Two rows for the same id with different timestamps, cast from string.
val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

df.groupBy($"id").agg(min($"ts")).show

// +---+--------------------+
// | id|             min(ts)|
// +---+--------------------+
// |  1|2014-01-01 15:10:...|
// +---+--------------------+

Unlike the min helper, this works on any orderable type.
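
The same pattern also covers, for example, a string column (a quick sketch with hypothetical data), which the numeric-only helper would likewise reject:

val letters = Seq((1L, "b"), (1L, "a"), (2L, "c")).toDF("id", "letter")

letters.groupBy($"id").agg(min($"letter")).show

// +---+-----------+
// | id|min(letter)|
// +---+-----------+
// |  1|          a|
// |  2|          c|
// +---+-----------+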

    Try: `import org.apache.spark.sql.functions._` – David Griffin Apr 05 '16 at 13:29
  • Thanks. Real quick, and sorry, this is a different question, but I get org.apache.spark.sql.DataFrame = [MIN(ts): timestamp]. I thought it was supposed to add the column MIN(ts) to the groupBy object, not simply return the single column as a DataFrame. Do I need to join this somehow with the original groupBy object? – Jake Fund Apr 05 '16 at 13:38
  • Grouping columns are retained only in Spark 1.4+. See https://issues.apache.org/jira/browse/SPARK-7462. Before that you should include the required columns in the `agg` clause (for example `agg($"id", min($"ts"))`); see the sketch after these comments. – zero323 Apr 05 '16 at 13:40
  • A bit unrelated, but groupBy operations are expensive. Do you think there is an alternative that avoids groupBy? http://stackoverflow.com/questions/39889874/groupby-operation-of-dataframe-takes-lot-of-time-in-spark-2-0 – Mohammad Adnan Oct 06 '16 at 17:12
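
Regarding the comment exchange above about getting the grouping column back: a minimal sketch, assuming Spark 1.4+ (where the grouping column is retained) and the df from the answer, that joins the per-id minimum back to the original rows (min_ts is a hypothetical alias):

// Alias the aggregate so it can be referenced after the join, then keep only
// the rows whose timestamp matches the per-id minimum.
val minTs = df.groupBy($"id").agg(min($"ts").alias("min_ts"))

df.join(minTs, "id").where($"ts" === $"min_ts").show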