
I have the data in the dataframe as below:

datetime              | userId | memberId | value
2016-04-06 16:36:...  | 1234   | 111      | 1
2016-04-06 17:35:...  | 1234   | 222      | 5
2016-04-06 17:50:...  | 1234   | 111      | 8
2016-04-06 18:36:...  | 1234   | 222      | 9
2016-04-05 16:36:...  | 4567   | 111      | 1
2016-04-06 17:35:...  | 4567   | 222      | 5
2016-04-06 18:50:...  | 4567   | 111      | 8
2016-04-06 19:36:...  | 4567   | 222      | 9

I need to find the max(datetime) grouped by userId and memberId. When I tried the following:

df2 = df.groupBy('userId','memberId').max('datetime')

I'm getting this error:

org.apache.spark.sql.AnalysisException: "datetime" is not a numeric
column. Aggregation function can only be applied on a numeric column.;

The output I want is as follows:

userId | memberId | datetime
1234   |  111     | 2016-04-06 17:50:...
1234   |  222     | 2016-04-06 18:36:...
4567   |  111     | 2016-04-06 18:50:...
4567   |  222     | 2016-04-06 19:36:...

Can someone please help me get the max date from the given data using PySpark DataFrames?

Bhuvan

1 Answer


For non-numeric but Orderable types you can use agg with max directly:

from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

# Cast the string column to a timestamp, then take the latest value per group
(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime"))
    .show())

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
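
As a side note, the same aggregation can be written without renaming `max` at import time by importing the functions module under a namespace. A minimal sketch, assuming a `SparkSession` is available as `spark`, which also computes the per-group minimum:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
], ["datetime", "userId", "memberId", "value"])

(df.withColumn("datetime", F.col("datetime").cast("timestamp"))  # string -> timestamp
   .groupBy("userId", "memberId")
   .agg(F.max("datetime").alias("max_datetime"),   # latest timestamp per group
        F.min("datetime").alias("min_datetime"))   # earliest timestamp per group
   .show())

Qualifying the calls as `F.max` / `F.min` avoids shadowing Python's built-in `max`, which is what the `max as max_` import above works around.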
zero323
  • How can I calculate the max(date) and min(date) of a date column? The column type is DateType(). Thanks! @zero323 – Elsa Li Apr 27 '18 at 19:15
  • @ElsaLi `from pyspark.sql.functions import min as min_; df.withColumn(...).groupBy(...).agg(min_("datetime"), max_("datetime"))` – 10465355 Feb 13 '19 at 12:47
  • Is there a reason why we need to use the `_` instead of just `max('datetime')`? – thentangler May 18 '20 at 03:56
  • @thentangler `max_` is an alias of the `max` function from pyspark itself (see the import statement in the code). If you just did `max('datetime')`, it would use the built-in Python function `max`, which was not made to work with columns. – Lennart Sep 30 '20 at 12:23
  • @zero323 , @Lennart : why did it remove the column `value` from the original df? – BigDataLearner Jun 01 '21 at 09:41
  • @zero323 , @Lennart : How to keep the column `value` from the original df? – BigDataLearner Jun 01 '21 at 09:52
  • @BigDataLearner Also ran into this issue. Think that here it's explained https://stackoverflow.com/questions/34409875/how-to-get-other-columns-when-using-spark-dataframe-groupby – user2177768 Jun 01 '21 at 19:13
  • @user2177768 Thanks for pointing it out. Is it possible to retain the column `value` without doing a join with the original df? – BigDataLearner Jun 02 '21 at 01:09
  • @BigDataLearner did you find your answer? – Mahsa Seifikar Feb 23 '23 at 19:29
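
Regarding the comments above about keeping the `value` column: a grouped aggregation returns only the grouping keys and the aggregated values, so the other columns are dropped. One common pattern from the linked question, sketched here under the assumption that the full DataFrame from the question is available as `df`, is to rank the rows inside each group with a window function and keep only the latest row:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Order rows within each (userId, memberId) group by datetime, newest first
w = Window.partitionBy("userId", "memberId").orderBy(F.col("datetime").desc())

latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)   # keep only the newest row per group
            .drop("rn"))                # drop the helper ranking column

latest.show()

Because no aggregation is involved, every original column, including `value`, is retained in the result.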