I have a Spark DataFrame (articleDF1), shown below. I am trying to add two columns, Start Date and End Date, derived from the Date column, and to group the resulting DataFrame by post_evar10. The final DataFrame should contain post_evar10, Start Date, and End Date.
+----------+--------------------+
|      Date|         post_evar10|
+----------+--------------------+
|2019-09-02|www:/espanol/recu...|
|2019-09-02|www:/caregiving/h...|
|2019-12-15|www:/health/condi...|
|2019-09-01|www:/caregiving/h...|
|2019-08-31|www:/travel/trave...|
|2020-01-20|www:/home-family/...|
+----------+--------------------+
What I have tried:
from pyspark.sql import functions as f
articleDF3 = (articleDF1
    .withColumn('Start_Date', f.min(f.col('Date')))
    .withColumn('End_Date', f.max(f.col('Date')))
    .groupBy(f.col("post_evar10"))
    .drop("Date"))
I am getting this error:
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'temp.ms_article_lifespan_final.Date' is not an aggregate function. Wrap '(min(temp.ms_article_lifespan_final.Date) AS Start_Date)' in windowing function(s) or wrap 'temp.ms_article_lifespan_final.Date' in first() (or first_value) if you don't care which value you get.;;
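From the error it sounds like min/max have to be computed inside an aggregation (or a window function), not in a plain withColumn. Is something like the sketch below the right direction? It assumes I only need one row per post_evar10 with the earliest and latest Date, using groupBy().agg():

from pyspark.sql import functions as f

# Sketch: one row per post_evar10 with its earliest and latest Date
articleDF3 = (
    articleDF1
    .groupBy("post_evar10")
    .agg(
        f.min("Date").alias("Start_Date"),
        f.max("Date").alias("End_Date"),
    )
)

Or, if I needed to keep every original row, I suppose the windowing route the error suggests (min/max over Window.partitionBy("post_evar10")) would apply instead, but for my case the grouped output above is what I am after.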