
These are the values of my dataframe:

+-------+----------+
|     ID| Date_Desc|
+-------+----------+
|8951354|2012-12-31|
|8951141|2012-12-31|
|8952745|2012-12-31|
|8952223|2012-12-31|
|8951608|2012-12-31|
|8950793|2012-12-31|
|8950760|2012-12-31|
|8951611|2012-12-31|
|8951802|2012-12-31|
|8950706|2012-12-31|
|8951585|2012-12-31|
|8951230|2012-12-31|
|8955530|2012-12-31|
|8950570|2012-12-31|
|8954231|2012-12-31|
|8950703|2012-12-31|
|8954418|2012-12-31|
|8951685|2012-12-31|
|8950586|2012-12-31|
|8951367|2012-12-31|
+-------+----------+

I tried to compute the median (and other percentiles) of the ID column for each date in PySpark:

import pyspark.sql.functions as f

df1 = df1.groupby('Date_Desc').agg(
    f.expr('percentile(ID, array(0.25))')[0].alias('%25'),
    f.expr('percentile(ID, array(0.50))')[0].alias('%50'),
    f.expr('percentile(ID, array(0.75))')[0].alias('%75'))

But I get this error:

Py4JJavaError: An error occurred while calling o198.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 29.0 failed 1 times, most recent failure: Lost task 1.0 in stage 29.0 (TID 427, 5bddc801333f, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '11/23/04 9:00' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

  • Does this answer your question? [to_date fails to parse date in Spark 3.0](https://stackoverflow.com/questions/62943941/to-date-fails-to-parse-date-in-spark-3-0) – mck Mar 17 '21 at 13:54
  • The error could not be reproduced with the provided sample data and code snippet. It's probably coming from some date parsing before this transformation. You can refer to the linked question. – blackbishop Mar 17 '21 at 15:21
  • Thank you very much, I was able to successfully change the date column after following the procedure in the link above. – michelfelippin Mar 18 '21 at 01:17

1 Answer


With Spark ≥ 3.1.0:

from pyspark.sql.functions import percentile_approx

df1.groupBy("Date_Desc").agg(percentile_approx("ID", 0.5).alias("%50"))