
These are the values of my dataframe:

+-------+----------+
|     ID| Date_Desc|
+-------+----------+
|8951354|2012-12-31|
|8951141|2012-12-31|
|8952745|2012-12-31|
|8952223|2012-12-31|
|8951608|2012-12-31|
|8950793|2012-12-31|
|8950760|2012-12-31|
|8951611|2012-12-31|
|8951802|2012-12-31|
|8950706|2012-12-31|
|8951585|2012-12-31|
|8951230|2012-12-31|
|8955530|2012-12-31|
|8950570|2012-12-31|
|8954231|2012-12-31|
|8950703|2012-12-31|
|8954418|2012-12-31|
|8951685|2012-12-31|
|8950586|2012-12-31|
|8951367|2012-12-31|
+-------+----------+

I tried to compute the median (and other percentiles) of the ID column for each date in PySpark:

import pyspark.sql.functions as f

df1 = df1.groupby('Date_Desc').agg(
    f.expr('percentile(ID, array(0.25))')[0].alias('%25'),
    f.expr('percentile(ID, array(0.50))')[0].alias('%50'),
    f.expr('percentile(ID, array(0.75))')[0].alias('%75'))

But I get this error:

Py4JJavaError: An error occurred while calling o198.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 29.0 failed 1 times, most recent failure: Lost task 1.0 in stage 29.0 (TID 427, 5bddc801333f, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '11/23/04 9:00' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

  • Does this answer your question? [to_date fails to parse date in Spark 3.0](https://stackoverflow.com/questions/62943941/to-date-fails-to-parse-date-in-spark-3-0) – mck Mar 17 '21 at 13:54
  • The error could not be reproduced with the provided sample data and code snippet. It's probably coming from some date parsing before this transformation. You can refer to the linked question. – blackbishop Mar 17 '21 at 15:21
  • Thank you very much, I was able to successfully change the date column after following the procedure in the link above. – michelfelippin Mar 18 '21 at 01:17

1 Answer


With Spark ≥ 3.1.0:

from pyspark.sql.functions import percentile_approx

df1.groupBy("Date_Desc").agg(percentile_approx("ID", 0.5).alias("%50"))