
I am currently using PySpark to analyze some data. I have a CSV file with payroll data in it, and I want to know which job pays best. To do that I need a typical pay figure for each job title, ideally the median rather than the plain average.

The methods available on a groupBy in PySpark are these: agg, avg, count, max, mean, min, pivot, sum

When I try the .mean() method it looks like this:

mean_pay_data = reduced_data.groupBy("JOB_TITLE").mean("REGULAR_PAY")
mean_pay_data.show(3)

# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+

Here is what it looks like with the .avg() method:

average_pay_data = reduced_data.groupBy("JOB_TITLE").avg("REGULAR_PAY")
average_pay_data.show(3)

# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+

They return the exact same values. What's the difference between mean() and avg()?

I also want to find the median, so that one person doesn't have too much of an impact. Since there is no median() method in PySpark I don't know what to do here.
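To illustrate why the median matters here, this is a small sketch in plain Python with made-up pay figures (not values from the payroll file). A single very high earner pulls the mean far above what most people in the group earn, while the median barely moves:

```python
from statistics import mean, median

# Hypothetical pay values for one job title; the last one is an outlier.
pays = [40_000, 42_000, 45_000, 47_000, 400_000]

mean_pay = mean(pays)      # dominated by the single 400,000 entry
median_pay = median(pays)  # middle value of the sorted list

print(f"mean:   {mean_pay}")
print(f"median: {median_pay}")
```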

  • There *is* a way to get the median in PySpark, see [`percentile_approx`](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/api/pyspark.sql.functions.percentile_approx.html#pyspark-sql-functions-percentile-approx). As for `mean` and `avg`, they're the same, see the [func list](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/functions.html). – samkart Oct 11 '22 at 09:57

1 Answer


The documentation for both avg and mean says:

mean() is an alias for avg()

The two functions are identical. Both names exist so that developers coming from different backgrounds feel comfortable.

Regarding the median:

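  • Approximate (efficient) median: F.expr('percentile_approx(col_name, .5) over()')

  • Exact (less efficient) median: F.expr('percentile(col_name, .5) over()')

Here F is pyspark.sql.functions and col_name is a placeholder for your column. With over() the expression is a window function that computes the median across the whole DataFrame. To get one median per group, pass the same expression without over() to groupBy(...).agg(...).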
