Pyspark - Return selected columns after groupBy. and agg

Question

I ran this code below, and it return a single value for salary. However, I was hoping it would also return other columns associated with the max value, so I know which categories contain the maximum value over the dataset.

df.groupBy().max('Salary').show()

I tried df.select('Company','Employee','Salary').groupBy().max('Salary').show(), but the result does not output Company and Employee.

What you do is a direct aggregation `df.agg(F.max("value")).show()` why do you use groupby? , please be clearer as to what you intend to do. Refer to [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) and edit your question body — anky, Aug 18 '20 at 15:57
I am almost sure you need a window+ max and a filter here if I am not wrong, but this example is not reproducible and minimalist — anky, Aug 18 '20 at 16:25
@anky I tried using your method, but it still showing one column for the maximum value. What I am trying to achieve is extracting the maximum value with other columns associated with it. Apologies, I am quite new to this, so i will try do better next time. — user147271, Aug 18 '20 at 17:20
@user147271 is this what you're looking for? [GroupBy column and filter rows with maximum value in Pyspark](https://stackoverflow.com/questions/48829993/groupby-column-and-filter-rows-with-maximum-value-in-pyspark). Please [don't post pictures of code or data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question?noredirect=1). — pault, Aug 18 '20 at 17:31

score 0 · Answer 1 · answered Aug 18 '20 at 21:53

Option1: Using window function for this case and filtering out only the max value.

df.show()
#+----+----+-----+
#|col1|col2|value|
#+----+----+-----+
#|   1|   a|   10|
#|   2|   b|   20|
#+----+----+-----+

from pyspark.sql.functions import *
from pyspark.sql import *
import sys

w=Window.orderBy("value").rowsBetween(-sys.maxsize,sys.maxsize)

df.withColumn("mx",max(col("value")).over(w)).\
filter(expr('value == mx')).\
drop("mx").\
show()

#+----+----+-----+
#|col1|col2|value|
#+----+----+-----+
#|   2|   b|   20|
#+----+----+-----+

Option2: Without using window function.

#get the max value then store into variable
max_val=df.agg(max("value")).collect()[0][0]

df.filter(col("value") == max_val).show()
#+----+----+-----+
#|col1|col2|value|
#+----+----+-----+
#|   2|   b|   20|
#+----+----+-----+

Pyspark - Return selected columns after groupBy. and agg

1 Answers1