0

I ran this code below, and it return a single value for salary. However, I was hoping it would also return other columns associated with the max value, so I know which categories contain the maximum value over the dataset.

df.groupBy().max('Salary').show() 

I tried df.select('Company','Employee','Salary').groupBy().max('Salary').show(), but the result does not output Company and Employee. enter image description here

user147271
  • 145
  • 2
  • 6
  • 2
    What you do is a direct aggregation `df.agg(F.max("value")).show()` why do you use groupby? , please be clearer as to what you intend to do. Refer to [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) and edit your question body – anky Aug 18 '20 at 15:57
  • select after groupby and max. – Lamanus Aug 18 '20 at 16:21
  • I am almost sure you need a window+ max and a filter here if I am not wrong, but this example is not reproducible and minimalist – anky Aug 18 '20 at 16:25
  • @anky I tried using your method, but it still showing one column for the maximum value. What I am trying to achieve is extracting the maximum value with other columns associated with it. Apologies, I am quite new to this, so i will try do better next time. – user147271 Aug 18 '20 at 17:20
  • 1
    @user147271 is this what you're looking for? [GroupBy column and filter rows with maximum value in Pyspark](https://stackoverflow.com/questions/48829993/groupby-column-and-filter-rows-with-maximum-value-in-pyspark). Please [don't post pictures of code or data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question?noredirect=1). – pault Aug 18 '20 at 17:31

1 Answers1

0

Option1: Using window function for this case and filtering out only the max value.

df.show()
#+----+----+-----+
#|col1|col2|value|
#+----+----+-----+
#|   1|   a|   10|
#|   2|   b|   20|
#+----+----+-----+

from pyspark.sql.functions import *
from pyspark.sql import *
import sys

w=Window.orderBy("value").rowsBetween(-sys.maxsize,sys.maxsize)

df.withColumn("mx",max(col("value")).over(w)).\
filter(expr('value == mx')).\
drop("mx").\
show()

#+----+----+-----+
#|col1|col2|value|
#+----+----+-----+
#|   2|   b|   20|
#+----+----+-----+

Option2: Without using window function.

#get the max value then store into variable
max_val=df.agg(max("value")).collect()[0][0]

df.filter(col("value") == max_val).show()
#+----+----+-----+
#|col1|col2|value|
#+----+----+-----+
#|   2|   b|   20|
#+----+----+-----+
notNull
  • 30,258
  • 4
  • 35
  • 50