1

I used the code below to sort based on one column. I am wondering how I can get the first element and the last element in the sorted dataframe?

group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count"))
kevin
    there's `pyspark.sql.functions.min` and `pyspark.sql.functions.max` as well as `pyspark.sql.functions.first` and `pyspark.sql.functions.last`. It would be helpful if you could provide a small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Aug 22 '19 at 14:17
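A minimal sketch of what the comment suggests, assuming the `group_by_dataframe` and filter from the question; a global `agg` returns the smallest and largest counts without any sorting:

from pyspark.sql import functions as F

# Sketch: aggregate over the whole filtered DataFrame;
# min/max give the smallest and largest counts directly.
counts = group_by_dataframe.count().filter("`count` >= 10")
counts.agg(F.min("count").alias("min_count"),
           F.max("count").alias("max_count")).show()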

1 Answer

1

The max and min functions need a group to work with. To circumvent this, you can create a dummy column as below, then call max and min on it to get the maximum and minimum values.

If that's all you need, you don't really need sort here.

from pyspark.sql import functions as F
df = spark.createDataFrame([("a", 0.694), ("b", -2.669), ("a", 0.245), ("a", 0.1), ("b", 0.3), ("c", 0.3)], ["n", "val"])
df.show()

+---+------+
|  n|   val|
+---+------+
|  a| 0.694|
|  b|-2.669|
|  a| 0.245|
|  a|   0.1|
|  b|   0.3|
|  c|   0.3|
+---+------+


df = df.groupby('n').count() #.sort(F.desc('count'))
df = df.withColumn('dummy', F.lit(1))
df.show()

+---+-----+-----+
|  n|count|dummy|
+---+-----+-----+
|  c|    1|    1|
|  b|    2|    1|
|  a|    3|    1|
+---+-----+-----+


df = df.groupBy('dummy').agg(F.min('count').alias('min'), F.max('count').alias('max')).drop('dummy')
df.show()

+---+---+
|min|max|
+---+---+
|  1|  3|
+---+---+
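If it's the first and last rows of the sorted DataFrame you're after rather than just the values, one possible sketch (reusing the example data above; `first()` returns the head row of a DataFrame):

from pyspark.sql import functions as F

# Sketch: rebuild the grouped counts from the example data above,
# then take the head of the descending and ascending sorts.
counts = spark.createDataFrame(
    [("a", 0.694), ("b", -2.669), ("a", 0.245), ("a", 0.1), ("b", 0.3), ("c", 0.3)],
    ["n", "val"]).groupby('n').count()

first_row = counts.sort(F.desc('count')).first()  # Row(n='a', count=3)
last_row = counts.sort(F.asc('count')).first()    # Row(n='c', count=1)
print(first_row, last_row)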

niuer