pyspark dataframe get second lowest value for each row

Question

I would like to query, if anyone has an idea, how to get the second lowest value in a row of Dataframe in pyspark.

For example:

Input Dataframe:

Col1  Col2  Col3  Col4 
83    32    14    62   
63    32    74    55   
13    88     6    46

Expected output:

Col1  Col2  Col3  Col4 Res
83    32    14    62   32   
63    32    74    55   55   
13    88     6    46   13

https://stackoverflow.com/a/43188621/797495 – Pedro Lobito Mar 02 '20 at 20:54 — Pedro Lobito, Mar 02 '20 at 20:54

notNull · Accepted Answer · 2020-03-03T02:33:47.683

We can use concat_ws function to concat all columns for the row then use split to create an array.

use array_sort function to sort with in the array and extract second element[1] of the array.

Example:

from pyspark.sql.functions import *

df=spark.createDataFrame([('83','32','14','62'),('63','32','74','55'),('13','88','6','46')],['Col1','Col2','Col3','Col4'])

df.selectExpr("array_sort(split(concat_ws(',',Col1,Col2,Col3,Col4),','))[1] Res").show()

#+---+
#|Res|
#+---+
#|32 |
#|55 |
#|13 |
#+---+

More Dynamic Way:

df.selectExpr("array_sort(split(concat_ws(',',*),','))[1]").show()

#+---+
#|Res|
#+---+
#|32 |
#|55 |
#|13 |
#+---+

EDIT:

#adding Res column to the dataframe
df1=df.selectExpr("*","array_sort(split(concat_ws(',',*),','))[1] Res")
df1.show()

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|Res|
#+----+----+----+----+---+
#|  83|  32|  14|  62| 32|
#|  63|  32|  74|  55| 55|
#|  13|  88|   6|  46| 46|
#+----+----+----+----+---+

awesome! do you know how to append new column Res to the dataframe — Drachens, Mar 03 '20 at 02:17
@Drachens, Sure!, please check my Updated answer **`EDIT`** section! — notNull, Mar 03 '20 at 02:34
You don't need to concatenate the columns and then split to get an array. You can simply use `array(*cols)`. — blackbishop, Mar 03 '20 at 12:52

blackbishop · Answer 2 · 2020-03-03T12:46:40.317

You can create an array column using array function and then sort it using array_sort. Finally, get the second element using element_at. These 2 last functions are available from Spark 2.4+.

df.withColumn("res", element_at(array_sort(array(*[col(c) for c in df.columns])), 2))\
  .show()

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|res|
#+----+----+----+----+---+
#|83  |32  |14  |62  |32 |
#|63  |32  |74  |55  |55 |
#|13  |88  |6   |46  |13 |
#+----+----+----+----+---+

Another way of doing is to use least function. First, calculate the min value from all columns then calculate another time the least from values that are greater than min using when expression:

df.withColumn("min", least(*[col(c) for c in df.columns]))\
  .withColumn("res", least(*[when(col(c) > col("min"), col(c)) for c in df.columns]))\
  .drop("min")\
  .show()

pyspark dataframe get second lowest value for each row

2 Answers2