
I have a PySpark dataframe:

x1  x2
12   4
 8   5
13   2
I would like to cap x1 at 10 for the rows where x2 < 5, i.e. something like:

if x2 < 5:
  if x1 > 10:
    x1 = 10

How could I do that in PySpark?

Many thanks

mommomonthewind
  • Possible duplicate of [PySpark: withColumn() with two conditions and three outcomes](https://stackoverflow.com/questions/40161879/pyspark-withcolumn-with-two-conditions-and-three-outcomes) – pault Apr 22 '19 at 13:39

1 Answer


This is the base logic, built on `when`/`otherwise`:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when

from pyspark.sql.functions import when

df = spark.createDataFrame([(12, 4), (8, 5), (13, 2)]).toDF("x1", "x2")

# Illustrates chaining when/otherwise; note this is not yet the
# capping logic the question asks for (see the null in the output).
df\
.withColumn("logic", when(df.x2 < 5, 10)\
            .otherwise(when(df.x1 > 10, 10)))\
.show()

+---+---+-----+
| x1| x2|logic|
+---+---+-----+
| 12|  4|   10|
|  8|  5| null|
| 13|  2|   10|
+---+---+-----+

The logic that matches the question, capping only when both conditions hold:

from pyspark.sql.functions import when, lit

# Cap x1 at 10 when x2 < 5 and x1 > 10; otherwise keep x1 unchanged.
df\
.withColumn("logic", when((df.x2 < 5) & (df.x1 > 10), lit(10))\
            .otherwise(df.x1))\
.show()

+---+---+-----+
| x1| x2|logic|
+---+---+-----+
| 12|  4|   10|
|  8|  5|    8|
| 13|  2|   10|
+---+---+-----+


thePurplePython
  • I think your logic is incorrect. it should be `when((df.x2 < 5) & (df.x1 > 10), lit(10)).otherwise(df.x1)` or `when(df.x2 < 5, least(df.x1, lit(10))).otherwise(df.x1)` (after importing `lit`, `when`, and `least` from `pyspark.sql.functions`) – pault Apr 22 '19 at 18:46
  • ok thanks i updated it ... the original question wasn't clear to me – thePurplePython Apr 22 '19 at 19:39