I am trying to fill a new column with Y or N. To do that, I check two columns: if either of them has a True value, I put Y in the new column; otherwise I put N.

For example, I have this dataframe:

+------------+-------+-------+-------------------+-------+-------+-------------------+
| Date       | Col1  | Col2  | ChangeinCol1_Col2 | Col3  | Col4  | ChangeinCol3_Col4 |
+------------+-------+-------+-------------------+-------+-------+-------------------+
| 2020-12-14 | True  | False | Y                 | False | False | N                 |
| 2020-12-14 | False | False | Y                 | False | False | N                 |
+------------+-------+-------+-------------------+-------+-------+-------------------+

If there is a True in Col1 or Col2, the column ChangeinCol1_Col2 will be Y. The same applies to ChangeinCol3_Col4, but here it is N because there are no changes in Col3 and Col4.

How can I do this with Apache Spark in Scala? I am trying to create the new column with df.withColumn, but I don't know how to check the values in the columns.

MLstudent

1 Answer

You can use `when` together with `max` over a window:

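import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, max, when}
// Outside spark-shell, also `import spark.implicits._` for the $"..." syntax.

// Empty window: every row sees all rows, so max(...) is true if any row is true.
val w = Window.orderBy()

val df2 = df.withColumn(
    "ChangeinCol1_Col2",
    when(max($"Col1").over(w) || max($"Col2").over(w), lit("Y")).otherwise(lit("N"))
).withColumn(
    "ChangeinCol3_Col4",
    when(max($"Col3").over(w) || max($"Col4").over(w), lit("Y")).otherwise(lit("N"))
)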
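
The window above has no partitioning, so Spark moves all rows to a single partition to evaluate it (it logs a warning about this). As a possible alternative, and only as a sketch that is not part of the original answer, the same flag can be computed with one global aggregation and then attached to every row with a cross join. The names `ChangeFlags` and `addChangeFlag` are hypothetical, the columns are assumed to be of boolean type, and `cols` is assumed to be non-empty.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, max, when}

// Hypothetical helper object, for illustration only.
object ChangeFlags {
  // Adds a column `flagName` that is "Y" on every row if any row has a
  // true value in any of `cols`, and "N" otherwise.
  // Assumes `cols` is non-empty and every listed column is boolean.
  def addChangeFlag(df: DataFrame, flagName: String, cols: Seq[String]): DataFrame = {
    // max over a boolean column is true if at least one row is true.
    val anyTrue = cols.map(c => max(col(c))).reduce(_ || _)

    // One-row DataFrame holding the dataset-wide flag.
    val flag = df.agg(when(anyTrue, lit("Y")).otherwise(lit("N")).as(flagName))

    // Attach the single flag value to every row.
    df.crossJoin(flag)
  }
}
```

With the example data, `ChangeFlags.addChangeFlag(df, "ChangeinCol1_Col2", Seq("Col1", "Col2"))` should give Y on both rows, and the same call for Col3 and Col4 should give N. Whether this is actually faster than the window version depends on the data size and cluster, so it is worth measuring on your own data.
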
mck
  • But what I have to check is whether the column is True or False: if it has a True, I put Y; if the column has only False values, I put N. – MLstudent Feb 09 '21 at 12:23
  • @MLstudent Are the columns of boolean type? If so, you can use them directly as the condition. – mck Feb 09 '21 at 12:23
  • By the way, thanks a lot for your help. This way I only compare the values within each row. I need to check whether, for example, Col1 or Col2 has a True in any row; if so, the whole new column (here ChangeinCol1_Col2) should be Y, otherwise N. – MLstudent Feb 09 '21 at 13:13
  • @MLstudent Then you can use `max` instead. See the edited answer. – mck Feb 09 '21 at 13:17
  • Wow, it's working! You have helped me a lot, and not for the first time: you helped me with my last question too. Could you explain a bit about the performance of max and over here? And if you can, would you recommend a Spark course or something similar to learn this? Thanks a lot again! – MLstudent Feb 09 '21 at 13:30