I am using Spark and Scala for learning purpose. I came around a situation wherein I need to compare the validity of records present in one of the columns of spark dataframe. This is how I created one dataframe, "dataframe1":
import sparkSession.implicits._
val dataframe1 = Seq("AB","BC","CD","DA","AB","BC").toDF("col1")
dataframe1:
+----+
|col1|
+----+
| AB|
| BC|
| CD|
| DA|
| AB|
| BC|
+----+
The validity of records depends on the condition if the record is "AB" or "BC". Here is my first attempt:
val dataframe2 = dataframe1.withColumn("col2", when('col1.contains("AB") or 'col1.contains("BC"), "valid").otherwise("invalid"))
dataframe2:
+----+-------+
|col1| col2|
+----+-------+
| AB| valid|
| BC| valid|
| CD|invalid|
| DA|invalid|
| AB| valid|
| BC| valid|
+----+-------+
But I don't think this is a good way of doing because if I need to add more valid records then I need to add conditions in "when" clause which will increase the code length and disturbs the code readability.
So I tried to put all the valid records in one list and check if the record string is present in the list. If it is present then it is a valid record otherwise not. Here is the code snippet for this trial:
val validRecList = Seq("AB", "BC").toList
val dataframe3 = dataframe1.withColumn("col2", if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid"))
But somehow it is not working as expected, as the result of this is:
+----+-------+
|col1| col2|
+----+-------+
| AB|invalid|
| BC|invalid|
| CD|invalid|
| DA|invalid|
| AB|invalid|
| BC|invalid|
+----+-------+
Can anybody tell me what mistake am I doing here? And, any other generic suggestion for such a scenario. Thank you.