1

I'm trying to create a flag in my dataset based on 2 conditions, the first is simple. Does CheckingCol = CheckingCol2.

The second is more complicated. I have a column called TranID and a column called RevID.

For nay row if RevID is in TranID AND CheckingCol = CheckingCol2 then the flag should return "Yes". Otherwise the flag should return "No".

My data looks like this:

TranID   RevID   CheckingCol  CheckingCol2
1        2       ABC          ABC
2        1       ABC          ABC
3        6       ABCDE        ABCDE
4        3       ABCDE        ABC
5        7       ABCDE        ABC

The expected result would be:

TranID   RevID   CheckingCol  CheckingCol2  Flag
1        2       ABC          ABC           Yes
2        1       ABC          ABC           Yes
3        6       ABCDE        ABCDE         No
4        3       ABCDE        ABC           No
5        7       ABCDE        ABC           No

I've tried using:

df.withColumn("TotalMatch", when((col("RevID").contains(col("TranID"))) & (col("CheckingColumn") == col("CheckingColumn2")), "Yes").otherwise("No"))

But it didn't work, and I've not been able to find anything online about how to do this.

Any help would be great!

Greggary
  • 25
  • 4
  • `F.lit("Yes")` instead of `"Yes"` and `F.lit("No")` instead of `"No"` – murtihash Apr 17 '20 at 16:52
  • @pault contains does work, but i realized the problem is that OP has a parent-child type of problem, between `TranID` and `RevID`, thats why first 2 rows are flagged as `Yes` – murtihash Apr 17 '20 at 20:31
  • @pault therefore, in my humble opinion, I dont think its a `duplicate` – murtihash Apr 17 '20 at 20:34

1 Answers1

0

Obtain the unique values as array from the TranID column, then check for the RevID from that array using isIn() function

from pyspark.sql import functions as sf
unique_values = df1.agg(sf.collect_set("TranID").alias("uniqueIDs"))
unique_values.show()
+---------------+
|       uniqueIDs|
+---------------+
|[3, 1, 2, 5, 4]|
+---------------+

required_array = unique_values.take(1)[0].uniqueIDs
['3', '1', '2', '5', '4']

df2 = df1.withColumn("Flag", sf.when( (sf.col("RevID").isin(required_array) & (sf.col("CheckingCol") ==sf.col("CheckingCol2")) ) , "Yes").otherwise("No"))

Note: Check for the nulls and NoneType values in both RevID and TranID columns since they will affect the results

HArdRe537
  • 116
  • 5