I have a DataFrame created from joining three other dataframes together that I need to do many sum case statements to. I found the withcolumn() function may be my solution but when I do a comparison on a column that has a char I get two wildly different numbers when I use either .contains() or '=='.
The below code shows how both are being used- Here is the code with .contains:
test_data.withColumn('NewCol', F.when((test_data.col1 == 'str') & ((test_data.col2.contains('X')) | (test_data.col2.contains('Y'))) ,1).otherwise(0)) \
.groupBy('KEY') \
.agg(F.sum('NewCol').alias('NewCol))
The resulting count of this is 127470 rows > 0
And here is the code with the comparator:
test_data.withColumn('NewCol', F.when((test_data.col1 == 'str') & ((test_data.col2 == 'X') | (test_data.col2 == 'Y')) ,1).otherwise(0)) \
.groupBy('KEY') \
.agg(F.sum('NewCol').alias('NewCol))
and the resulting count of this is 531645 rows > 0.
As you can see the number of rows that have a 1 is much larger when using the comparator vs .contains(). Is .contains() filtering before doing the comparison or is there something else causing the wildly different numbers?