-3

I have a DataFrame created from joining three other dataframes together that I need to do many sum case statements to. I found the withcolumn() function may be my solution but when I do a comparison on a column that has a char I get two wildly different numbers when I use either .contains() or '=='.

The below code shows how both are being used- Here is the code with .contains:

test_data.withColumn('NewCol', F.when((test_data.col1 == 'str') & ((test_data.col2.contains('X')) | (test_data.col2.contains('Y'))) ,1).otherwise(0)) \
                .groupBy('KEY') \
                .agg(F.sum('NewCol').alias('NewCol))

The resulting count of this is 127470 rows > 0

And here is the code with the comparator:

test_data.withColumn('NewCol', F.when((test_data.col1 == 'str') & ((test_data.col2 == 'X') | (test_data.col2 == 'Y')) ,1).otherwise(0)) \
                .groupBy('KEY') \
                .agg(F.sum('NewCol').alias('NewCol))

and the resulting count of this is 531645 rows > 0.

As you can see the number of rows that have a 1 is much larger when using the comparator vs .contains(). Is .contains() filtering before doing the comparison or is there something else causing the wildly different numbers?

  • 1
    `==` is an exact match of the entire column, `contains()` matches a substring. – Barmar May 09 '19 at 18:40
  • Your parenthesis grouping is not the same in the two queries, that explains the big difference. – Barmar May 09 '19 at 18:43
  • Can you edit or tell me where they are different? Because I do not see where the differ between the two. The two statements after the '&' are grouped together as the or only for them and the statement before the '&' is by itself. And all of those are grouped within the when statement where they should be. Am I missing something? Thanks! – DeveloperGuy May 09 '19 at 18:50
  • @Barmar That makes sense but I don't think it really solves my problem. The dataset I'm working with is hundreds of millions of entries and I have it limited to only one million for testing. The column that is being compared is a column that only contains a char. This column only has one of four possible characters so even though that's the case the resulting count should be the same shouldn't it? – DeveloperGuy May 09 '19 at 18:53
  • Changing the parentheses changes the logic of the matching, because of the precedence of `&&` and `||`. – Barmar May 09 '19 at 18:54
  • Never mind, I misread the queries before. You have so many extra parentheses that make it hard to see where they match up. – Barmar May 09 '19 at 19:01
  • I suggest you try it with a smaller dataset, like 100 rows, and look at the difference in results. – Barmar May 09 '19 at 19:01
  • @Barmar Yeah, I know its a tad confusing. But the codebase I'm translating over to python and Pyspark has a 350+ select sum(case when) statement that I'm trying to find a way of converting nicely and I just can't see a nice way besides the withColumn() I'll break down the dataset and see if I can find something. Thanks! – DeveloperGuy May 09 '19 at 19:07
  • I'm just saying that you don't need the parentheses around `(test_data.col1 == 'str')` and `(test_data.col2.contains('X'))` – Barmar May 09 '19 at 19:08
  • Use the parentheses that are important for grouping the `&` and `|` operators, but don't put parentheses around the individual comparisons. – Barmar May 09 '19 at 19:09
  • You're right I can take those out. Thanks for pointing that out. I think I just had them in there for when I was testing between the two options I mentioned. – DeveloperGuy May 09 '19 at 19:11
  • @DeveloperGuy I think you will need to provide a small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) for us to help you debug. – pault May 09 '19 at 19:22

1 Answers1

0

== matches the entire string .contains() search in to the string looking for the string Example: "yesterday"== "day" is false but "yesterday".Contains("day") is true

delfosk9
  • 73
  • 1
  • 9
  • This makes sense but it doesn't solve my problem, which may be my fault with how it's worded? But thank you for answering. The column that is being compared is a column that only contains a char. This column only has one of four possible characters so even though that's the case the resulting count should be the same shouldn't it? There is no string to compare against, just a single char. – DeveloperGuy May 09 '19 at 18:55
  • This doesn't explain why a query that uses `==` returns *more* results than the query that uses the more permissive `contains()`. – Barmar May 09 '19 at 18:55