comparing two unordered lists and finding which elements are mismatched in pyspark

Question

Furthering the question from this post here, I have two unordered lists and would like to find if they are equal taking duplicates into account and not caring about the order. And if they are not equal, to find which elements from which list is not in the other one.

Taking the example from the post mentioned above, assuming the list on the left of the equals signs is L1 and list on the right is L2,

    L1                               L2
['one', 'two', 'three'] == ['one', 'two', 'three'] :  true
['one', 'two', 'three'] == ['one', 'three', 'two'] :  true
['one', 'two', 'three'] == ['one', 'two', 'three', 'three'] :  false, L1:'three'
['one', 'two', 'three'] == ['one', 'two', 'three', 'four'] :  false, L1:'four'
['one', 'two', 'three'] == ['one', 'two', 'four'] :  false, L1:'four', L2:'three'
['one', 'two', 'three'] == ['one'] :  false, L2:'two','three'

The output does not have to be exactly like I have depicted, but basically I would like to know if the comparison of two lists are true or false, and if false which elements in L2 are not in L1 and which elements in L1 are not in L2.

The solution provided by @Katriel was to use the collections function like so:

import collections
compare = lambda x, y: collections.Counter(x) == collections.Counter(y)

But it doesn't provide information on which elements are mismatched. Is there an efficient method to do this in pyspark?

Yes, The examples above are for distinct scenarios of L1/L2 comparisons — thentangler, Mar 04 '20 at 03:37
for row 3, the output is false and "three". Is handling of duplicates like that completely necessary for your use case? — murtihash, Mar 04 '20 at 03:49
Not really. I guess I can check just for duplicates via another method. So no, I dont need to detect duplicates like that, we can drop duplicates before the compare — thentangler, Mar 04 '20 at 04:13

score 2 · Accepted Answer · answered Mar 04 '20 at 04:23

If retaining duplicate matches like row3 in your example is not necessary, then you can use double array_except with concat to get your output, and use 'when,otherwise' to get your boolean based on size of array produced. (spark2.4+)

df.withColumn("result", F.concat(F.array_except("L1","L2"),F.array_except("L2","L1")))\
  .withColumn("Boolean", F.when(F.size("result")==0,F.lit(True)).otherwise(F.lit(False))).show(truncate=False)

+-----------------+------------------------+-------------+-------+
|L1               |L2                      |result       |Boolean|
+-----------------+------------------------+-------------+-------+
|[one, two, three]|[one, two, three]       |[]           |true   |
|[one, two, three]|[one, three, two]       |[]           |true   |
|[one, two, three]|[one, two, three, three]|[]           |true   |
|[one, two, three]|[one, two, three, four] |[four]       |false  |
|[one, two, three]|[one, two, four]        |[three, four]|false  |
|[one, two, three]|[one]                   |[two, three] |false  |
+-----------------+------------------------+-------------+-------+

You wouldn't need to drop duplicates before compare as array_except does that for you automatically.

Thank You! It is amazing that the array_except can identify the mismatch in a single go. Is there a way to add the origin list next to the mismatch or in a separate column? — thentangler, Mar 05 '20 at 17:53
glad to help. what do you mean by origin list? like both l1 and l2? — murtihash, Mar 05 '20 at 19:29
Hi, Yes, Basically I would like to know from which list which mismatch occurred. So in this case for row 4, the result could be `[three:L2, four:L1]` or perhaps in a another column where is would be [L2, L1] — thentangler, Mar 09 '20 at 15:13

comparing two unordered lists and finding which elements are mismatched in pyspark

1 Answers1