1

Furthering the question from this post here, I have two unordered lists and would like to find if they are equal taking duplicates into account and not caring about the order. And if they are not equal, to find which elements from which list is not in the other one.

Taking the example from the post mentioned above, assuming the list on the left of the equals signs is L1 and list on the right is L2,

    L1                               L2
['one', 'two', 'three'] == ['one', 'two', 'three'] :  true
['one', 'two', 'three'] == ['one', 'three', 'two'] :  true
['one', 'two', 'three'] == ['one', 'two', 'three', 'three'] :  false, L1:'three'
['one', 'two', 'three'] == ['one', 'two', 'three', 'four'] :  false, L1:'four'
['one', 'two', 'three'] == ['one', 'two', 'four'] :  false, L1:'four', L2:'three'
['one', 'two', 'three'] == ['one'] :  false, L2:'two','three'

The output does not have to be exactly like I have depicted, but basically I would like to know if the comparison of two lists are true or false, and if false which elements in L2 are not in L1 and which elements in L1 are not in L2.

The solution provided by @Katriel was to use the collections function like so:

import collections
compare = lambda x, y: collections.Counter(x) == collections.Counter(y)

But it doesn't provide information on which elements are mismatched. Is there an efficient method to do this in pyspark?

ZygD
  • 22,092
  • 39
  • 79
  • 102
thentangler
  • 1,048
  • 2
  • 12
  • 38

1 Answers1

2

If retaining duplicate matches like row3 in your example is not necessary, then you can use double array_except with concat to get your output, and use 'when,otherwise' to get your boolean based on size of array produced. (spark2.4+)

df.withColumn("result", F.concat(F.array_except("L1","L2"),F.array_except("L2","L1")))\
  .withColumn("Boolean", F.when(F.size("result")==0,F.lit(True)).otherwise(F.lit(False))).show(truncate=False)

+-----------------+------------------------+-------------+-------+
|L1               |L2                      |result       |Boolean|
+-----------------+------------------------+-------------+-------+
|[one, two, three]|[one, two, three]       |[]           |true   |
|[one, two, three]|[one, three, two]       |[]           |true   |
|[one, two, three]|[one, two, three, three]|[]           |true   |
|[one, two, three]|[one, two, three, four] |[four]       |false  |
|[one, two, three]|[one, two, four]        |[three, four]|false  |
|[one, two, three]|[one]                   |[two, three] |false  |
+-----------------+------------------------+-------------+-------+

You wouldn't need to drop duplicates before compare as array_except does that for you automatically.

murtihash
  • 8,030
  • 1
  • 14
  • 26
  • Thank You! It is amazing that the array_except can identify the mismatch in a single go. Is there a way to add the origin list next to the mismatch or in a separate column? – thentangler Mar 05 '20 at 17:53
  • glad to help. what do you mean by origin list? like both l1 and l2? – murtihash Mar 05 '20 at 19:29
  • Hi, Yes, Basically I would like to know from which list which mismatch occurred. So in this case for row 4, the result could be `[three:L2, four:L1]` or perhaps in a another column where is would be [L2, L1] – thentangler Mar 09 '20 at 15:13