I have two lists of objects. Let's call the lists a and b. The objects (for our intents and purposes) are defined as below:
class MyObj:
def __init__(self, string: str, integer: int):
self.string = string
self.integer = integer
def __eq__(self, other):
if self.integer == other.integer:
pass
else:
return False
if fuzz.ratio(self.string, other.string) > 90: # fuzzywuzzy library checks if strings are "similar enough"
return True
else:
return False
Now what I want to achieve is to check which objects in list a are "in" list b (return true against == when compared to some object in list b).
Currently I'm just looping through them as follows:
for obj in a:
for other_obj in b:
if a == b:
<do something>
break
I strongly suspect that there is a faster way of implementing this. The lists are long. Up to like 100 000 objects each. So this is a big bottleneck in my code.
I looked at this answer Fastest way to search a list in python and it suggests that sets work much better. I'm a bit confused by this though:
How significant is the "removal of duplicates" speedup? I don't expect to have many duplicates in my lists.
Can sets remove duplicates and properly hash when I have defined the eq the way I have?
- How would this compare with pre-ordering the list, and using something like binary search? A set is unordered...
So what is the best approach here? Please provide implementation guidelines in the answer as well.