
I have two lists of objects. Let's call the lists a and b. The objects (for our purposes) are defined as below:

from fuzzywuzzy import fuzz  # fuzz.ratio checks whether strings are "similar enough"

class MyObj:
    def __init__(self, string: str, integer: int):
        self.string = string
        self.integer = integer

    def __eq__(self, other):
        # The integers must match exactly; the strings only fuzzily.
        if self.integer != other.integer:
            return False
        return fuzz.ratio(self.string, other.string) > 90

Now what I want to achieve is to check which objects in list a are "in" list b (i.e. compare equal via == to some object in list b).

Currently I'm just looping through them as follows:

for obj in a:
    for other_obj in b:
        if obj == other_obj:  # compare the elements, not the lists themselves
            # <do something>
            break

I strongly suspect that there is a faster way to implement this. The lists are long, up to about 100,000 objects each, so this is a big bottleneck in my code.

I looked at this answer, Fastest way to search a list in Python, and it suggests that sets work much better. I'm a bit confused by this though:

  • How significant is the "removal of duplicates" speedup? I don't expect to have many duplicates in my lists.

  • Can sets remove duplicates and hash properly when I have defined __eq__ the way I have?

  • How would this compare with pre-ordering the list, and using something like binary search? A set is unordered...

So what is the best approach here? Please provide implementation guidelines in the answer as well.


1 Answer


TL;DR: when using fuzzy comparison techniques, sets and sorting can be very difficult to work with without some normalization method. You can try to be smart about reducing the search space as much as possible, but take care to do it consistently.

If a class defines __eq__ and not __hash__, it is not hashable.

For instance, consider the following class:

class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)

Now, if you were to try to create a set with these elements:

>>> {Name('Neil', 'Stackoverflow-user')}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Name'

So, in the case of Name, you would simply define a __hash__ method. In your case, however, this is more difficult, since you have fuzzy equality semantics. The only way I can think of to get around this is to have a normalization function that you can prove is consistent, and to hash the normalized string instead of the actual string. See Floats as dictionary keys for an example of needing normalization in order to use a "fuzzy" type like floats as keys.
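
To make that concrete, here is a minimal sketch. The __hash__ for Name follows directly from its __eq__; the normalize function for MyObj is purely hypothetical, and the whole approach is only correct if every pair of strings your fuzzy __eq__ calls equal normalizes to the same value, which is exactly the hard thing to prove:

class Name:
    ...  # __init__, __repr__ and __eq__ as above

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares, so equal
        # objects are guaranteed to hash equally -- the invariant
        # that sets and dicts rely on.
        return hash((self.first, self.last))

def normalize(s: str) -> str:
    # Hypothetical normalizer (an assumption, not from the question):
    # collapse case and whitespace.
    return ' '.join(s.lower().split())

class MyObj:
    ...  # __init__ and __eq__ as in the question

    def __hash__(self):
        # Hash the normalized string plus the exactly-compared integer.
        return hash((normalize(self.string), self.integer))

With that in place, {Name('Neil', 'Stackoverflow-user')} no longer raises, but for MyObj the set-based membership test is only as reliable as the normalization.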

For sorting and binary searching, since you are fuzzy-searching, you still need to be careful. As an example, assume equality is determined by being within a certain Levenshtein distance. Then book and hook will be similar to each other (distance 1), and hook and hack are within distance 2 of each other, but book and hack are a distance of 3 apart: fuzzy equality is not transitive, so no sort key can reliably place "equal" items next to each other. How, then, would you define a good sort order for fuzzy searching? (The snippet below makes this concrete.)
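
A quick demonstration, using a standard dynamic-programming edit distance written out by hand (fuzzywuzzy itself exposes similarity ratios rather than raw distances):

def levenshtein(s: str, t: str) -> int:
    # Classic Wagner-Fischer edit distance, computed row by row.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('book', 'hook'))  # 1
print(levenshtein('hook', 'hack'))  # 2
print(levenshtein('book', 'hack'))  # 3

With "equal" defined as distance <= 2, book == hook and hook == hack, yet book != hack, so there is no ordering under which all "equal" pairs end up adjacent.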

One thing to try would be some form of group-by/bucketing, like a dictionary of the type Dict[int, List[MyObj]], where instances of MyObj are classified by their one exactly-compared field, self.integer. Then you only need to compare the smaller sub-lists. This would at least reduce the search space by clustering; a sketch follows.
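
A minimal sketch of the bucketing idea, keeping the question's variable names and its <do something> placeholder:

from collections import defaultdict
from typing import Dict, List

# Group b once by the exact key, so each object in a is fuzzy-compared
# only against the bucket that shares its integer.
buckets: Dict[int, List[MyObj]] = defaultdict(list)
for other_obj in b:
    buckets[other_obj.integer].append(other_obj)

for obj in a:
    for other_obj in buckets.get(obj.integer, []):
        if obj == other_obj:
            # <do something>
            break

If the integers are spread over k distinct values, each object in a is compared against roughly len(b)/k candidates instead of all of b, at the cost of a single O(len(b)) pass to build the buckets.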

  • Just an _at first glance_ comment: you should probably put your **TL;DR** section on top, since that is basically its purpose. – scharette Aug 07 '18 at 13:15
  • @scharette point taken. I generally like to put the __TL;DR__ at the bottom of the post to encourage the reader to read the whole post (especially when it's three paragraphs or less). But this post is long enough that moving it to the top is worth it. Thanks! – Edward Minnix Aug 07 '18 at 13:25
  • Yes, as you probably already know, **TL;DR** stands for _"too long; didn't read"_, so putting it at the bottom to encourage people to read through your post seems counterproductive. If you want to encourage them to read, don't include one (even though it is relevant here). Anyways, really good quality on your post. Good job. – scharette Aug 07 '18 at 13:28
  • @EdwardMinnix Thanks for your answer. I did not know about the __hash__ method, so thanks for explaining that; it clears up what I suspected would be an issue but wasn't sure why. It is definitely clear to me now why hashing won't work. I hadn't thought of the problems you mention with the sorting algorithm. I assumed the first letters of "similar" words would always be the same, so I could partially sort on that, but given the source of my data, while that is very likely, it is not necessarily true. – Neil Aug 07 '18 at 14:22
  • @EdwardMinnix I have implemented bucketing on the integers. I form dictionaries where the integers are the keys and the values are the sub-lists of objects that have that integer. I did not think this would provide a massive speedup, since the fuzzywuzzy calculation is skipped anyway when the integers don't match, but I was wrong: it has provided a ~55% speedup overall (including the cost of bucketing). – Neil Aug 07 '18 at 14:24