
I have a dictionary with the following structure:

{
    1: {"names": ["name1_A", "name1_B", ...]},
    2: {"names": ["name2_A", "name2_B", ...]},
    ...
}

where name1_A and name1_B are synonyms/aliases/different ways to write the same name, whose ID is 1. name2_A and name2_B are aliases of the same name, whose ID is 2, and so on.

I need to write a function which takes a user input and returns the ID of the name whose alias is most similar to the user input.

I know it's not very intuitive to understand what I mean, so here's an example. Let's say this is my dictionary:

{
    1: {"names": ["James", "Jamie"]},
    2: {"names": ["Karen", "Karyn"]}
}

The user types in the word Jimmy. Since the closest match to Jimmy from the dictionary is Jamie, the function has to return the ID 1.

If the user types in the word Karena, since the closest match is Karen, the function has to return the ID 2.

I think the best way to get the closest match is to use difflib's get_close_matches(). However, that function takes a list of possibilities as an argument, and I cannot think of a way to correctly use it in my function. Any help would be appreciated.

    How are you defining "most similar"? – Scott Hunter Jun 29 '17 at 21:12
  • When you say `Jamie` is the closest match to `Jimmy`, are you comparing the highest number of matching characters in both strings, or do you have some other logic behind it? – Chiheb Nexus Jun 29 '17 at 21:13
  • Search for "dynamic programming spell checker", perhaps - this is just one set of rules, and not necessarily the best. ie. maybe it would be better to use the phonetic distance? – user2864740 Jun 29 '17 at 21:14
  • @ScottHunter @ChihebNexus @user2864740 I did not actually think about the algorithm to find the most similar words, because I usually leave that job to [`difflib.get_close_matches`](https://docs.python.org/3/library/difflib.html#difflib.get_close_matches). This kind of algorithm is not part of the function I have to create. – user2747949 Jun 29 '17 at 21:16
  • Anyway, two "problems": one problem is finding the closest match in a list. Imagine if the input was: `[("James", 1), ("Jamie",1), ("Karena", 2),..]`. Once that is solved (ie. input_name -> number) it's merely a matter of changing traversal through the compound structure or searching through said structure after determining the best match (or building an intermediate look-back mapping, as above). – user2864740 Jun 29 '17 at 21:16
  • And for a list, the naive solution is `best = min(list, key=f)`, where the function `f` returns the distance by some criteria. – user2864740 Jun 29 '17 at 21:23
  • @user2864740 I think the "second problem" is what I'm trying to solve. So let's suppose I already have a way to find the most similar word given a user input and a list of words. – user2747949 Jun 29 '17 at 21:26
  • Your dictionary structure is not ideal. The `{"names": [` is unnecessary. Why not represent it as `{ 1: ["name1_A", "name1_B", ...], 2: ["name2_A", "name2_B", ...], ... }`? Are there other keys, apart from `names`? – Jedi Jun 29 '17 at 21:26
  • @Jedi Exactly, there _are_ indeed other keys apart from `names`, which are not relevant to the question – user2747949 Jun 29 '17 at 21:27
  • @user2747949 That's easy then: 1) Extract the set of names 2) Find the closest name 3) Re-iterate the dictionary keys, searching for the associated list (and the key to that list) that contains the name. This is then: "How to iterate the keys in a dictionary?", "How to append to a list?", and "How to find an item in a list?" (or subtle variations of these with comprehensions). – user2864740 Jun 29 '17 at 21:28
  • @user2864740 This is a possible solution. However, I think there may be more efficient ways to achieve the same result (I may be wrong) – user2747949 Jun 29 '17 at 21:33
  • It is plenty 'efficient', and working is better than 100% efficient nothingness. Unless the function is called an impressive number of times there is probably not a point building a look-back table. – user2864740 Jun 29 '17 at 21:35
  • @user2864740 The reason I care about efficiency is that the actual dictionary has about 3500 keys in it, and each key has 4-5 names. Also, the function gets called fairly often and the resources of the machine it will be running on are pretty limited. – user2747949 Jun 29 '17 at 21:42
  • If it's called often enough that it matters then pre-create the structures: a source list (the list of names) and the mapping of name back to its key/ID (a dictionary of name -> id). The complexity is now entirely on the "get closest" method. Simple. (And it is likely one could have written the first suggestion[s] *and* the second by now..) – user2864740 Jun 29 '17 at 21:47
  • If you want to speed this up, create a separate list of all `names` and a dict mapping each `name` to its id. Then apply `fuzzywuzzy` or `difflib` once by checking each name once and find the appropriate key in a single lookup. – Jedi Jun 29 '17 at 21:49
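A minimal sketch of the precomputed-structures approach suggested in the comments, using the stdlib's `difflib.get_close_matches` as the question proposes. The dictionary is the question's example; `closest_id` and `name_to_id` are hypothetical names introduced here:

```python
import difflib

data_dict = {
    1: {"names": ["James", "Jamie"]},
    2: {"names": ["Karen", "Karyn"]},
}

# Pre-build the flat list of names and a reverse mapping name -> id,
# so each lookup only has to run the fuzzy match itself.
name_to_id = {name: key for key, value in data_dict.items()
              for name in value["names"]}
all_names = list(name_to_id)

def closest_id(user_input):
    # The default cutoff of 0.6 would reject looser matches like
    # "Jimmy" -> "Jamie", so lower it and take the single best hit.
    matches = difflib.get_close_matches(user_input, all_names, n=1, cutoff=0.0)
    return name_to_id[matches[0]] if matches else None

print(closest_id("Jimmy"))   # -> 1
print(closest_id("Karena"))  # -> 2
```

Building `name_to_id` once up front keeps each call to a single fuzzy pass over the names plus one dict lookup, which matters for the ~3500-key dictionary mentioned above.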

1 Answer


If you're open to 3rd party modules, there's a nice little module I like to use for this sort of thing called fuzzywuzzy, for fuzzy string matching in Python. It uses the Levenshtein distance metric to compute the distance between two strings. Here's an example of how you use it:

>>> from fuzzywuzzy import fuzz
>>> from functools import partial
>>> data_dict = {
...     1: {"names": ["James", "Jamie"]},
...     2: {"names": ["Karen", "Karyn"]}
... }
>>> input_str = 'Karena'
>>> f = partial(fuzz.partial_ratio, input_str)
>>> matches = { k : max(data_dict[k]['names'], key=f) for k in data_dict}
>>> matches
{1: 'James', 2: 'Karen'}
>>> { i : (matches[i], f(matches[i])) for i in matches }
{1: ('James', 40), 2: ('Karen', 100)}

Now, you can extract Karen since it has the highest score.

I've had to call the scoring function twice for the purpose of this demo, but depending on how you expand upon this example you should be able to do it just once.
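For instance, the two passes can be collapsed into one by scoring every (id, name) pair and keeping only the best id. The sketch below substitutes the stdlib's `difflib.SequenceMatcher` for fuzzywuzzy so it runs without a third-party install; swap in `fuzz.partial_ratio` if fuzzywuzzy is available (`best_id` is a hypothetical helper name):

```python
import difflib

data_dict = {
    1: {"names": ["James", "Jamie"]},
    2: {"names": ["Karen", "Karyn"]},
}

def best_id(user_input):
    # SequenceMatcher.ratio() is a 0..1 analogue of fuzz.ratio's 0..100 score.
    def score(name):
        return difflib.SequenceMatcher(None, user_input, name).ratio()
    # Return the id whose best-scoring alias beats every other id's best alias.
    return max(data_dict,
               key=lambda k: max(score(n) for n in data_dict[k]["names"]))

best_id("Karena")  # -> 2, since "Karen" scores highest
```

This is a single pass over all aliases per call, so there's no second dictionary comprehension to re-score the per-id winners.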

Another thing to note is that fuzz.partial_ratio is more lenient with its matches. For a stricter matching scheme, consider using fuzz.ratio.

You can peruse more fuzzy string matching examples here.

cs95