I am trying to match two lists of strings with names that are written differently and have partial matches:

list1 = {'ADELA SARABIA', 'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS'}
list2 = {'JOSE GARCIA', 'HANKS TOM', 'PEREZ LOPEZ JUAN', 'JOHN P. ADAMS'}

I want to keep the names that appear in both lists, even if they only match partially. Desired output:

matches = {'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS'}

I was using this code from another Stack Overflow question, but it doesn't work in my case:

lst = []
for i in list1:
    has_match = False
    for j in list2:
        if i.split()[0] in j:
            has_match = True
            print(i, j)
            if j not in lst:
                lst.append(j)
        if len(i) > 1:
            k = ' '.join(i.split()[:2])
            if k in j:
                has_match = True
                print(i, j)
                if j not in lst:
                    lst.append(j)
    if not has_match:
        lst.append(i + ' - not found')
  • You might need other special cases, like potentially ignoring a middle name or initial. The Levenshtein distance algorithm may or may not help too, depending on what sort of differences you might get – doctorlove Aug 24 '23 at 13:27
  • Does this answer your question? [How to retrieve partial matches from a list of strings](https://stackoverflow.com/questions/64127075/how-to-retrieve-partial-matches-from-a-list-of-strings) – JeffUK Aug 24 '23 at 13:27
  • @JeffUK thank you for your comment, but I can't use `startswith` with a list of strings – Daniela saba rosner Aug 24 '23 at 13:57
  • The linked answer also includes an `in` option. Note that it uses a filter, so the test is applied to each element, not to the list of strings as a whole – JeffUK Aug 24 '23 at 14:26

3 Answers

My first idea is to split the names and use a set intersection to determine partial matches (assuming each name in a full name is unique):

list1 = {'ADELA SARABIA', 'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS'}
list2 = {'JOSE GARCIA', 'HANKS TOM', 'PEREZ LOPEZ JUAN', 'JOHN P. ADAMS'}
matches = []

for name1 in list1:
    split1 = set(name1.split(' '))
    for name2 in list2:
        split2 = set(name2.split(' '))
        if split1.intersection(split2) == min(split1, split2, key=len):
            matches.append(name1)
            break

print(set(matches))

Output:

{'JUAN PEREZ', 'TOM HANKS', 'JOHN ADAMS'}
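The comparison against `min(..., key=len)` amounts to asking whether one set of words is a subset of the other. A minimal sketch of the same idea using the subset operator, assuming names are split on whitespace only (the helper names `tokens` and `partial_matches` are hypothetical, not part of the code above):

```python
def tokens(name):
    """Split a full name into a set of whitespace-separated words."""
    return set(name.split())


def partial_matches(names_a, names_b):
    """Return the names from names_a whose words are a subset or a
    superset of the words of at least one name in names_b."""
    token_sets_b = [tokens(n) for n in names_b]
    return {
        name
        for name in names_a
        if any(tokens(name) <= tb or tb <= tokens(name) for tb in token_sets_b)
    }


# Same data as the question, with the longer 'TOM HANKS HARDY' variant.
list1 = {'ADELA SARABIA', 'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS HARDY'}
list2 = {'JOSE GARCIA', 'HANKS TOM', 'PEREZ LOPEZ JUAN', 'JOHN P. ADAMS'}

print(partial_matches(list1, list2))
```

Because the subset test runs in both directions, it does not matter which list holds the shorter form of a name.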
B Remmelzwaal
  • Thank you for your answer, but what if I have this list: `list1 = {'ADELA SARABIA', 'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS HARDY'}`? The output then drops the match for TOM HANKS. Is it possible to add more separators (' ')? – Daniela saba rosner Aug 24 '23 at 13:45
  • @Danielasabarosner I assumed from your example that `list1` would always contain the shorter names. I have updated my answer accordingly. – B Remmelzwaal Aug 24 '23 at 14:22

Use the below list comprehension to get your result:

[name for name in list1 if any(part_name in other_name for part_name in name.split() for other_name in list2)]

Output:

['JUAN PEREZ', 'TOM HANKS', 'JOHN ADAMS']
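Note that the comprehension keeps a name as soon as any one of its words occurs as a substring of any name in `list2`, so two people who share only a given name would also match. A small sketch of that behaviour next to a stricter variant that requires every word to appear as a whole word; the function names `loose_matches` and `strict_matches` and the extra name 'JUAN GOMEZ' are made up for illustration:

```python
list1 = {'ADELA SARABIA', 'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS'}
list2 = {'JOSE GARCIA', 'HANKS TOM', 'PEREZ LOPEZ JUAN', 'JOHN P. ADAMS'}


def loose_matches(names_a, names_b):
    """Keep a name if ANY of its words is a substring of some other name."""
    return sorted(
        name
        for name in names_a
        if any(part in other for part in name.split() for other in names_b)
    )


def strict_matches(names_a, names_b):
    """Keep a name only if ALL of its words are whole words of one other name."""
    return sorted(
        name
        for name in names_a
        if any(all(part in other.split() for part in name.split()) for other in names_b)
    )


# 'JUAN GOMEZ' shares only the word JUAN with 'PEREZ LOPEZ JUAN'.
print(loose_matches({'JUAN GOMEZ'}, list2))   # kept by the loose test
print(strict_matches({'JUAN GOMEZ'}, list2))  # rejected by the strict test
print(strict_matches(list1, list2))
```

The strict variant only handles the case where the shorter form is in the first argument; swap the arguments or test both directions if that is not guaranteed.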

This works exactly as I expected:

def calculate_similarity(string1, string2):
    words1 = set(string1.split())
    words2 = set(string2.split())
    common_words = words1 & words2
    similarity = len(common_words) / min(len(words1), len(words2))
    return similarity

matches = set()

for item1 in list1:
    best_similarity = 0
    best_match = None

    for item2 in list2:
        similarity = calculate_similarity(item1, item2)
        if similarity > best_similarity:
            best_similarity = similarity
            best_match = item2

    if best_similarity > 0.7:  # Adjust the threshold as needed
        matches.add(best_match)

print("Matches:", matches)
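For readers trying to follow the score: it is the number of shared words divided by the word count of the shorter name, so 1.0 means the shorter name is fully contained in the longer one. The loop above collects the best match from `list2`; a hedged sketch of a variant that keeps the `list1` spelling instead, as in the desired output of the question (the function `matches_from_first` and its `threshold` parameter are hypothetical, and names are assumed to be non-empty):

```python
def calculate_similarity(string1, string2):
    """Shared words divided by the word count of the shorter name.
    Assumes both strings contain at least one word."""
    words1 = set(string1.split())
    words2 = set(string2.split())
    return len(words1 & words2) / min(len(words1), len(words2))


def matches_from_first(names_a, names_b, threshold=0.7):
    """Keep each name from names_a whose best score against names_b
    exceeds the threshold."""
    return {
        a
        for a in names_a
        if max((calculate_similarity(a, b) for b in names_b), default=0) > threshold
    }


list1 = {'ADELA SARABIA', 'JUAN PEREZ', 'JOHN ADAMS', 'TOM HANKS'}
list2 = {'JOSE GARCIA', 'HANKS TOM', 'PEREZ LOPEZ JUAN', 'JOHN P. ADAMS'}

print(calculate_similarity('JOHN ADAMS', 'JOHN P. ADAMS'))  # 2 shared / 2 words
print(calculate_similarity('JUAN PEREZ', 'JUAN GARCIA'))    # 1 shared / 2 words
print(matches_from_first(list1, list2))
```

With two-word names, a threshold of 0.7 means a single shared word (score 0.5) is not enough to count as a match.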
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Aug 27 '23 at 16:20