1

I am having a bit of trouble in terms of runtime for an algorithm that matches names with the most likely email address. The function itself works well (in that it pairs the name and email address correctly), but the runtime is so grand that it is difficult to implement on large data sets. I am a beginner at coding and would love to hear what solutions you guys might offer.

quick note I implemented Levenshtein's algorithm here. If there are more efficient algorithms, comment below!

    from string import digits
    import copy
    import re
    # levenshtein algorithm found on https://www.python-course.eu/levenshtein_distance.php
    def call_counter(func):
        def helper(*args, **kwargs):
            helper.calls += 1
            return func(*args, **kwargs)
        helper.calls = 0
        helper.__name__= func.__name__
        return helper
    def memoize(func):
        mem = {}
        def memoizer(*args, **kwargs):
            key = str(args) + str(kwargs)
            if key not in mem:
                mem[key] = func(*args, **kwargs)
            return mem[key]
        return memoizer
    @call_counter
    @memoize    
    def levenshtein(s, t):
        if s == "":
            return len(t)
        if t == "":
            return len(s)
        if s[-1] == t[-1]:
            cost = 0
        else:
            cost = 1

        res = min([levenshtein(s[:-1], t)+1,
                   levenshtein(s, t[:-1])+1, 
                   levenshtein(s[:-1], t[:-1]) + cost])
        return res

    def emailmatch(emails_file,name_file):
        name_email_match = {} #store the matching emails in a dictionary
        with open(name_file, 'r') as names:
            match_name = 0
            for individual in names:
                with open(emails_file,'r') as address_emails:
                    first_name = individual[:(individual.index(" "))].lower()
                    last_name = individual[(individual.rindex(" ")):].lower()
                    full_name = (first_name + last_name).lower()
                    full_name_period = (first_name+"."+last_name).lower()
                    best_match = "" #this holds the best matching email
                    minimum = 999
                    for emails in address_emails:
                        email = emails[0:(emails.index('@'))]
                        temp = min(levenshtein(last_name,email),
                                   levenshtein(first_name,email),
                                   levenshtein(full_name,email),
                                   levenshtein(full_name_period,email))
                        if (temp < minimum):
                            minimum = temp
                            best_match = emails
                    name_email_match[individual] = best_match
        return name_email_match
    emailmatch('emails.txt', 'names.txt')
DaTerpay
  • 11
  • 2
  • 1
    I guess, this will get more feedback, when asked at [CodeReview](https://codereview.stackexchange.com/). – Mr. T Nov 30 '17 at 19:50
  • See [here](https://stackoverflow.com/q/2460177/1016065) for alternative algorithms. Also, check [PEP8](https://www.python.org/dev/peps/pep-0008/) for recommended coding style. Your question is [rather broad for this Q&A](https://stackoverflow.com/help/how-to-ask), you may want to post it on [Code Review](https://codereview.stackexchange.com/) instead. – RolfBly Nov 30 '17 at 20:02
  • It would help a lot if you can describe the problem and what your algorithm does. Deriving that from just the code is pretty tedious. – Nico Schertler Nov 30 '17 at 21:17
  • The Levenshtein Algorithm pretty much returns a number which approaches 0 as the strings get more similar. I made a nested for loop to go through each individual name and each email. For each name I would compare the name and current email and store the Levenshtein value in a counter and the email name. if a different email was a better match (aka the Levenshtein number was lower), I would switch it to the current email. @NicoSchertler – DaTerpay Dec 01 '17 at 01:22

0 Answers0