I am trying to find email addresses in a page's source code and match them with the first and last name of the person they belong to. The first step is to find each person's first and last name. I have a function that does this well and returns a list of full names.

The second step is to find the email address closest to that name, whether it appears before or after the name. So I am looking for both: the email before and the email after.

For that purpose I wrote these regexes:

for name in full_name_list:
        # full name followed by the email
        print(re.findall(name+'.*?([A-z0-9_.-]+?@[A-z0-9_.-]+?\.[A-z]+)', source))
        # email followed by full name
        print(re.findall('([A-z0-9_.-]+?@[A-z0-9_.-]+?\.\w+?.+?'+name+')', source))

Now assume that the source looks like the text below and that full_name_list = ['John Doe', 'James Henry', 'Jane Doe']:

" John Doe is part of our team and here is his email: johndoe@something.com. James Henry is also part of our team and here his email: jameshenry@something.com. Jane Doe is the team manager and you can contact her at that address: janedoe@something.com"

The first regex returns the name with the closest email after it, which is what I want. However, the second regex always starts from the first email it finds and stops when it matches the name, which is odd since I am asking for the fewest characters between the email and the name (or at least I think I am).

Is my assumption correct? If yes, what's happening? If not, how can I fix that?

martineau

2 Answers

The issue is that your pattern has .+? between the email and name patterns. Because the regex engine scans the string from left to right, matching starts at the first email and then extends to the leftmost occurrence of the name, potentially crossing any number of other emails. A lazy quantifier only shortens the end of a match; it never moves the start.
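A small check of that left-to-right behaviour, as a sketch on a made-up string (the text and addresses here are assumptions for illustration, not the asker's page):

```python
import re

text = "a@x.com then b@y.com then Jane Doe"
email = r"[\w.-]+@[\w.-]+\.\w+"

# Lazy .+? still starts at the first email; it only minimises what follows it.
lazy = re.search("(" + email + ").+?Jane Doe", text)
print(lazy.group(1))      # a@x.com

# Forbidding another email start inside the gap forces the closest email.
tempered = re.search("(" + email + ")(?:(?!" + email + ").)*?Jane Doe", text)
print(tempered.group(1))  # b@y.com
```

The first search fails to pick the nearest address because the engine never revisits the start position once a match from there succeeds.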

You may use

import re

full_name_list = ['John Doe', 'James Henry', 'Jane Doe']
source = r" John Doe is part of our team and here is his email: johndoe@something.com. James Henry is also part of our team and here his email: jameshenry@something.com. Jane Doe is the team manager and you can contact her at that address: janedoe@something.com"
for name in full_name_list:
    # escape the name and match it as whole words only
    name_re = r'\b' + re.escape(name) + r'\b'
    # full name followed by the email
    name_email = re.search(name_re + r'.*?([\w.-]+@[\w.-]+\.\w+)', source)
    if name_email:
        print('Email after "{}": {}'.format(name, name_email.group(1)))
    # email followed by full name
    email_name = re.search(r'\b([\w.-]+@[\w.-]+\.\w+)(?:(?![\w.-]+@[\w.-]+\.\w).)*?' + name_re, source, re.S)
    if email_name:
        print('Email before "{}": {}'.format(name, email_name.group(1)))

See the Python demo.

Output:

Email after "John Doe": johndoe@something.com
Email after "James Henry": jameshenry@something.com
Email before "James Henry": johndoe@something.com
Email after "Jane Doe": janedoe@something.com
Email before "Jane Doe": jameshenry@something.com
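If the goal is simply the nearest email on each side of a name, a different technique from the one above is to locate the name and the emails separately and compare their offsets. This is only a sketch: `nearest_emails` and `EMAIL_RE` are hypothetical names, and it assumes only the first occurrence of each name matters.

```python
import re

EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

def nearest_emails(name, source):
    """Return (email_before, email_after) around the first occurrence of name."""
    found = re.search(r"\b" + re.escape(name) + r"\b", source)
    if not found:
        return None, None
    before = after = None
    for m in EMAIL_RE.finditer(source):
        if m.end() <= found.start():
            before = m.group()   # overwritten until the last email before the name
        elif m.start() >= found.end():
            after = m.group()    # first email after the name, then stop
            break
    return before, after

source = (" John Doe is part of our team and here is his email: johndoe@something.com. "
          "James Henry is also part of our team and here his email: jameshenry@something.com. "
          "Jane Doe is the team manager and you can contact her at that address: janedoe@something.com")

for name in ["John Doe", "James Henry", "Jane Doe"]:
    print(name, nearest_emails(name, source))
```

This trades the single regex for one pass over the emails, which may be easier to extend (for example, to compare the two distances and keep only the closer address).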

Notes:

  • [A-z] matches more than just ASCII letters: it also matches [, \, ], ^, _ and `. You most probably want \w instead of [A-Za-z0-9_]. Note that \w also matches any Unicode letters and digits; you can turn this behavior off by passing the re.ASCII flag (e.g. to re.compile).
  • \b is a word boundary. It is advisable to add it at the start and end of the name variable to match names as whole words.
  • (?:(?![\w.-]+@[\w.-]+\.\w).)*? is the fix for your current issue: it matches the shortest text between the email and the subsequent name without crossing another email. It matches any char (the . inside (?:...)), zero or more times and as few as possible (*?), provided that char is not the start of a match of the email pattern [\w.-]+@[\w.-]+\.\w.
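The first two notes can be checked directly. A minimal sketch (the sample strings are made up for illustration):

```python
import re

# [A-z] spans ASCII codes 65-122, which includes six non-letter characters.
extras = [ch for ch in map(chr, range(ord("A"), ord("z") + 1))
          if re.fullmatch(r"[A-z]", ch) and not ch.isalpha()]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# \w matches Unicode letters by default; re.ASCII restricts it to [A-Za-z0-9_].
print(bool(re.fullmatch(r"\w+", "Stribiżew")))            # True
print(bool(re.fullmatch(r"\w+", "Stribiżew", re.ASCII)))  # False

# \b stops a name from matching inside a longer word.
print(bool(re.search(r"Jane Doe", "Jane Doering")))       # True
print(bool(re.search(r"\bJane Doe\b", "Jane Doering")))   # False
```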
Wiktor Stribiżew
  • Thank you so much Wiktor for taking the time to write such a complete answer. It works very well indeed. Although I don't fully understand the purpose of '?!' at the beginning of `(?![\w.-]+@[\w.-]+\.\w).` and the '.' at the end. – Thomas AMET Apr 14 '20 at 00:57
  • Just to be sure that I fully understand what I am doing: if I write `r"(?:\w+)@\w+\.\w+"`, will it match both addresses in `r"test@test.com & @test2.com"`? `?:` means that the group is not a mandatory match, right? – Thomas AMET Apr 14 '20 at 01:26
  • @ThomasAMET You may read more about this [*tempered greedy token*](https://stackoverflow.com/a/37343088/3832970) construct. `\w*@\w+\.\w+` will match `test@test.com` and `@test2.com`. – Wiktor Stribiżew Apr 14 '20 at 06:56
  • That makes perfect sense now and it is indeed a powerful technique to know! The article you shared is really clear and I can only recommend it for those who might read this thread. Thanks again @Wiktor – Thomas AMET Apr 14 '20 at 15:59
0

First, separate the local part of the email from the domain so that the @something.com is in a different column.

Next, it sounds like you're describing an algorithm for fuzzy matching called Levenshtein distance. You can use a module designed for this, or perhaps write a custom one:

import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Fill the first column and the first row with the prefix lengths
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                cost = 2 if ratio_calc else 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,      # Cost of insertions
                                     distance[row-1][col-1] + cost) # Cost of substitutions

    # Read the result from the bottom-right cell instead of the leftover loop
    # variables, so that empty strings do not raise a NameError
    edits = distance[rows-1][cols-1]
    if ratio_calc:
        # Computation of the Levenshtein Distance Ratio (two empty strings count as identical)
        total = len(s) + len(t)
        return (total - edits) / total if total else 1.0
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(edits)

Now you can get a numerical value for how similar they are. You'll still need to establish a cutoff as to what number is acceptable to you.

Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)

There are similarity algorithms other than Levenshtein. You might try Jaro-Winkler, or perhaps trigram similarity.

I got this code from: https://www.datacamp.com/community/tutorials/fuzzy-string-python
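As a sketch of how the cutoff idea could be applied to names and emails (this part is not from the linked tutorial): compare the name, lowercased and without spaces, against the part of the address before the @. The helper names and the 0.8 threshold below are assumptions for illustration, and the distance function is a compact pure-Python variant rather than the NumPy one above.

```python
def levenshtein(s, t):
    """Edit distance using two rolling rows (substitution cost 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def name_email_similarity(name, email):
    """Similarity in [0, 1] between a name and the local part of an email."""
    local = email.split("@", 1)[0].lower()
    key = name.replace(" ", "").lower()
    longest = max(len(local), len(key))
    return 1.0 if longest == 0 else 1 - levenshtein(local, key) / longest

CUTOFF = 0.8  # hypothetical threshold; tune it on real data

for name, email in [("John Doe", "johndoe@something.com"),
                    ("James Henry", "jh@something.com")]:
    score = name_email_similarity(name, email)
    print(name, email, round(score, 2), score >= CUTOFF)
```

Note that an initials-style address such as jh@something.com scores low with this measure, so edit distance alone may not be enough for such cases.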

Josh
  • Thanks Josh, I will have a look at that algorithm. However, I don't get why you want me to split the @something.com into a different column... I don't use columns at all. Happy to be clarified on that point. Let me add that in my real case the email for James Henry might be jh@something.com – Thomas AMET Apr 13 '20 at 15:12
  • Well I assume you want to compare the email portion to the name column. If you use the @something.com as a part of the email, then Levenshtein will calculate all of those deletions necessary to match the name, which means that the length of the domain would hurt your similarity score. So, johndoe@aol.com <=> John Doe would get a lower score than if you had removed the domain and compared johndoe <=> John Doe. Another thing you might want to do before the comparison is remove all spaces as well. – Josh Apr 13 '20 at 15:15
  • That is actually an excellent idea but not what I intended to do at the beginning. My question here is also how to change the second regex to have the fewest characters between the email and the name. – Thomas AMET Apr 13 '20 at 15:15
  • Oh sorry, I must have misunderstood. What is source in this code example? – Josh Apr 13 '20 at 15:21
  • I made it up.. I actually work on that page for some tests: https://www.actibel.be/contact/ – Thomas AMET Apr 13 '20 at 15:41