
In Python, I am trying to write an algorithm that takes the characters from an email address and searches the page to estimate how likely a given string is to be the actual name of the person. I wrote a regular expression to grab all emails on a page, and now I want to write another one to find the person's name from the email (since the email is usually a subset of the name, or built from some of its characters).

I am using:

 self.reEmail = re.compile(r"\b(?!(?:.\B)*(.)(?:\B.)*\1)[char]+\b", re.IGNORECASE)

However, this only gives me single characters.

email: bjoel@email.edu

Name: Billy Joel (this is what I want to scrape).

However, the first letter of the email is not always the first letter of the first name.
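One way to frame the matching step is to score each candidate name against the part of the email before the `@`. The sketch below is only an illustration under assumptions: the function names `is_subsequence`, `name_score`, and `best_name` are hypothetical, the list of "common patterns" (first initial plus last name, and so on) is a guess at typical address schemes, and the fallback score is an arbitrary heuristic rather than a real probability.

```python
import re


def is_subsequence(short, long):
    """Return True if the characters of `short` appear in `long` in order."""
    it = iter(long)
    # `ch in it` consumes the iterator, so order is enforced.
    return all(ch in it for ch in short)


def name_score(local_part, candidate):
    """Heuristic score in [0, 1]: how plausibly is `candidate` the name
    behind the email local part `local_part`? (Assumed scoring scheme.)"""
    local = re.sub(r"[^a-z]", "", local_part.lower())
    words = re.findall(r"[a-z]+", candidate.lower())
    if not local or not words:
        return 0.0

    first, last = words[0], words[-1]
    # Assumed common address schemes, e.g. "bjoel", "billyj", "joelb".
    patterns = {
        first + last, last + first,
        first[0] + last, first + last[0],
        last + first[0], last[0] + first,
        first, last,
    }
    if local in patterns:
        return 1.0

    # Fallback: the local part picks letters out of the name in order.
    joined = "".join(words)
    if is_subsequence(local, joined):
        return len(local) / len(joined)
    return 0.0


def best_name(local_part, candidates):
    """Return the candidate with the highest heuristic score."""
    return max(candidates, key=lambda c: name_score(local_part, c))


print(best_name("bjoel", ["Elton John", "Billy Joel"]))  # Billy Joel
```

Because the patterns are tried in both orders, this does not rely on the first letter of the email being the first-name initial, although it will still miss nicknames and addresses unrelated to the name.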

user2006018
  • I am not trying to validate an email address; I want to apply an algorithm like difflib or Levenshtein distance to grab the name related to the email address. However, I tried both: difflib depends on character order, and Levenshtein only calculates the edit distance. – user2006018 Sep 18 '16 at 19:55
  • @Jan: maybe a little too quick! Ok, don't close - don't close... – Jean-François Fabre Sep 18 '16 at 20:05
  • Use [`\b(\S+@\S+)\b`](https://regex101.com/r/vF9iZ2/1) and have a look at http://stackoverflow.com/questions/18134437/where-can-the-documentation-for-python-levenshtein-be-found-online – Jan Sep 18 '16 at 20:05
  • @Jan I used that regex to get all the emails, but now I want to search the DOM again and try to find the person's name that corresponds to each email. That regex only provides the emails, which I already have. – user2006018 Sep 18 '16 at 20:10
  • I hope that it's not for spam purposes. I don't think that Billy Joel would appreciate this. – Jean-François Fabre Sep 18 '16 at 21:03
  • There is no regular expression that can match `Billy Joel` from `bjoel@email.edu` simply because the given name is not in there completely. But that's only one of the many problems you are facing. Solving all would be far too broad for SO. – Klaus D. Sep 19 '16 at 01:08
  • It is a duplicate because once you parse it, you will have the real name in the correctly notated capture buffers from the subexpression matches. – tchrist Sep 22 '16 at 02:53
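To make the two steps discussed in the comments concrete (first collect the emails, then collect strings that might be names), here is a rough sketch. `EMAIL_RE` is the pattern suggested by Jan above; `NAME_RE` and the function `emails_and_candidates` are assumptions added for illustration, and the "two or more capitalized words" rule will miss many real names and pick up non-names.

```python
import re

# Pattern from the comment above: any run of non-space characters around "@".
EMAIL_RE = re.compile(r"\b(\S+@\S+)\b")

# Assumed, very rough name pattern: two or more capitalized words in a row.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def emails_and_candidates(text):
    """Return (emails, local_parts, candidate_names) found in `text`."""
    emails = EMAIL_RE.findall(text)
    # The local part (before "@") is what gets compared against names.
    local_parts = [email.split("@", 1)[0] for email in emails]
    candidates = NAME_RE.findall(text)
    return emails, local_parts, candidates


text = "Faculty: Billy Joel (bjoel@email.edu), Elton John (ejohn@email.edu)"
emails, local_parts, candidates = emails_and_candidates(text)
print(emails)       # ['bjoel@email.edu', 'ejohn@email.edu']
print(local_parts)  # ['bjoel', 'ejohn']
print(candidates)   # ['Billy Joel', 'Elton John']
```

Each local part can then be compared against every candidate with whatever similarity measure works best for the page at hand.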
