I'm trying to extract all e-mail addresses from HTML, except those immediately preceded by a `mailto:` prefix.

It works except when the address contains a `.` before the `@`, which is fairly common. I've tried look-aheads and different kinds of boundaries, but I can't get the regex to exclude the entire address when it contains a `.` and is preceded by `mailto:`.

Regex: `(?<!mailto:)(\b[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)`

Test string: Maecenas sed diam eget risus fake.name@domain.net varius blandit sit amet non magna. Sed posuere consectetur est at lobortis. Maecenas faucibus <a href="mailto:fake.name@domain.net">fake.name@domain.net</a> mollis interdum.

The first and last matches are fine, but the second match is `.name@domain.net`, when I don't want that occurrence to match at all.
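
For reference, here is a minimal reproduction of the behaviour described above, assuming Python's standard `re` module (the comments below use `re.findall`). The pattern and test string are copied from the question; the variable names are only illustrative. The stray match appears because `\b` also holds between `e` and `.`, and at that position the text immediately before is no longer `mailto:`, so the lookbehind passes.

```python
import re

# Pattern and test string copied from the question.
pattern = r'(?<!mailto:)(\b[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)'
text = (
    'Maecenas sed diam eget risus fake.name@domain.net varius blandit '
    'sit amet non magna. Sed posuere consectetur est at lobortis. '
    'Maecenas faucibus <a href="mailto:fake.name@domain.net">'
    'fake.name@domain.net</a> mollis interdum.'
)

# The lookbehind rejects a match starting at "f" right after "mailto:",
# but the engine then retries at ".", where \b holds and the lookbehind passes.
matches = re.findall(pattern, text)
print(matches)
# ['fake.name@domain.net', '.name@domain.net', 'fake.name@domain.net']
```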

metigue
  • See [Regex Pattern to Match, Excluding when… / Except between](https://stackoverflow.com/questions/23589174/regex-pattern-to-match-excluding-when-except-between) – Wiktor Stribiżew Sep 27 '19 at 11:17
  • There is a simpler approach: use BeautifulSoup to remove all `href`s with `mailto:` and then run `re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+', remaining_plain_text)` – Wiktor Stribiżew Sep 27 '19 at 11:22
  • Are you using BeautifulSoup already? – Wiktor Stribiżew Sep 27 '19 at 11:50
  • I have an approach that works by removing the hrefs with mailto but it's not suitable. I need a solution to do this in one regex – metigue Sep 27 '19 at 13:46
  • So, you are extracting? But you will still need to write some code. Try `[y for x,y in re.findall(r'(mailto:[^>\s]*?)?\b([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)', text) if not x]` - see [this demo](https://ideone.com/npSuOQ). – Wiktor Stribiżew Sep 27 '19 at 13:53
  • If you have the PyPI `regex` module, you can do it with a single call to `regex.findall`, like `regex.findall(r'\b(?<!mailto:[^>\s]*?)\w[\w.-]*@[\w.-]+\.[\w.-]+', text)` – Wiktor Stribiżew Sep 27 '19 at 14:06
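
The filtering approach from the comments can be sketched with the standard `re` module as below. The function names and the sample string are hypothetical. The second function is not from the thread: it is one possible single-pattern variant, which adds a second fixed-width lookbehind so that a match cannot start in the middle of an address. It assumes `mailto:` sits directly before the address.

```python
import re

EMAIL = r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+'

def emails_by_filtering(text):
    """Approach from the comments: capture an optional mailto: prefix,
    then keep only the addresses whose prefix group is empty."""
    pattern = r'(mailto:[^>\s]*?)?\b(' + EMAIL + r')'
    return [email for prefix, email in re.findall(pattern, text) if not prefix]

def emails_by_lookbehind(text):
    """Assumed alternative (not from the thread): the extra lookbehind
    stops a match from starting at '.name' inside 'fake.name'."""
    pattern = r'(?<!mailto:)(?<![a-zA-Z0-9._-])' + EMAIL
    return re.findall(pattern, text)

# Hypothetical sample, shortened from the question's test string.
sample = (
    'risus fake.name@domain.net varius '
    '<a href="mailto:fake.name@domain.net">fake.name@domain.net</a> mollis'
)

print(emails_by_filtering(sample))
print(emails_by_lookbehind(sample))
```

Both calls should return only the two occurrences outside the `href`. The lookbehind variant will not skip an address when other text, such as a `?subject=` query or a second recipient, sits between `mailto:` and the address. The comment's version tolerates that through `[^>\s]*?`.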
