How to exclude the entire expression and not just the first part with negative look-behind in block text

Question

I'm trying to extract all e-mails from HTML except for e-mails preceded by a mailto: tag.

It works except when someone has a . in their e-mail address, which is fairly common. I've tried look-aheads and different kinds of boundaries, but I can't get the regex to exclude the entire e-mail when it contains a . and is preceded by a mailto: tag.

Regex: (?<!mailto:)(\b[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)

Test string: Maecenas sed diam eget risus fake.name@domain.net varius blandit sit amet non magna. Sed posuere consectetur est at lobortis. Maecenas faucibus <a href="mailto:fake.name@domain.net">fake.name@domain.net</a> mollis interdum.

The first and last matches are fine, but the 2nd match is .name@domain.net when I don't want it to match at all.
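The behaviour can be reproduced with a short script. This is only a sketch and assumes Python's standard `re` module (the language used in the comments; the question itself does not name one). The look-behind is checked only at the position where a match attempt starts, so after the attempt at the `f` of `fake` is rejected, the engine moves on and finds a later start position (the `.`) that is no longer directly preceded by `mailto:`.

```python
import re

# The pattern and test string exactly as given in the question.
PATTERN = re.compile(
    r'(?<!mailto:)(\b[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)'
)

TEXT = (
    'Maecenas sed diam eget risus fake.name@domain.net varius blandit sit amet '
    'non magna. Sed posuere consectetur est at lobortis. Maecenas faucibus '
    '<a href="mailto:fake.name@domain.net">fake.name@domain.net</a> mollis interdum.'
)

# findall returns the contents of the single capture group for each match.
matches = PATTERN.findall(TEXT)
print(matches)
# ['fake.name@domain.net', '.name@domain.net', 'fake.name@domain.net']
```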

See [Regex Pattern to Match, Excluding when… / Except between](https://stackoverflow.com/questions/23589174/regex-pattern-to-match-excluding-when-except-between) — Wiktor Stribiżew, Sep 27 '19 at 11:17
There is a simpler approach: use BeautifulSoup to remove all `href`s with `mailto:` and then run `re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+', remaining_plain_text)` — Wiktor Stribiżew, Sep 27 '19 at 11:22
I have an approach that works by removing the hrefs with mailto but it's not suitable. I need a solution to do this in one regex — metigue, Sep 27 '19 at 13:46
So, you are extracting? But you will still need to write some code. Try `[y for x,y in re.findall(r'(mailto:[^>\s]*?)?\b([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)', text) if not x]` - see [this demo](https://ideone.com/npSuOQ). — Wiktor Stribiżew, Sep 27 '19 at 13:53
If you have the PyPI `regex` module, you can do it with a single call to `regex.findall`, like `regex.findall(r'\b(?<!mailto:[^>\s]*?)\w[\w.-]*@[\w.-]+\.[\w.-]+', text)` — Wiktor Stribiżew, Sep 27 '19 at 14:06
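A minimal sketch of the filtering approach from the comments, using only the standard `re` module. The function names `extract_by_filtering` and `extract_single_regex` are hypothetical. The second function is an assumption of this sketch and was not proposed in the thread: it adds a second fixed-width look-behind so that a match cannot begin in the middle of an address-like token, which is what allowed `.name@domain.net` to slip through. It is checked here only against the test string from the question.

```python
import re

# Character classes taken from the pattern in the question.
EMAIL = r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+'

# Approach from the comments: optionally capture a "mailto:" prefix,
# then keep only the addresses whose prefix group is empty.
WITH_PREFIX = re.compile(r'(mailto:[^>\s]*?)?\b(' + EMAIL + r')')


def extract_by_filtering(text):
    """Return addresses that were not consumed together with 'mailto:'."""
    return [email for prefix, email in WITH_PREFIX.findall(text) if not prefix]


# Assumption (not from the thread): forbid starting right after "mailto:"
# AND forbid starting right after any character an address may contain.
SINGLE = re.compile(r'(?<!mailto:)(?<![a-zA-Z0-9._-])' + EMAIL)


def extract_single_regex(text):
    """Return addresses using one pattern and no post-filtering."""
    return SINGLE.findall(text)


TEXT = (
    'Maecenas sed diam eget risus fake.name@domain.net varius blandit sit amet '
    'non magna. Sed posuere consectetur est at lobortis. Maecenas faucibus '
    '<a href="mailto:fake.name@domain.net">fake.name@domain.net</a> mollis interdum.'
)

print(extract_by_filtering(TEXT))
print(extract_single_regex(TEXT))
```

Both calls should leave out the address inside the `href` attribute. Note that either variant only excludes an address that directly follows `mailto:`.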

0 Answers