Email Regex within Python

Asked Oct 25 '19 at 14:49

Active Oct 25 '19 at 15:10

Viewed 74 times

I'm using the following regex in Python to pull email addresses from text passed in from a BS4 object (an HTML page).

re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",r.text)

The problem I'm running into is that the regex returns extra characters in front of the email address. For example, an email address on the website could be "me@email.com", but if a phone number such as "+441234567890" comes directly before it, the output is "+441234567890me@email.com".

How could I solve this problem?

asked Oct 25 '19 at 14:49

Sam

Do you mean that the emails you want to get always start with a letter? If so: `r"[A-Za-z][A-Za-z0-9._%+-]*@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"` – Wiktor Stribiżew Oct 25 '19 at 14:50
Thanks for the reply, however, since some valid emails will start with a number, I can't exclude this. – Sam Oct 25 '19 at 14:57
Try wrapping the regex in the [`\b` word boundary](https://docs.python.org/3/library/re.html#regular-expression-syntax) – Giacomo Alzetta Oct 25 '19 at 14:59
Two to four characters is really short for a TLD; see the [TLD list](https://www.iana.org/domains/root/db) – Toto Oct 25 '19 at 15:06
@GiacomoAlzetta is correct, you just need a word boundary. Assuming that you're not necessarily trying to extract only valid emails, your regex should be good for 99% of use cases. – MonkeyZeus Oct 25 '19 at 15:09
@GiacomoAlzetta and MonkeyZeus, thank you both for this input. I've surrounded my current regex with the `\b` separator and it is working much better, thanks. – Sam Oct 25 '19 at 15:15
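
For anyone landing here later, the suggestions from the comments can be sketched roughly as follows. This is only an illustration, not something posted in the thread: the names `PLAIN`, `BOUNDED`, `LONG_TLD` and `find_emails` are made up for the example, and the `{2,}` variant is one possible way to act on the TLD-length remark.

```python
import re

# Pattern from the question, unchanged.
PLAIN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
# Same pattern wrapped in \b word boundaries, as suggested in the comments.
BOUNDED = r"\b" + PLAIN + r"\b"
# Assumption: lift the upper limit on the TLD length to allow longer TLDs.
LONG_TLD = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"


def find_emails(text, pattern=BOUNDED):
    """Return every substring of `text` matched by `pattern`."""
    return re.findall(pattern, text)


glued = "+441234567890me@email.com"
print(find_emails(glued, PLAIN))    # the leading "+" is part of the match
print(find_emails(glued, BOUNDED))  # the match now starts at the first digit
print(find_emails("Tel +441234567890 me@email.com"))  # a space separates them
print(find_emails("info@example.technology", LONG_TLD))
```

One thing to keep in mind: `\b` only marks a transition between word and non-word characters, so it trims the leading "+" but cannot split digits that sit directly against the local part. Since addresses may legitimately start with a number, that case probably has to be handled where the text is extracted, for example by keeping whitespace between page elements.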
