
I'm using the following regex in Python to pull email addresses from the text of a BS4 object (an HTML page).

re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",r.text)

The problem I'm running into is that the regex returns extra text in front of the email address. For example, an email address on the website could be "me@email.com", but if a phone number such as "+441234567890" comes directly before it, the output is "+441234567890me@email.com".

How could I solve this problem?

Sam
  • Do you mean that the emails you want to get always start with a letter? If so: `r"[A-Za-z][A-Za-z0-9._%+-]*@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"` – Wiktor Stribiżew Oct 25 '19 at 14:50
  • Thanks for the reply. However, since some valid emails start with a number, I can't exclude those. – Sam Oct 25 '19 at 14:57
  • Try wrapping the regex in the [`\b` word separator](https://docs.python.org/3/library/re.html#regular-expression-syntax) – Giacomo Alzetta Oct 25 '19 at 14:59
  • 2 to 4 characters for the TLD is really short; see the [TLD list](https://www.iana.org/domains/root/db) – Toto Oct 25 '19 at 15:06
  • @GiacomoAlzetta is correct, you just need a word boundary. Assuming that you're not necessarily trying to extract only valid emails, your regex should be good for 99% of use cases. – MonkeyZeus Oct 25 '19 at 15:09
  • @GiacomoAlzetta and MonkeyZeus, thank you both for this input. I've surrounded my current regex with the \b separator and it is working much better, thanks. – Sam Oct 25 '19 at 15:15
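
For reference, the `\b` suggestion from the comments can be sketched roughly as below. This is only an illustration: the names `EMAIL_RE`, `LONG_TLD_RE`, `find_emails` and the sample strings are made up for the example, and the `{2,}` variant is just one possible reading of the TLD comment. Note that `\b` only helps when something separates the phone number from the address; if the two are truly glued together, the leading `+` is dropped but the digits still end up in the match.

```python
import re

# The pattern from the question, wrapped in \b word boundaries
# as suggested in the comments.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b")

# Hypothetical variant without the upper limit on TLD length,
# since many current TLDs are longer than 4 characters.
LONG_TLD_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def find_emails(text, pattern=EMAIL_RE):
    """Return every substring of text that matches the given pattern."""
    return pattern.findall(text)


# Phone number and address separated by other text: only the address matches.
separated = "Call +441234567890 or write to me@email.com today."
print(find_emails(separated))

# Phone number fused directly onto the address: \b cannot split them,
# because there is no boundary between "0" and "m".
fused = "+441234567890me@email.com"
print(find_emails(fused))

# A longer TLD is rejected by {2,4} plus \b, but accepted by {2,}.
long_tld = "me@email.museum"
print(find_emails(long_tld))
print(find_emails(long_tld, LONG_TLD_RE))
```

If the fused case shows up in practice, the cause is probably how the text is extracted from the page rather than the regex itself, so adding a separator between elements at extraction time may be the more robust fix.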

0 Answers