Extract emails from html using regex

Question

I'm trying to extract any jabber accounts (emails) using regex from this page.

I've tried using regex:

\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+

...but it's not producing the desired results.

Welcome to SO! I tweaked some of the wording and added a tag to help improve your chance of getting an answer. You may also want to try adding more specific info about what happens when you run the code that isn't working. Good luck! – Jaydles Mar 05 '15 at 22:11
Have a look at http://www.regular-expressions.info/email.html. Better yet, scroll down to the `The Official Standard: RFC 5322` section and get scared. Regex is not the right tool for this task. – Jason Hu Mar 05 '15 at 22:17
Your question has been asked many times on Stack Overflow. See http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address for my default answer to this. – bmhkim Mar 06 '15 at 00:41

score 5 · Accepted Answer · answered Mar 05 '15 at 21:40

This might work:

[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+

import re

# The ur'' prefix only exists in Python 2; a plain raw string works in Python 3.
p = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+', re.MULTILINE | re.IGNORECASE)
test_str = r'...'
print(re.findall(p, test_str))

See example.
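To make the difference from the pattern in the question concrete, here is a minimal sketch that runs both patterns over a small piece of markup. The `sample_html` string and its addresses are made up for illustration, since the actual page source is not reproduced here.

```python
import re

# Hypothetical sample markup; the real page source is not shown in the question.
sample_html = (
    '<li>Contact: <b>first.last@example.org</b></li>'
    '<li>jid: some-user@jabber.example.net</li>'
)

# Pattern from the question: \w+ cannot cross "." or "-" in the local part.
question_pattern = re.compile(r'\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+')

# Pattern from this answer: any run of characters except whitespace, "@", "<", ">".
answer_pattern = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')

question_matches = question_pattern.findall(sample_html)
answer_matches = answer_pattern.findall(sample_html)

print(question_matches)
print(answer_matches)
```

On this sample the question's pattern returns the truncated `last@example.org` and `user@jabber.example.net`, while the pattern above returns both addresses in full. Whether that is the problem on the real page is an assumption, because the question does not say what went wrong.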

answered Mar 05 '15 at 21:40 by Wiktor Stribiżew

Pretty close, but `.@...` is not a valid address IMHO. In general, the character `.` (dot, period, full stop) is allowed provided that it is not the first or last character and does not appear two or more times consecutively. For matching *email-address-like patterns* your attempt is fine. – dognose Mar 05 '15 at 22:38
@dognose: I did not try to create a *generic* regex, only something that would work in this case. A lot has already been said about email validation regex for Python here: http://stackoverflow.com/questions/8022530/python-check-for-valid-email-address, no need to continue it here IMO. – Wiktor Stribiżew Mar 06 '15 at 08:24
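For anyone who does want the dot rule from the comment above, one possible sketch is to build the local part from dot-separated runs. This is not part of the answer and is nowhere near a full RFC 5322 check; the allowed character set and the name `looks_like_address` are assumptions made for illustration.

```python
import re

# Local part: dot-separated runs of allowed characters, so a dot can never be
# first, last, or doubled. The character set is an illustrative assumption.
LOCAL = r"[^\s@<>.]+(?:\.[^\s@<>.]+)*"
# Domain: the same kind of runs, with at least one dot required.
DOMAIN = r"[^\s@<>.]+(?:\.[^\s@<>.]+)+"
strict_pattern = re.compile(LOCAL + "@" + DOMAIN)


def looks_like_address(candidate):
    """Return True when the whole candidate satisfies the dot rules."""
    return strict_pattern.fullmatch(candidate) is not None


print(looks_like_address("first.last@example.org"))
print(looks_like_address(".@example.org"))
print(looks_like_address("a..b@example.org"))
```

The sketch uses `fullmatch` because the rule only makes sense for a complete candidate string. With `findall` on running text, the same pattern would still pick up the valid tail of something like `a..b@example.org`.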

score 4 · Answer 2 · answered Mar 06 '15 at 00:18

import re

s = '''
...YOUR HTML page source code HERE..........

'''

# Match address-like tokens: local part, "@", domain, and a 2-6 letter TLD
reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
print(reobj.findall(s))

Result

['skypeman@jabbim.cz', 'sonics@creep.im', 'voxis_team@lsd-25.ru', 'voxis_team@lsd-25.ru', 'adhrann@jabbim.cz', 'jabberwocky@jabber.systemli.org']
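The result contains `voxis_team@lsd-25.ru` twice because `findall` reports every occurrence on the page. If only distinct accounts are wanted, one possible follow-up step (not part of the original answer) is to drop repeats while keeping the order of first appearance:

```python
# Addresses as reported in the result above, written as Python 3 string literals.
found = [
    'skypeman@jabbim.cz',
    'sonics@creep.im',
    'voxis_team@lsd-25.ru',
    'voxis_team@lsd-25.ru',
    'adhrann@jabbim.cz',
    'jabberwocky@jabber.systemli.org',
]

# dict keys keep insertion order (Python 3.7+), so this drops repeats
# while keeping the first occurrence of each address.
unique = list(dict.fromkeys(found))
print(unique)
```

A `set` would also remove the duplicate but would not preserve the order in which the addresses appear on the page.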

score 0 · Answer 3 · edited Sep 10 '17 at 09:16

Try this one:

reg_emails=r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'
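Note that the pattern is anchored with `^` and `$`, so as written it only matches when the whole string is a single address. A rough sketch of how it might be adapted for pulling addresses out of markup follows; stripping the anchors is my assumption about the intended use, and `sample_html` is a made-up fragment.

```python
import re

# Pattern exactly as given in this answer, anchored with ^ and $.
reg_emails = r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'

# Assumed adaptation for searching inside text: strip the leading ^ and trailing $.
unanchored = re.compile(reg_emails[1:-1])

# Hypothetical sample markup, not the page from the question.
sample_html = '<td>voxis_team@lsd-25.ru</td>'

# The anchored pattern cannot match because the string starts with "<".
anchored_hit = re.search(reg_emails, sample_html)

# finditer + group(0) returns whole matches; findall would instead return
# tuples of the twelve capture groups.
unanchored_hits = [m.group(0) for m in unanchored.finditer(sample_html)]

print(anchored_hit)
print(unanchored_hits)
```

Here the anchored search finds nothing, while the unanchored version finds `voxis_team@lsd-25.ru`.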

answered Sep 10 '17 at 08:48 by ytldsimage · edited Sep 10 '17 at 09:16 by slfan
