2

I'm trying to extract any jabber accounts (emails) using regex from this page.

I've tried using regex:

\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+

...but it's not producing the desired results.

Jaydles
PythonFun
  • Welcome to SO! I tweaked some of the wording and added a tag to help improve your chance of getting an answer. You may also want to try adding more specific info about what happens when you run the code that isn't working. Good luck! – Jaydles Mar 05 '15 at 22:11
  • have a look at: http://www.regular-expressions.info/email.html. better to scroll down to `The Official Standard: RFC 5322` section and get scared. regex is not a tool for this task. – Jason Hu Mar 05 '15 at 22:17
  • Your question has been asked many times on Stack Overflow. See http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address for my default answer for this.... – bmhkim Mar 06 '15 at 00:41

3 Answers

5

This might work:

[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+

import re

p = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')
test_str = r'...'
print(p.findall(test_str))

See example.
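To make the behaviour concrete, here is a minimal sketch of the same pattern run against a short made-up snippet. The sample text and the addresses in it are invented for illustration and are not taken from the page in the question.

```python
import re

# Same pattern as above: one or more characters that are not whitespace,
# '@', '<' or '>' on each side of the '@', with at least one dot after it.
pattern = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')

# Hypothetical sample text (an assumption, not the original page source).
sample = 'Contact <alice@example.org> or bob.smith@jabber.example.net today'

matches = pattern.findall(sample)
print(matches)  # ['alice@example.org', 'bob.smith@jabber.example.net']
```

Because the character classes are so permissive, punctuation that directly follows an address (a trailing comma, for instance) would be included in the match, so some post-processing may be needed on real pages.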

Wiktor Stribiżew
  • Pretty close, but `.@...` is not a valid address IMHO. In general, the character `.` (dot, period, full stop) is allowed provided that it is not the first or last character and does not appear two or more times consecutively. For matching *email-address-like patterns*, your attempt is fine. – dognose Mar 05 '15 at 22:38
  • @dognose: I did not try to create a *generic* regex, only something that would work in this case. A lot has already been said about email validation regex for Python here: http://stackoverflow.com/questions/8022530/python-check-for-valid-email-address, no need to continue it here IMO. – Wiktor Stribiżew Mar 06 '15 at 08:24
4
import re

s = '''
...YOUR HTML page source code HERE..........

'''

# Case-insensitive match: local part, '@', domain, then a 2-6 letter TLD.
reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
print(reobj.findall(s))

Result

['skypeman@jabbim.cz', 'sonics@creep.im', 'voxis_team@lsd-25.ru', 'voxis_team@lsd-25.ru', 'adhrann@jabbim.cz', 'jabberwocky@jabber.systemli.org']
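The result above contains one address twice, since `findall` returns every occurrence. If only unique accounts are wanted, one possible follow-up step (my suggestion, not part of the answer) is to drop duplicates while keeping the order of first appearance:

```python
# The list reported in the result above.
results = [
    'skypeman@jabbim.cz',
    'sonics@creep.im',
    'voxis_team@lsd-25.ru',
    'voxis_team@lsd-25.ru',
    'adhrann@jabbim.cz',
    'jabberwocky@jabber.systemli.org',
]

# dict.fromkeys keeps the first occurrence of each key and, on
# Python 3.7+, preserves insertion order.
unique_accounts = list(dict.fromkeys(results))
print(unique_accounts)
```

A plain `set(results)` would also remove the duplicate, but it does not preserve the order in which the accounts appeared.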
Aaron
0

Try this one:

reg_emails=r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'
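Note that this pattern is anchored with `^` and `$`, so as written it tests whether an entire string is an address rather than finding addresses inside a larger page. Below is a minimal sketch of the difference, using invented sample strings. Stripping the anchors for extraction is my assumption, not something stated in this answer.

```python
import re

reg_emails = r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'

# Hypothetical inputs, made up for this example.
single = 'user.name@example.org'
page = 'Write to user.name@example.org or admin@example.net for help'

# Anchored: matches only when the whole string is one address.
whole_string_ok = re.match(reg_emails, single) is not None
page_ok = re.search(reg_emails, page) is not None

# Assumption: remove the leading '^' and trailing '$' to scan a page.
unanchored = reg_emails[1:-1]

# The pattern has many capture groups, so findall would return tuples;
# finditer with group(0) yields the full matched text instead.
found = [m.group(0) for m in re.finditer(unanchored, page)]

print(whole_string_ok, page_ok)  # True False
print(found)  # ['user.name@example.org', 'admin@example.net']
```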
slfan