Python regex to capture emails between dashes and ignore emails ending with .jpg etc.

Question

I am trying to improve a regex so that it only matches emails that do not end with ".jpg", and strips "--" from the left and right of each email if present. The example source below is passed in as a string.

<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>

The result should contain: bbb@example.com, ccc@example.com, ddd@example.com. So basically, I want to know whether there is any way to improve this function so that the regex produces emails without "--" and, if possible, to improve the chain of if not email[0].endswith('.png') checks, which will look ugly if I want to add more extensions.

import re

def extract_emails(source):

    regex = re.compile(r'([\w\-\.]{1,100}@(\w[\w\-]+\.)+[\w\-]+)')
    emails = list(set(regex.findall(source.decode("utf8"))))
    all_emails = []
    for email in emails:
        if not email[0].endswith('.png') and not email[0].endswith('.jpg') \
                and not email[0].endswith('.gif') and not email[0].endswith('.rar')\
                and not email[0].endswith('.zip') and not email[0].endswith('.swf'):
            all_emails.append(email[0].lower())

    return list(set(all_emails))

@Epodax mistakenly selected all suggested tags. – Jide Koso Dec 03 '15 at 10:37
Don't use regex, use an HTML parser. – styvane Dec 03 '15 at 10:49

score 2 · Answer 1 · edited May 23 '17 at 12:07

I think top-level domains are few, so you can use alternation:

import re

s="""<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>"""
# Group the TLD alternatives so that | applies only to them, not to the whole pattern.
print(re.findall(r"-*([\w.]{1,100}@\w[\w\-]+\.(?:com|biz|us|bd))-*", s))

['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
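A caveat when using alternation for top-level domains: | binds more loosely than concatenation, so the alternatives need their own group, otherwise each alternative becomes a complete pattern on its own. A minimal sketch of the difference follows; the text and both patterns are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical input, chosen so that stray "biz" and "us" substrings appear.
text = "contact: --bob@example.biz-- or business@example.jpg"

# Ungrouped: the branches are "<email ending in .com>", "biz", "us" and "bd".
ungrouped = re.compile(r"-*([\w.]{1,100}@\w[\w\-]+\.com|biz|us|bd)-*")

# Grouped: only the top-level domain varies.
grouped = re.compile(r"-*([\w.]{1,100}@\w[\w\-]+\.(?:com|biz|us|bd))-*")

print(ungrouped.findall(text))  # ['biz', 'us']
print(grouped.findall(text))    # ['bob@example.biz']
```

The ungrouped pattern happens to give the right answer on the sample HTML only because that input contains no bare "biz", "us" or "bd" substrings.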

Or try a negative lookahead: \w+@\w+\.(?!jpg|png)\w+\.*\w*

s="""<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>"""
print(re.findall(r"\w+@\w+\.(?!jpg|png)\w+\.*\w*", s))

It is very hard to write a single regex that validates every email address. For details on email validation, see "Using a regular expression to validate an email address", which has 69 answers.
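For the second part of the question (the growing chain of endswith checks), str.endswith accepts a tuple of suffixes, so the list of ignored extensions can live in one place. The following is only a sketch of how the question's function might be rewritten for Python 3. It assumes source is already a str rather than bytes, and the names IGNORED_SUFFIXES and EMAIL_RE are invented for this example.

```python
import re

# Hypothetical blocklist; add more extensions here as needed.
IGNORED_SUFFIXES = ('.png', '.jpg', '.gif', '.rar', '.zip', '.swf')

# Surrounding dashes are matched outside the capture group, so findall()
# returns the address without them.
EMAIL_RE = re.compile(r'-*([\w\-.]{1,100}@(?:\w[\w\-]+\.)+\w+)-*')

def extract_emails(source):
    """Return the unique, lowercased addresses found in source (a str)."""
    found = {match.lower() for match in EMAIL_RE.findall(source)}
    # endswith() with a tuple is True if any of the suffixes matches.
    # sorted() is used only to make the result order deterministic.
    return sorted(e for e in found if not e.endswith(IGNORED_SUFFIXES))

SAMPLE = """<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>
</body>
</html>"""

print(extract_emails(SAMPLE))
# ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
```

Lowercasing before the suffix check means that ".JPG" is filtered as well.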

Without trying it, I think it would not pass this: --ddd@example.com – Jide Koso, Dec 03 '15 at 10:41

score 1 · Accepted Answer · answered Dec 03 '15 at 10:41

import re

x="""<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>"""
print(re.findall(r"-*([\w\-\.]{1,100}@(?:\w[\w\-]+\.)+(?!jpg)[\w]+)-*", x))

Output: ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
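The lookahead idea can be extended to several extensions by building the alternation from a list. The sketch below is one possible variant, not the answer's exact pattern: it moves the lookahead to just after the @ so that it inspects the final label of the whole domain. With the lookahead in its original position, backtracking could in principle return a truncated match such as aaa@sub.example for aaa@sub.example.jpg. The names BLOCKED and build_email_regex are hypothetical.

```python
import re

# Hypothetical list of extensions to reject.
BLOCKED = ['jpg', 'png', 'gif', 'rar', 'zip', 'swf']

def build_email_regex(blocked):
    """Compile a pattern that skips addresses whose last label is blocked."""
    alternation = '|'.join(re.escape(ext) for ext in blocked)
    return re.compile(
        r'-*('                       # surrounding dashes stay outside the group
        r'[\w\-.]{1,100}@'           # local part, as in the answer
        # Reject if a blocked label ends the domain, i.e. it is not
        # followed by a word character or by ".<word character>".
        r'(?![\w\-.]*\.(?:' + alternation + r')(?!\.?\w))'
        r'(?:\w[\w\-]+\.)+\w+'       # domain labels
        r')-*',
        re.IGNORECASE,
    )

email_re = build_email_regex(BLOCKED)
print(email_re.findall('<p>aaa@example.jpg</p> <p>--bbb@example.com--</p>'))
# ['bbb@example.com']
```

Under this variant an address such as 222@example.jpg.com is kept, because jpg is not the final label there. Whether that is the desired behaviour depends on the data.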

answered Dec 03 '15 at 10:41

vks

This would fail for 222@example.jpg.com. Clearly there is no perfect regex for email, but the regex by @Uchicha solves the problem. – Jide Koso Dec 03 '15 at 10:59

styvane · Answer 3 · 2015-12-03T10:52:33.360

The best way to do this is to use an HTML parser such as BeautifulSoup:

In [37]: from bs4 import BeautifulSoup

In [38]: soup = BeautifulSoup('''<html>
   ....:    <body>
   ....:    <p>aaa@example.jpg</p>
   ....:    <p>--bbb@example.com--</p>
   ....:    <p>ccc@example.com--</p>
   ....:    <p>--ddd@example.com</p>
   ....:
   ....: </body>
   ....: </html>''', 'lxml')

In [39]: [email.strip('-') for email in soup.stripped_strings if not email.endswith('.jpg')]
Out[39]: ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
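If BeautifulSoup and lxml are not installed, the same idea can be sketched with the standard library's html.parser module. This swaps in a different parser from the one used above. Like the BeautifulSoup version, it assumes that each text node holds at most one address. The names TextCollector, emails_from_html and IGNORED_SUFFIXES are invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical blocklist of unwanted endings.
IGNORED_SUFFIXES = ('.jpg', '.png', '.gif')

class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def emails_from_html(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    cleaned = (chunk.strip('-') for chunk in parser.chunks)
    # The '@' check is an added assumption, used to skip unrelated text nodes.
    return [c for c in cleaned
            if '@' in c and not c.lower().endswith(IGNORED_SUFFIXES)]

SAMPLE = """<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>
</body>
</html>"""

print(emails_from_html(SAMPLE))
# ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
```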
