How to parse emails from mailto URLs in Python

Question

I am trying to parse email addresses from a web page. My code:

            import urllib2,cookielib
            import re

            site= "http://www.traidnt.net/vb/traidnt207743"
            hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Accept-Encoding': 'none',
                    'Accept-Language': 'en-US,en;q=0.8',
                    'Connection': 'keep-alive'}

            req = urllib2.Request(site, headers=hdr)

            page = urllib2.urlopen(req)

            content = page.read()

            links = re.findall('mailto:.+?@.+.', content)

            for link in links:
                print link[7:-1]

and the result comes out like:

email1@
email2@
email3@
...

But I need the complete email addresses. How can I get every email in its full form?

Thank you!

I think what you need is a regular expression that matches email addresses: http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address — Julien Spronck, Mar 25 '15 at 22:01

Hugo Sousa · Accepted Answer · 2015-03-25T22:22:10.957


I just added the following code to your code and it works perfectly:

            regexp = re.compile(r"mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)")

            links = re.findall(regexp, content)

            print links

Output:

['njm-kwt@hotmail.com', 'fnan-ksa@hotmail.com', 'k-w-t7@hotmail.com', 'coool-uae@hotmail.com', 'qsd@hotmail.de', 'o1ooo@hotmail.de', 'm-p-3@hotmail.de', 'ya7oo@hotmail.de', 'g5x@hotmail.de', 'f7t@hotmail.de', 'm2y@hotmail.de', 's2udi@hotmail.de', 'q2tar@hotmail.de', 'kuw2it@hotmail.de', 's2udi@hotmail.fr', 'qxx@hotmail.de', 'y-e-s@hotmail.de', 'y-a@hotmail.de', 'qqj@hotmail.de', 'qjj@hotmail.de', 'admin_vb@hotmail.de', 'eng-vb@hotmail.com', 'a3lantk@hotmail.com', 'a3lnkm@hotmail.com', 't7t@hotmail.de', 'mohamed_fathy41@hotmail.com', 'ox-9@hotmail.com', 'ox-9@hotmail.com']
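
A minimal sketch of why the capture group helps: when the pattern contains exactly one group, `re.findall` returns only the text inside the parentheses, so the `mailto:` prefix never needs to be sliced off. The `sample` markup and its addresses are made up for illustration; in the question's script the input would be `content`. The sketch uses `print()` as a function so that it also runs under Python 3.

```python
import re

# Hypothetical sample markup; in the question this would be page.read().
sample = (
    '<a href="mailto:first-user@example.com">one</a>\n'
    '<a href="mailto:second_user@example.org">two</a>\n'
)

# Same pattern as above, written as a raw string.
pattern = re.compile(
    r"mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)"
)

# With a single capture group, findall returns only the group's text,
# i.e. the address without the "mailto:" prefix.
links = pattern.findall(sample)
print(links)
```

Note that this pattern is case-sensitive in the local part and in the first domain label, so passing `re.IGNORECASE` may be worth considering for real pages.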

edited Mar 25 '15 at 22:22

answered Mar 25 '15 at 22:08

Hugo Sousa



The regex you give isn't sufficient. It will miss hyphenated domain names such as 'foo-bar.com', as an example. Obviously, there is a litany of Stack Overflow answers along these lines; http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address and http://stackoverflow.com/questions/8022530/python-check-for-valid-email-address come to mind (I'd make the regex from the latter `[^@\s]+@[^@\s]+\.[^@\s]+` to exclude white space, but the general point stands) – bmhkim Mar 25 '15 at 22:58
@hugo-sousa @bmhkim It's not working for most websites, for example this one: `http://www.hotm-il.com/vb/showthread.php?t=18249` – yuyb0y Mar 26 '15 at 13:50
Hummm, did you try to remove the "mailto:" ? – Hugo Sousa Mar 26 '15 at 14:24
I just ran it without the `mailto:` and I got this: `['x4r@msn.com', 'x4r@msn.com', 'x4r@msn.com', 'x4r@msn.com',...]` – Hugo Sousa Mar 26 '15 at 14:49
@hugo-sousa So I need more than one pattern: ``mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)``, ``([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)`` and `[\w\.-]+@[\w\.-]+`. Can you give me the best way to catch more emails? I plan to make my program try all of these patterns one by one to catch more emails. – yuyb0y Mar 26 '15 at 15:23
@hugo-sousa I don't know why it's not working. Can you please help me with this website: `http://www.almirkaz.com/index.php?option=com_sobi2&Itemid=264`? I need to get all the emails from it. – yuyb0y Mar 26 '15 at 18:39
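
The plan described in the comments (trying several patterns one after another and keeping every address found) could be sketched roughly as follows. The three patterns are the ones quoted in the thread; the helper name `find_emails` and the sample text are hypothetical, and the sketch assumes Python 3. It also illustrates the hyphenated-domain case raised earlier: only the loosest pattern picks up `info@foo-bar.com`.

```python
import re

# The three patterns quoted in the comments, from strictest to loosest.
PATTERNS = [
    re.compile(r"mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)"),
    re.compile(r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)"),
    re.compile(r"[\w\.-]+@[\w\.-]+"),
]


def find_emails(text):
    """Return unique matches from all patterns, in first-seen order.

    Hypothetical helper: it does no validation, it only merges the
    results of the individual patterns and drops duplicates.
    """
    seen = []
    for pattern in PATTERNS:
        for match in pattern.findall(text):
            if match not in seen:
                seen.append(match)
    return seen


# Made-up input containing one hyphenated domain and one plain address.
text = 'Contact <a href="mailto:info@foo-bar.com">us</a> or sales@example.com today'
emails = find_emails(text)
print(emails)
```

Because the last pattern is very permissive, it can also return strings that are not valid addresses (for example with a trailing dot), so some post-filtering is likely needed on real pages.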

score 0 · Answer 2 · answered May 14 '15 at 19:05


You should use a dedicated library such as:

https://pypi.python.org/pypi/urlinfo

and contribute or open an issue to help make Python better ;)
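
For comparison, a parser-based approach can also be sketched with the standard library alone. This is not the `urlinfo` package linked above (its API is not shown in this thread); it swaps the regex for `html.parser.HTMLParser` plus `urllib.parse.urlsplit`, and assumes Python 3. The names `MailtoCollector` and `extract_mailto_addresses` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class MailtoCollector(HTMLParser):
    """Collect addresses from href="mailto:..." attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "href" or not value:
                continue
            parts = urlsplit(value.strip())
            # urlsplit lowercases the scheme, so "MAILTO:" is handled too.
            if parts.scheme != "mailto":
                continue
            # The path holds the recipient list; "?subject=..." ends up in
            # parts.query and is ignored here.
            for address in unquote(parts.path).split(","):
                address = address.strip()
                if address and address not in self.addresses:
                    self.addresses.append(address)


def extract_mailto_addresses(html_text):
    """Return unique mailto addresses found in href attributes, in page order."""
    collector = MailtoCollector()
    collector.feed(html_text)
    return collector.addresses
```

This only sees addresses that appear in `href="mailto:..."` attributes; addresses written as plain text on the page would still need a regex.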


Vitold S.
