
I am trying to extract email addresses from a web page. My code:

            import urllib2,cookielib
            import re

            site= "http://www.traidnt.net/vb/traidnt207743"
            hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Accept-Encoding': 'none',
                    'Accept-Language': 'en-US,en;q=0.8',
                    'Connection': 'keep-alive'}

            req = urllib2.Request(site, headers=hdr)

            page = urllib2.urlopen(req)

            content = page.read()

            links = re.findall('mailto:.+?@.+.', content)

            for link in links:
                print link[7:-1]

and the result looks like this:

email1@
email2@
email3@
...

But I need the complete email addresses, not just the part before the `@`. How can I get every address in full?

Thank you!

yuyb0y
  • I think what you need is a regular expression that matches email addresses: http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address – Julien Spronck Mar 25 '15 at 22:01

2 Answers

0

I just added the following code to your code and it works perfectly:

regexp = re.compile(("mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)"))

links = re.findall(regexp, content)

print links

Output:

['njm-kwt@hotmail.com', 'fnan-ksa@hotmail.com', 'k-w-t7@hotmail.com', 'coool-uae@hotmail.com', 'qsd@hotmail.de', 'o1ooo@hotmail.de', 'm-p-3@hotmail.de', 'ya7oo@hotmail.de', 'g5x@hotmail.de', 'f7t@hotmail.de', 'm2y@hotmail.de', 's2udi@hotmail.de', 'q2tar@hotmail.de', 'kuw2it@hotmail.de', 's2udi@hotmail.fr', 'qxx@hotmail.de', 'y-e-s@hotmail.de', 'y-a@hotmail.de', 'qqj@hotmail.de', 'qjj@hotmail.de', 'admin_vb@hotmail.de', 'eng-vb@hotmail.com', 'a3lantk@hotmail.com', 'a3lnkm@hotmail.com', 't7t@hotmail.de', 'mohamed_fathy41@hotmail.com', 'ox-9@hotmail.com', 'ox-9@hotmail.com']
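
A self-contained way to check this kind of pattern without fetching the page is sketched below. It is a Python 3 variant, not the exact code above: the sample markup and the function name `extract_mailto_addresses` are made up for illustration, and the pattern additionally assumes that dots in the local part and hyphens in the domain should be accepted and that matching should be case-insensitive. Because the pattern contains one capture group, `re.findall` returns only the address, so no slicing such as `link[7:-1]` is needed.

```python
import re

# Hypothetical sample markup; the addresses are made up for illustration.
SAMPLE_HTML = (
    '<a href="mailto:first.user@example.com">first</a> '
    '<a href="mailto:second-user@mail-host.example.org">second</a>'
)

# The parentheses form a capture group, so re.findall returns only the
# address, without the "mailto:" prefix. Allowing "." in the local part and
# "-" in the domain labels is an assumption of this sketch.
MAILTO_RE = re.compile(
    r"mailto:([a-z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+)",
    re.IGNORECASE,
)


def extract_mailto_addresses(html):
    """Return every address that follows a mailto: prefix, in page order."""
    return MAILTO_RE.findall(html)


print(extract_mailto_addresses(SAMPLE_HTML))
# ['first.user@example.com', 'second-user@mail-host.example.org']
```
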
Hugo Sousa
  • The regex you give isn't sufficient. For example, it will miss hyphenated domain names such as 'foo-bar.com'. There are plenty of Stack Overflow answers along these lines; http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address and http://stackoverflow.com/questions/8022530/python-check-for-valid-email-address come to mind (I'd change the regex from the latter to `[^@\s]+@[^@\s]+\.[^@\s]+` to exclude white space, but the general point stands) – bmhkim Mar 25 '15 at 22:58
  • @hugo-sousa @bmhkim It's not working for most websites, for example this one: `http://www.hotm-il.com/vb/showthread.php?t=18249` – yuyb0y Mar 26 '15 at 13:50
  • Hummm, did you try to remove the "mailto:" ? – Hugo Sousa Mar 26 '15 at 14:24
  • I just ran it without the `mailto:` and I got this: `['x4r@msn.com', 'x4r@msn.com', 'x4r@msn.com', 'x4r@msn.com',...]` – Hugo Sousa Mar 26 '15 at 14:49
  • @hugo-sousa So I need more than one pattern: `mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)` , `([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)` and `[\w\.-]+@[\w\.-]+`. Can you tell me which one catches the most emails? I plan to make my program try all of these patterns one by one to catch more emails. – yuyb0y Mar 26 '15 at 15:23
  • @hugo-sousa I don't know why it's not working. Can you please help me with this website: `http://www.almirkaz.com/index.php?option=com_sobi2&Itemid=264`? I need to get all the emails from it. – yuyb0y Mar 26 '15 at 18:39
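
The idea raised in the comments, trying several patterns one by one and keeping everything they find, could look roughly like the Python 3 sketch below. The sample text, the names `PATTERNS` and `collect_emails`, and the choice to lowercase addresses and strip a trailing dot are all assumptions made for illustration; the two patterns are a `mailto:`-anchored one and the looser `[\w\.-]+@[\w\.-]+` quoted above. It only searches the text it is given, so addresses that a site injects with JavaScript or obfuscates would still be missed.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
SAMPLE_TEXT = (
    'Contact <a href="mailto:sales@example.com">sales</a> or write to '
    "support@example.org for help; sales@example.com is listed twice"
)

# Patterns are tried in order: a mailto:-anchored one first, then the
# looser pattern from the comments that also finds plain-text addresses.
PATTERNS = [
    re.compile(r"mailto:([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
    re.compile(r"[\w\.-]+@[\w\.-]+"),
]


def collect_emails(text, patterns=PATTERNS):
    """Run each pattern over text and return unique addresses in first-seen order."""
    seen = set()
    ordered = []
    for pattern in patterns:
        for match in pattern.findall(text):
            # Drop a trailing sentence dot and normalise case before comparing.
            address = match.rstrip(".").lower()
            if address and address not in seen:
                seen.add(address)
                ordered.append(address)
    return ordered


print(collect_emails(SAMPLE_TEXT))
# ['sales@example.com', 'support@example.org']
```
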
0

You should use a dedicated library such as this one:

https://pypi.python.org/pypi/urlinfo

and contribute or open an issue to help make Python better ;)

Vitold S.