I am using

    links = re.findall(r'\w+://\w+.\w+.\w+\w+\w.+"', page)

to parse the links from a webpage.

Any help will be appreciated. This is what I get from parsing http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html:

        # My current output
        http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/"
        http://www.asecuritysite.com/content/icon_clown.gif" alt="if broken see alex@school.ac.uk +44(0)1314552759" height="100"
        http://www.rottentomatoes.com/m/sleeper/"
        http://www.rottentomatoes.com/m/sleeper/trailer/"
        http://www.rottentomatoes.com/m/star_wars/"
        http://www.rottentomatoes.com/m/star_wars/trailer/"
        http://www.rottentomatoes.com/m/wargames/"
        http://www.rottentomatoes.com/m/wargames/trailer/"
        https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php"> SANS to Offer "Hacking Exposed Live"
        https://www.sans.org/webcasts/archive/2013"

        # What I want to get when I run the module
        http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
        http://www.asecuritysite.com/content/icon_clown.gif
        http://www.rottentomatoes.com/m/sleeper/
        http://www.rottentomatoes.com/m/sleeper/trailer/
        http://www.rottentomatoes.com/m/star_wars/
        http://www.rottentomatoes.com/m/star_wars/trailer/
        http://www.rottentomatoes.com/m/wargames/
        http://www.rottentomatoes.com/m/wargames/trailer/
        https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php
        https://www.sans.org/webcasts/archive/2013
alecxe
Audu Ibrahim
  • duplicate of https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – pigletfly Nov 30 '14 at 04:57

3 Answers


You should not use regular expressions for parsing HTML. There are specialized tools for this called HTML parsers.

Here's an example using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
soup = BeautifulSoup(page.content, 'html.parser')  # specify a parser explicitly

for link in soup.find_all('a', href=True):
    print(link['href'])

Prints:

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
alecxe
\w+://\w+\.\w+\.\w+[^"]+

Try this: the dots are escaped, and [^"]+ matches everything up to (but not including) the closing quote. See the demo:

http://regex101.com/r/hQ9xT1/31
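The corrected pattern can be sanity-checked directly in Python (the HTML fragment below is illustrative, not taken from the actual page):

    import re

    # Illustrative fragment; the real page would be fetched separately
    page = '<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>'

    # Dots escaped; [^"]+ consumes up to, but not including, the closing quote
    links = re.findall(r'\w+://\w+\.\w+\.\w+[^"]+', page)
    print(links)  # ['http://www.rottentomatoes.com/m/sleeper/']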

vks

Through BeautifulSoup CSS selectors:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
>>> soup = BeautifulSoup(page.content, 'html.parser')
>>> for i in soup.select('a[href]'):
        print(i['href'])

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
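CSS attribute selectors can also do some filtering at selection time; for instance, a[href^="http"] keeps only links whose href starts with "http", which skips relative URLs. A minimal sketch with made-up markup:

    from bs4 import BeautifulSoup

    # Made-up markup; the real page would come from requests
    html = '''<a href="http://example.com/page">absolute</a>
    <a href="/local">relative</a>'''

    soup = BeautifulSoup(html, 'html.parser')
    # ^= is the CSS "attribute value starts with" operator
    for a in soup.select('a[href^="http"]'):
        print(a['href'])  # only http://example.com/page is printed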
Avinash Raj