I am using

    links = re.findall(r'\w+://\w+.\w+.\w+\w+\w.+"', page)

to parse the links from a webpage.

Any help will be appreciated. This is what I get from parsing http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html:

        # My current output
        http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/"
        http://www.asecuritysite.com/content/icon_clown.gif" alt="if broken see alex@school.ac.uk +44(0)1314552759" height="100"
        http://www.rottentomatoes.com/m/sleeper/"
        http://www.rottentomatoes.com/m/sleeper/trailer/"
        http://www.rottentomatoes.com/m/star_wars/"
        http://www.rottentomatoes.com/m/star_wars/trailer/"
        http://www.rottentomatoes.com/m/wargames/"
        http://www.rottentomatoes.com/m/wargames/trailer/"
        https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php"> SANS to Offer "Hacking Exposed Live"
        https://www.sans.org/webcasts/archive/2013"

        # What I want to get when I run the module
        http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
        http://www.asecuritysite.com/content/icon_clown.gif
        http://www.rottentomatoes.com/m/sleeper/
        http://www.rottentomatoes.com/m/sleeper/trailer/
        http://www.rottentomatoes.com/m/star_wars/
        http://www.rottentomatoes.com/m/star_wars/trailer/
        http://www.rottentomatoes.com/m/wargames/
        http://www.rottentomatoes.com/m/wargames/trailer/
        https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php
        https://www.sans.org/webcasts/archive/2013
alecxe
Audu Ibrahim
  • duplicate of https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – pigletfly Nov 30 '14 at 04:57

3 Answers


You should not use regular expressions for parsing HTML. There are specialized tools for this called HTML parsers.

Here's an example using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
soup = BeautifulSoup(page.content, 'html.parser')  # specify a parser explicitly

for link in soup.find_all('a', href=True):
    print(link['href'])

Prints:

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
alecxe
\w+://\w+\.\w+\.\w+[^"]+

Try this: the dots are escaped, and [^"]+ matches everything up to (but not including) the closing quote. See the demo:

http://regex101.com/r/hQ9xT1/31
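The corrected pattern can be sanity-checked directly in Python (the HTML fragment below is illustrative, not taken from the actual page):

    import re

    # Illustrative fragment; the real page would be fetched separately
    page = '<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>'

    # Dots escaped; [^"]+ consumes up to, but not including, the closing quote
    links = re.findall(r'\w+://\w+\.\w+\.\w+[^"]+', page)
    print(links)  # ['http://www.rottentomatoes.com/m/sleeper/']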

vks

Through BeautifulSoup CSS selectors:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
>>> soup = BeautifulSoup(page.content, 'html.parser')
>>> for i in soup.select('a[href]'):
        print(i['href'])

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
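CSS attribute selectors can also do some filtering at selection time; for instance, a[href^="http"] keeps only links whose href starts with "http", which skips relative URLs. A minimal sketch with made-up markup:

    from bs4 import BeautifulSoup

    # Made-up markup; the real page would come from requests
    html = '''<a href="http://example.com/page">absolute</a>
    <a href="/local">relative</a>'''

    soup = BeautifulSoup(html, 'html.parser')
    # ^= is the CSS "attribute value starts with" operator
    for a in soup.select('a[href^="http"]'):
        print(a['href'])  # only http://example.com/page is printed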
Avinash Raj