
Sorry, I am new to HTML, so please bear with me if my question is trivial.

I want to build a simple search engine using Python.

For that, I first need to build a crawler that collects linked URLs.

I want to use a regular expression to extract the linked URLs.

I have studied this, but I don't know the exact pattern for a link in HTML.

from urllib import urlopen
import re

webPage = urlopen('http://web.cs.dartmouth.edu/').read()
linkedPage = re.findall(r'what should be filled in here?', webPage)
alecxe
SangminKim

1 Answer


There are tools built specifically for parsing HTML; they are called HTML parsers.

For example, using BeautifulSoup:

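from urllib2 import urlopen
from bs4 import BeautifulSoup

# Parse the downloaded page; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(urlopen('http://web.cs.dartmouth.edu/'), 'html.parser')
# Select every <article> nested inside <div class="view-content">
for article in soup.select('div.view-content article'):
    print article.text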

This prints all of the articles on the page:

Prof Sean Smith receives best paper of 2014 award
...
Lorenzo Torresani wins the Google Faculty Research Award
...
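
Since the question is about collecting linked URLs, here is a minimal sketch of the same parser-based idea applied to links. It is written for Python 3 and uses only the standard library (html.parser and urllib.parse), so it does not depend on BeautifulSoup. The names LinkCollector and extract_links and the sample HTML string are made up for illustration; in a real crawler the HTML would come from the downloaded page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all <a href=...> tags found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup standing in for a downloaded page.
sample = (
    '<p><a href="/people">People</a> '
    '<a href="http://example.com/x">X</a> '
    '<a name="top">no href</a></p>'
)
print(extract_links(sample, "http://web.cs.dartmouth.edu/"))
```

With BeautifulSoup, the equivalent would be to loop over soup.find_all('a', href=True) and read each tag's 'href' attribute.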

Also see the reasons why using regex for parsing HTML should be avoided:

Community
alecxe