Crawling a web page with Python regular expressions

Question

Sorry, I am new to HTML, so please bear with me if my question is trivial.

I want to build a simple search engine using Python.

To do that, I first need to build a crawler that collects linked URLs.

I want to use a regular expression to extract the linked URLs.

I have studied this, but I don't know the exact pattern for a link in HTML.

from urllib import urlopen
import re

webPage = urlopen('http://web.cs.dartmouth.edu/').read()
linkedPage = re.findall(r'what should be filled in here?', webPage)

score 4 · Accepted Answer · edited May 23 '17 at 12:16


There are tools designed specifically for parsing HTML; these are called HTML parsers.

For example, using BeautifulSoup:

from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://web.cs.dartmouth.edu/'))
for article in soup.select('div.view-content article'):
    print article.text

Prints all of the articles on the page:

Prof Sean Smith receives best paper of 2014 award
...
Lorenzo Torresani wins the Google Faculty Research Award
...

Also see the reasons why using regex for parsing HTML should be avoided:

RegEx match open tags except XHTML self-contained tags
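
For the original goal of collecting linked URLs, a parser-based approach might look like the sketch below. It assumes Python 3 and uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies. The names `LinkCollector` and `extract_links` are hypothetical, and the sample markup is made up for illustration rather than taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/news" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample markup, not the content of the real page.
sample = (
    '<p><a href="/news">News</a> '
    '<a class="x" href="http://example.com/a?b=1">A</a> '
    '<a name="top">no href</a></p>'
)
print(extract_links(sample, "http://web.cs.dartmouth.edu/"))
```

To run this against a live page, the HTML string could come from `urllib.request.urlopen(url).read()` decoded to text. With BeautifulSoup, a comparable result should be obtainable by iterating over `soup.find_all('a', href=True)` and reading `a['href']` from each tag.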

answered Aug 29 '14 at 13:57 by alecxe · edited May 23 '17 at 12:16 by Community

So if I want to extract the linked URLs from the web page using BeautifulSoup, how can I do that? – SangminKim Aug 30 '14 at 04:22
