from urllib import urlopen
import re
p = re.compile('<h2><a .*?><a .*? href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)

joaquin

user759691
- Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – May 18 '11 at 17:39
- Your regex doesn't match – Kai May 18 '11 at 17:41
- Because parsing HTML using regular expressions is broken by design. It is not worth spending even a minute checking your regex; just don't do it. Why? Because you will always get it wrong. Use BeautifulSoup or another HTML parser. – May 18 '11 at 17:38
1 Answer
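Your regex isn't what you want: it requires two consecutive <a ...> tags right after <h2>, which apparently isn't what the page contains. Try this instead, which expects a single anchor: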
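from urllib import urlopen  # Python 2; in Python 3 this lives in urllib.request
import re

# Match a single <a ...> directly after <h2>, capturing its href and link text.
p = re.compile(r'<h2><a\s.*?href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
print text  # dump the raw HTML to see what the pattern has to match
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)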
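To see the difference between the two patterns without fetching the page, here is a small offline comparison. It is only a sketch: the `sample` line is made-up markup, not a copy of the python.org page, and it is written for Python 3 (hence `print()` as a function).

```python
import re

# Pattern from the question: expects two consecutive <a ...> tags after <h2>.
old_pattern = re.compile('<h2><a .*?><a .*? href="(.*?)">(.*?)</a>')
# Pattern from this answer: expects a single <a ...> tag after <h2>.
new_pattern = re.compile(r'<h2><a\s.*?href="(.*?)">(.*?)</a>')

# Hypothetical heading with one anchor; the real page may differ.
sample = '<h2><a class="reference" href="http://example.com/job">Python Developer</a></h2>'

old_matches = old_pattern.findall(sample)
new_matches = new_pattern.findall(sample)

print(old_matches)  # no second <a ...> tag in the sample, so nothing is found
print(new_matches)  # one (url, name) pair
```

If the markup did contain two nested anchors, the old pattern would match and the result would look different, which is one way a scraper can work at first and stop working after a page redesign.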
Also, your way of going about this is probably not the best idea. That said, I'm answering the question as asked. :)
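For comparison, here is a minimal sketch of the parser-based approach. It uses Python 3's standard-library `html.parser` rather than BeautifulSoup, so it runs without extra installs. The class name `H2LinkParser` and the `SAMPLE` markup are assumptions made up for illustration; the actual page structure may well differ.

```python
from html.parser import HTMLParser


class H2LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_h2 = False    # True while inside an <h2> element
        self._href = None      # href of the <a> currently being read
        self._text = []        # text fragments of that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_h2 = True
        elif tag == 'a' and self._in_h2:
            href = dict(attrs).get('href')
            if href is not None:  # skip anchors without an href
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None
        elif tag == 'h2':
            self._in_h2 = False


# Hypothetical markup; the real page may be structured differently.
SAMPLE = '''
<h2><a name="job-1"></a><a class="reference" href="http://example.com/job1">Python Developer</a></h2>
<p><a href="http://example.com/ignored">not inside a heading</a></p>
<h2><a href="http://example.com/job2">Data <em>Engineer</em></a></h2>
'''

parser = H2LinkParser()
parser.feed(SAMPLE)
parser.close()
for url, name in parser.links:
    print('%s (%s)' % (name, url))
```

Because the parser tracks tags rather than character patterns, it handles both the single-anchor and the double-anchor heading in the sample, and it keeps working if attributes are added or reordered.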

Luke Sneeringer
- I'm inclined to agree that regular expressions probably aren't the best way to solve this problem, although if using a known resource, it's probably not the end of the world. Besides, I don't know what implementation restrictions he might have. Crawling markup is, in general, a problem-ridden science. – Luke Sneeringer May 18 '11 at 17:47
- Heck yeah, that worked! Thanks. This was code from a Python book. The book uses Beautiful Soup and the code works using that. But I was just curious why that particular code once worked and now doesn't. Thanks, I appreciate it! :) – user759691 May 18 '11 at 17:52