Why doesn't this simple Python program print out?

Question

from urllib import urlopen
import re
p = re.compile('<h2><a .*?><a .*? href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — , May 18 '11 at 17:39
Because parsing HTML using regular expressions is broken by design. It is not worth spending even a minute checking your regex; just don't do it. Why? Because you will always get it wrong. Use BeautifulSoup or another HTML parser. – , May 18 '11 at 17:38

score 3 · Accepted Answer · answered May 18 '11 at 17:42


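Your regex isn't what you want: it requires a second <a ...> tag immediately after the first one, which the page's markup apparently no longer has. Try this instead: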

from urllib import urlopen
import re
p = re.compile(r'<h2><a\s.*?href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
print text
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)
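
To see the difference between the two patterns without fetching the page, here is a minimal sketch in Python 3 syntax. The `sample` string is a hypothetical heading, assumed to resemble the page's current markup. It is not copied from python.org.

```python
import re

# Hypothetical markup: one heading with a single link (an assumption,
# not the real python.org source).
sample = '<h2><a class="reference" href="http://example.com/job">Python Developer</a></h2>'

# Pattern from the question: needs a second "<a " right after the first tag closes.
old = re.compile(r'<h2><a .*?><a .*? href="(.*?)">(.*?)</a>')
# Pattern from this answer: a single <a> tag with an href is enough.
new = re.compile(r'<h2><a\s.*?href="(.*?)">(.*?)</a>')

old_matches = old.findall(sample)
new_matches = new.findall(sample)

print(old_matches)  # no second <a> tag in the sample, so nothing matches
print(new_matches)  # one (url, name) pair
```

If the real markup differs from this sample, the results will differ too, which is the fragility the comments on the question point out.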

Also, your way of going about this is probably not the best idea. That said, I'm answering the question as asked. :)
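
For comparison, here is a rough sketch of the parser-based route suggested in the comments. It uses the standard library's `html.parser` in Python 3 rather than Beautiful Soup, purely so it runs without extra installs. The class name `HeadingLinkParser` and the `sample` markup are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Collect (url, name) pairs for links nested inside <h2> headings."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False        # currently inside an <h2> element
        self.current_href = None  # href of the link being read, if any
        self.text_parts = []      # text fragments of the current link
        self.links = []           # collected (url, name) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.in_h2 = True
        elif tag == 'a' and self.in_h2:
            href = dict(attrs).get('href')
            if href is not None:  # skip anchors that carry no href
                self.current_href = href
                self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            name = ''.join(self.text_parts).strip()
            self.links.append((self.current_href, name))
            self.current_href = None
        elif tag == 'h2':
            self.in_h2 = False
            self.current_href = None


# Hypothetical markup; in real use this would be the downloaded page text.
sample = (
    '<h2><a name="x"></a><a class="reference" href="http://example.com/job">'
    'Python Developer</a></h2><p><a href="/other">not a heading</a></p>'
)

parser = HeadingLinkParser()
parser.feed(sample)
parser.close()

for url, name in parser.links:
    print('%s (%s)' % (name, url))
```

Because the parser tracks tags rather than exact character sequences, it handles the sample the same way whether or not an extra anchor tag precedes the link.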

answered May 18 '11 at 17:42

Luke Sneeringer


I'm inclined to agree that regular expressions probably aren't the best way to solve this problem, although if using a known resource, it's probably not the end of the world. Besides, I don't know what implementation restrictions he might have. Crawling markup is, in general, a problem-ridden science. – Luke Sneeringer May 18 '11 at 17:47
Heck yeah, that worked! Thanks. This was code from a Python book. The book uses Beautiful Soup and the code works using that. But I was just curious why that particular code once worked and now doesn't. Thanks, I appreciate it! :) – user759691 May 18 '11 at 17:52
