from urllib import urlopen
import re
p = re.compile('<h2><a .*?><a .*? href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)
joaquin
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  May 18 '11 at 17:39
  • Your regex doesn't match – Kai May 18 '11 at 17:41
  • Because parsing HTML using regular expressions is broken by design. It's not worth spending even a minute checking your regex: just don't do it. Why? Because you will always get it wrong. Use BeautifulSoup or another HTML parser. –  May 18 '11 at 17:38

1 Answer


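Your regex isn't what you want: it requires two consecutive `<a ...>` tags right after the `<h2>`, which apparently doesn't match the page's markup. Try this instead: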

from urllib import urlopen
import re
p = re.compile(r'<h2><a\s.*?href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
print text
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)
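
If you want to see the difference without hitting the network, here's a minimal sketch that runs both patterns against a made-up snippet. Note that it's written for Python 3 (unlike the code above), and the `sample` markup is only my assumption about what a job heading might look like, not a copy of the real page:

```python
import re

# Hypothetical markup: a single <a> inside each <h2>
# (assumed for illustration, not copied from python.org).
sample = (
    '<h2><a class="reference" href="http://example.com/job1">Python Developer</a></h2>\n'
    '<h2><a class="reference" href="http://example.com/job2">Data Engineer</a></h2>\n'
)

# Pattern from the question: needs two consecutive <a ...> tags after <h2>.
old = re.compile('<h2><a .*?><a .*? href="(.*?)">(.*?)</a>')
# Pattern from this answer: a single <a ...> tag carrying the href.
new = re.compile(r'<h2><a\s.*?href="(.*?)">(.*?)</a>')

old_matches = old.findall(sample)   # finds nothing on this sample
new_matches = new.findall(sample)   # finds one (url, name) pair per heading

for url, name in new_matches:
    print('%s (%s)' % (name, url))
```

On this sample the old pattern comes back empty because there is never a second `<a ` directly after the first tag closes, while the new one picks up both headings.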

Also, your way of going about this is probably not the best idea. That said, I'm answering the question as asked. :)
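
For what it's worth, here's roughly what a parser-based version could look like. This is only a sketch under a few assumptions: it's Python 3, it uses the standard library's `html.parser` rather than Beautiful Soup (which the comments recommend), the class name `HeadingLinkParser` is something I made up, and the `sample` markup is invented rather than taken from the real page:

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):  # hypothetical helper, not a library class
    """Collect (name, url) for every <a href=...> found inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.href = None   # href of the <a> currently open, if it has one
        self.text = []     # text pieces seen inside that <a>
        self.links = []    # collected (name, url) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.in_h2 = True
        elif tag == 'a' and self.in_h2:
            self.href = dict(attrs).get('href')
            self.text = []

    def handle_data(self, data):
        # Only keep text while we're inside a link that has an href.
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.href is not None:
            self.links.append((''.join(self.text).strip(), self.href))
            self.href = None
        elif tag == 'h2':
            self.in_h2 = False


# Made-up markup with an extra anchor and nested tag, to show those don't matter.
sample = (
    '<h2><a name="x"></a>'
    '<a class="reference" href="http://example.com/job1">Python <b>Developer</b></a></h2>'
)

parser = HeadingLinkParser()
parser.feed(sample)
links = parser.links

for name, url in links:
    print('%s (%s)' % (name, url))
```

The point is that extra anchors, attribute order, or nested tags inside the link don't break it the way they break a regex.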

Luke Sneeringer
  • I'm inclined to agree that regular expressions probably aren't the best way to solve this problem, although if using a known resource, it's probably not the end of the world. Besides, I don't know what implementation restrictions he might have. Crawling markup is, in general, a problem-ridden science. – Luke Sneeringer May 18 '11 at 17:47
  • Heck yeah, that worked! Thanks. This was code from a Python book. The book uses Beautiful Soup and the code works using that. But I was just curious why that particular code once worked and now doesn't. Thanks, I appreciate it! :) – user759691 May 18 '11 at 17:52