
I am a young grasshopper in need of your help. I've done a lot of research and can't seem to find the solution. I've written the code below, but when it runs it doesn't pull any of the titles. I believe my regular expressions are correct, so I'm not sure what the problem is. It's probably obvious to a seasoned sensei. Thanks in advance.

from urllib import urlopen
import re

url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()

'''
a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title a>
'''

A = 'a href.*pdf">(expression to pull everything) a>'
B = re.compile(A)
C = re.findall(B, url)
print C
    Don't try to parse HTML with a regular expression. Read http://stackoverflow.com/a/1732454/110707 – Wooble Jan 03 '13 at 18:52

2 Answers


This comes up pretty often here on SO. Rather than using regular expressions, you should use an HTML parser that lets you search and traverse the document tree.

I would use BeautifulSoup:

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."

>>> from bs4 import BeautifulSoup
>>> html = ? # insert your raw HTML here
>>> soup = BeautifulSoup(html)
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print anchor.contents
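To narrow that down to the poll PDF links from your page, here's a minimal sketch (assuming, as in your snippet, that the hrefs you want end in .pdf):

>>> import re
>>> from urllib import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()
>>> soup = BeautifulSoup(html)
>>> # find_all accepts a compiled regex for attribute values,
>>> # so this keeps only the anchors whose href ends in .pdf
>>> pdf_links = soup.find_all("a", href=re.compile(r"\.pdf$"))
>>> for anchor in pdf_links:
...     print anchor.get_text()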

I'll echo the other comment about not using RegEx for parsing HTML, but sometimes it is quick and easy. It looks like the HTML in your example is not quite correct, but I'd try something like:

re.findall(r'href.*?pdf">(.+?)</a>', url)
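For completeness, a hedged sketch of how that line might slot into your original script (reusing the url variable from your question, which holds the page's raw HTML):

from urllib import urlopen
import re

# url holds the raw HTML of the page, as in the question's code
url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()

# non-greedy capture of the anchor text between the pdf href and the closing </a>
titles = re.findall(r'href.*?pdf">(.+?)</a>', url)
print titles

Whether this actually matches depends on the real markup of the page, which is why the parser-based approach above is the safer bet.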