
I want to capture the value of an HTML attribute with a Python regular expression. Currently I use:

re.compile( r'=(["\'].*?["\'])', re.IGNORECASE | re.DOTALL )

My problem is that I want the regular expression to "remember" whether the attribute started with a single or a double quote.

I found the bug in my current approach with the following attribute:

href="javascript:foo('bar')"

My regex captures:

"javascript:foo('
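
A minimal reproduction of the problem, using the pattern and attribute quoted above (the names `pattern` and `captured` are only for illustration). The lazy `.*?` stops at the first quote of either kind, so the single quote inside the value ends the match early:

```python
import re

# The original pattern: any quote may open and any quote may close the value.
pattern = re.compile(r'=(["\'].*?["\'])', re.IGNORECASE | re.DOTALL)

match = pattern.search('''href="javascript:foo('bar')"''')
captured = match.group(1)
print(captured)  # "javascript:foo('
```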
elewinso
    This is precisely why you don't parse HTML with regex. There are just too many corner cases. Grab yourself a copy of [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) and just do it the right way. I guarantee that it will be easier (seriously). – Blender Nov 01 '12 at 09:29

2 Answers

3

You can capture the first quote and then use a backreference:

r'=((["\']).*?\2)'
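
As a rough sketch of how this behaves on the attribute from the question (the names `ATTR_VALUE`, `sample` and `values` are made up for the example, and the second attribute is added only to show the single-quoted case):

```python
import re

# Group 1 is the whole quoted value, group 2 is the opening quote,
# and \2 requires the same quote character to close the value.
ATTR_VALUE = re.compile(r'=((["\']).*?\2)', re.DOTALL)

sample = '''<a href="javascript:foo('bar')" title='say "hi"'>link</a>'''
values = [m.group(1) for m in ATTR_VALUE.finditer(sample)]
print(values)
```

Each captured value still includes its surrounding quotes, as in the original pattern.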

However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser instead.
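
For illustration only, here is a sketch using the standard library's `html.parser`. Note that this is an event-based parser rather than a full DOM parser, so it is a stand-in for the idea, not the specific tool recommended here. `AttrCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, name, value) for every attribute."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            self.attrs.append((tag, name, value))


collector = AttrCollector()
collector.feed('''<a href="javascript:foo('bar')">link</a>''')
print(collector.attrs)
```

The parser takes care of the quoting rules, so the nested single quotes in the value need no special handling.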

Martin Ender
    I just wanted to highlight the "don't do it" in the answer. *Don't do it*. Every time someone tries to parse HTML (or worse: XML) with a regular expression some SO users kill a puppy! – Martin Thurau Nov 01 '12 at 09:28
1

The following would be more efficient in theory:

regex = r'"[^"]*"|\'[^\']*\''
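
A small sketch of this alternation in use (`QUOTED`, `sample` and `quoted` are illustrative names, and the single quote inside the character class is escaped so the Python literal is valid). Each branch uses a negated character class, so the engine never has to test for the closing quote character by character the way a lazy `.*?` does:

```python
import re

# Either a double-quoted run with no double quotes inside,
# or a single-quoted run with no single quotes inside.
QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')

sample = """href="javascript:foo('bar')" title='say "hi"' rel=x"""
quoted = QUOTED.findall(sample)
print(quoted)
```

Unlike the backreference version, this pattern does not anchor on `=`, so it finds any quoted string in the input.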

For reference, here's Jeffrey Friedl's expression for HTML tags (from the owl book, *Mastering Regular Expressions*):

<              # Opening "<"
  (            #    Any amount of . . . 
     "[^"]*"   #      double-quoted string,
     |         #      or . . . 
     '[^']*'   #      single-quoted string,
     |         #      or . . . 
     [^'">]    #      "other stuff"
  )*           #
>              # Closing ">"
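
One way this could be tried from Python is with `re.VERBOSE`, which lets the layout and comments stay in the pattern. This is a sketch, not Friedl's own code: the group is made non-capturing here so that `findall` returns whole tags, and `TAG`, `sample` and `tags` are illustrative names.

```python
import re

TAG = re.compile(r"""
    <                # opening angle bracket
    (?:              # any number of:
        "[^"]*"      #   a double-quoted string,
      | '[^']*'      #   or a single-quoted string,
      | [^'">]       #   or any other single character
    )*
    >                # closing angle bracket
""", re.VERBOSE)

# The first attribute value contains a ">" that must not end the tag.
sample = '''<a href="x>y" onclick='go("z")'>text</a>'''
tags = TAG.findall(sample)
print(tags)
```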
georg