regular expression match starting clause with end

Question

I want to be able to capture the value of an HTML attribute with a Python regexp. Currently I use:

re.compile( r'=(["\'].*?["\'])', re.IGNORECASE | re.DOTALL )

My problem is that I want the regular expression to "remember" whether the attribute started with a single or a double quote.

I found the bug in my current approach with the following attribute:

href="javascript:foo('bar')"

My regex captures only:

"javascript:foo('

This is precisely why you don't parse HTML with regex. There are just too many corner cases. Grab yourself a copy of [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) and just do it the right way. I guarantee that it will be easier (seriously). — Blender, Nov 01 '12 at 09:29

score 3 · Accepted Answer · edited May 23 '17 at 12:21

You can capture the first quote and then use a backreference:

r'=((["\']).*?\2)'
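A quick check of that pattern against the attribute from the question, as a minimal sketch. The variable names `naive`, `backref` and `attr` are only illustrative, and `re.DOTALL` is carried over from the question's flags:

```python
import re

# Pattern from the question: either quote type may close the value.
naive = re.compile(r'=(["\'].*?["\'])', re.DOTALL)
# Pattern from this answer: \2 must repeat the opening quote captured by group 2.
backref = re.compile(r'=((["\']).*?\2)', re.DOTALL)

attr = '''href="javascript:foo('bar')"'''

print(naive.search(attr).group(1))    # "javascript:foo('
print(backref.search(attr).group(1))  # "javascript:foo('bar')"
```

Group 1 still includes the surrounding quotes, as in the question's original pattern; group 2 holds only the opening quote character.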

However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser instead.
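As a rough illustration of the parser route, here is a sketch using `html.parser` from the Python standard library. Note that this is an event-based parser rather than a DOM parser, chosen here only because it needs no extra install; the class name `AttrCollector` and the sample markup are made up for the example:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects (tag, attribute, value) triples from start tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # stripped the surrounding quotes, whichever kind was used.
        for name, value in attrs:
            self.found.append((tag, name, value))


parser = AttrCollector()
parser.feed('''<a href="javascript:foo('bar')">link</a>''')
print(parser.found)  # [('a', 'href', "javascript:foo('bar')")]
```

The quoting question disappears entirely here, since the parser tracks the opening quote itself.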

answered Nov 01 '12 at 09:25

Martin Ender

I just wanted to highlight the "don't do it" in the answer. *Don't do it*. Every time someone tries to parse HTML (or worse: XML) with a regular expression, some SO users kill a puppy! – Martin Thurau Nov 01 '12 at 09:28

georg · Answer 2 · 2012-11-01T10:07:24.057

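The following would be more efficient in theory, because each negated character class runs straight to its own closing quote, with no lazy quantifier or backreference to test at every character: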

regex = r'"[^"]*"|\'[^\']*\''
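A small sketch of how this alternation behaves; the names `quoted` and `sample` and the sample markup are invented for illustration. In a single-quoted Python literal, every single quote in the pattern needs a backslash, including the one inside the character class:

```python
import re

# Double-quoted string OR single-quoted string; each branch can only
# end at the same kind of quote that opened it.
quoted = re.compile(r'"[^"]*"|\'[^\']*\'')

sample = '''<a href="javascript:foo('bar')" title='say "hi"'>'''
print(quoted.findall(sample))
# ['"javascript:foo(\'bar\')"', '\'say "hi"\'']
```

Unlike the question's pattern, this one does not anchor on `=`, so it finds any quoted string in the input, and the matches include their quotes.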

For reference, here's Jeffrey Friedl's expression for HTML tags (from the owl book, Mastering Regular Expressions):

<              # Opening "<"
  (            #    Any amount of . . . 
     "[^"]*"   #      double-quoted string,
     |         #      or . . . 
     '[^']*'   #      single-quoted string,
     |         #      or . . . 
     [^'">]    #      "other stuff"
  )*           #
>              # Closing ">"
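In Python, the commented layout above can be kept more or less as written by compiling with `re.VERBOSE`. The following is a sketch only: the group is changed to a non-capturing `(?:...)` so that `findall` returns whole tags, and the names `tag` and `html` plus the sample markup are assumptions for the example:

```python
import re

tag = re.compile(r"""
    <              # opening "<"
    (?:            #   any amount of ...
       "[^"]*"     #     double-quoted string,
     | '[^']*'     #     or single-quoted string,
     | [^'">]      #     or any other character
    )*
    >              # closing ">"
""", re.VERBOSE)

html = '''<p>Click <a href="x>y" title='a "b"'>here</a></p>'''
print(tag.findall(html))
# ['<p>', '<a href="x>y" title=\'a "b"\'>', '</a>', '</p>']
```

The `>` inside `href="x>y"` does not end the tag, because the double-quoted branch consumes the whole attribute value first.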
