Python Regular Expression: BackReference

Question

Here is the Python 2.5 code, which replaces the word fox with a link, <a href="/fox">fox</a>, while avoiding replacements inside an existing link:

import re

content="""
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Fox'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See &quot;Dog chase Fox&quot; image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""

p=re.compile(r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))',re.IGNORECASE|re.MULTILINE)
print p.findall(content)

for match in p.finditer(content):
  print match.groups()

output=p.sub(r'<a href="/fox">\3</a>',content)
print output

The output is:

[('', '', '', 'fox', '', '.', ''), ('', '', '', 'Fox', '', '', '')]
('', '', None, 'fox', '', '.', '')
('', '', None, 'Fox', None, None, None)

Traceback (most recent call last):
  File "C:/example.py", line 18, in <module>
    output=p.sub(r'<a href="fox">\3</a>',content)
  File "C:\Python25\lib\re.py", line 274, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
error: unmatched group

I am not sure why the backreference \3 won't work.
The pattern (?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>)) itself works (see http://regexr.com?317bn), which is surprising. The first negative lookahead, (?!((<.*?)|(<a.*?))), puzzles me: in my opinion, it is not supposed to work. Take the first match it finds, the fox in "gave chase to the fox.</p>". Earlier in that paragraph there is <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, which matches ((<.*?)|(<a.*?)), so as a negative lookahead it should return FALSE. I am not sure whether I have expressed myself clearly.

Thanks a lot!

(Note: I hate using BeautifulSoup. I enjoy writing my own regular expressions. I know many people here will say regular expressions are not for HTML processing, blah blah. But this is a small program, so I prefer regular expressions over BeautifulSoup.)

*Why* are you doing this with regex? [Python has such a nice HTML parser.](http://www.crummy.com/software/BeautifulSoup/) Hint: HTML cannot be parsed with regular expressions. You waste your time trying. — Tomalak, Jun 10 '12 at 12:17
@Tomalak I added a note to the question. This is a tiny program, so it isn't worth the effort of importing and learning BeautifulSoup. I enjoy regular expressions. — Susan Mayer, Jun 10 '12 at 12:26
Susan, it does not matter in the least if you enjoy regular expressions. This is like playing Golf with a hammer and saying *"I don't have the time to learn what's a 9-iron. I enjoy wielding a hammer."* It is completely the wrong tool for the job. The time it took you to a) try and fail to figure out a regex that works and b) write a question here would have been better invested learning BeautifulSoup. It's not that it would be *hard* or anything. You'd very probably be done already. — Tomalak, Jun 10 '12 at 12:31
Tomalak is spot on... for a tongue-in-cheek discussion that is quite relevant, see [**this answer**](http://stackoverflow.com/a/1732454/667301) — Mike Pennington, Jun 10 '12 at 12:35
Disclaimer: I enjoy regular expressions, too. [I really do.](http://stackoverflow.com/badges/133/regex?userid=18771) Take my word for it that you are wasting your time. *"I hate using the right tool for the job"* is not a valid argument when asking for help. For a programmer, doubly so. — Tomalak, Jun 10 '12 at 12:37
@Tomalak: the statement "HTML cannot be parsed with regular expressions" is vague in many ways. At least, define your terms. What do you mean by "parse"? What do you mean by "regular expressions"? (hint: re's in programming languages are not "regular language"), etc... — georg, Jun 10 '12 at 12:47
@Tomalak Thanks a lot! But I still want to know why the backreference won't work here and why the first negative lookahead works. — Susan Mayer, Jun 10 '12 at 12:49
@thg435 The statement is pretty solid. I'm not sure what you mean by "define regular expressions"? How, exactly, is this term a matter of definition? My point is that dissecting a string of HTML into its meaningful parts (i.e. "parsing") is impossible with regular expressions because REs can match *regular* languages. HTML is not a regular language. It's a [context-free language](http://stackoverflow.com/a/5207677/18771), which is a language that REs are not fit to parse. — Tomalak, Jun 10 '12 at 12:51
@Tomalak: modern RE engines can match regular languages and [much more](http://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages) — georg, Jun 10 '12 at 12:53
@thg435 Does the regex engine of Python? Besides, even Perl 6's regular expressions are unfit to parse HTML without errors. And besides *all that*, where is the point in not using a parser when you have one available? Everybody can use regex to handle HTML as much as they want, they just should be able to handle regex on a level that allows them not to tell anyone about it. — Tomalak, Jun 10 '12 at 12:54
@Tomalak: Well, point one is that curiosity drives progress. Point two is that most "parsers", including BS, actually do use regular expressions to "parse html", whatever that means. — georg, Jun 10 '12 at 13:09
I agree with thg435. I believe Beautifulsoup uses regular expression heavily. — Susan Mayer, Jun 10 '12 at 13:12
@thg435 Wow, I didn't notice we're on the "whatever that means" level already. What a shady way to lead an argument. Let's just hope you're a little ashamed of yourself for pulling such a trick. Anyway. *Of course* BeautifulSoup uses regular expressions. It's the *other code* it contains that makes the difference. If HTML could be parsed with regex alone, HTML parsers would not exist, or would they? I can't believe I must explain this. — Tomalak, Jun 10 '12 at 13:16
@Tomalak: there's no trick. In order to have a senseless argument, you have to define terms. If by "parsing html" you mean _tokenizing_ (and one of your previous comments makes me think so), then RE's is actually a suitable and even preferred tool. If you mean something else, please explain yourself. — georg, Jun 10 '12 at 13:22
@Tomalak Instead of arguing RE vs. BS, could you please take a look at my questions? The result is here http://regexr.com/?317bn since you have a badge for regex. — Susan Mayer, Jun 10 '12 at 13:23
@thg435 I mean correctly recognizing the individual parts of an HTML string. In order to make changes to them and save them back without breaking the markup. In order to not replace the word `"fox"` in comments. Or attribute values. Undisturbed by invalidly nested markup. I don't mean "tokenizing", which is a small part of a parsing process. I did not know that "parsing" had multiple meanings in CS. Sorry, but "explain your terms" might be an acceptable move in an area that's less well-discussed. Here, it's not. — Tomalak, Jun 10 '12 at 13:32
@Susan I'm sorry this got out-of-hand a little. I understand your intentions and I'm sure you understand mine as well. Although it looks like I'm just being cocky about it, I'm actually trying to help you. I'm the guy that tries to hand you a 9-iron while you keep asking why you can't seem to go 100 yards with a hammer. I probably could fix your regex. You could also just use a parser. A valid argument *against* doing that has yet to be made. — Tomalak, Jun 10 '12 at 13:39
@Tomalak: this is where you should have started - just answer the question (if you can). Isn't it a shame for Stackoverflow when a poster gets 100 tons of bullshit for one meaningful answer? — georg, Jun 10 '12 at 14:12
@thg435 I challenge you to point out the bullshit. Not every question deserves the response the OP had in mind. This does not automatically make the response bullshit. I can answer the question. I won't. Because it would be the Wrong Thing to do. That's a good reason. The OP could use a parser. She won't. No reason given. That's an indefensible position. — Tomalak, Jun 10 '12 at 14:19
@Tomalak: this is a programming board. Programming is about solving problems under given constraints, no matter how absurd they appear at the first glance. If you prefer debating about Right and Wrong Things, the philosophy class is right down the hall. — georg, Jun 10 '12 at 14:28
@thg435 "But I *want* to use a hammer!" - "Sure thing, let me help you with this. This is a Golf training course, after all. We're here to teach you Golf, not philosophy." - Uhm, sorry, doesn't work. If I sense that there's an advanced but seemingly absurd question, I'm happy to help. I go out of my way, spending hours, even days thinking about it. "How can I make my regex work with HTML?" generally does not fall into the advanced questions category. — Tomalak, Jun 10 '12 at 14:34

score 3 · Answer 1 · answered Jun 10 '12 at 13:03

If you don't like BeautifulSoup, try one of these other (X)HTML parsers:

html5lib
ElementTree
lxml

If you ever plan to, or need to, parse HTML (or a variant), it is worth learning these tools.
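
As a rough illustration of the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes Python 3 and relies on the question's sample being well-formed XML; real-world HTML usually is not, in which case html5lib or lxml would be the safer choice. The helper names `wrap_matches` and `linkify` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Capturing group so that re.split() keeps the matched word in its result.
FOX = re.compile(r'(fox)', re.IGNORECASE)

content = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Fox'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See &quot;Dog chase Fox&quot; image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""

def wrap_matches(parent, index, text):
    """Turn each 'fox' in text into an <a> child of parent, starting at index.

    Returns the text that precedes the first inserted link.
    """
    if not text:
        return text
    pieces = FOX.split(text)  # [lead, word, rest, word, rest, ...]
    for offset, k in enumerate(range(1, len(pieces), 2)):
        link = ET.Element('a', {'href': '/fox'})
        link.text = pieces[k]
        link.tail = pieces[k + 1]
        parent.insert(index + offset, link)
    return pieces[0]

def linkify(elem):
    if elem.tag == 'a':
        return  # existing links, and everything inside them, stay untouched
    children = list(elem)
    for child in children:
        linkify(child)
    # Walk backwards so earlier insert positions are not shifted.
    for i in range(len(children) - 1, -1, -1):
        children[i].tail = wrap_matches(elem, i + 1, children[i].tail)
    elem.text = wrap_matches(elem, 0, elem.text)

root = ET.fromstring(content)
linkify(root)
output = ET.tostring(root, encoding='unicode')
print(output)
```

Only element text and tails are rewritten, so attribute values such as the img title are never candidates for replacement, and anything inside an existing a element is skipped by construction.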

Keith

score 1 · Accepted Answer · answered Jun 10 '12 at 12:41

I don't know why your expressions don't work. The only thing I noticed is a lookahead group at the start, which doesn't make much sense to me. This one appears to work well:

import re

content="""fox
    <a>fox</a> fox <p fox> and <tag fox bar> 
    <a>small <b>fox</b> and</a>
fox"""

rr = """
(fox)
(?! [^<>]*>)
(?!
    (.(?!<a))*
    </a
)
"""

p = re.compile(rr, re.IGNORECASE | re.MULTILINE | re.VERBOSE)
print p.sub(r'((\g<1>))', content)
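
On the lookahead group at the start of the question's pattern: a lookahead only inspects the text that begins at the current position, never text that came earlier. Because `.*?` is lazy and may match nothing, `(?!((<.*?)|(<a.*?)))` amounts to "the next character is not `<`", and that is always true at a position where `fox` is about to match. The sketch below (assuming Python 3; the variable names are illustrative) compares the question's pattern with and without that group on the question's sample text:

```python
import re

content = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Fox'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See &quot;Dog chase Fox&quot; image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""

flags = re.IGNORECASE | re.MULTILINE

# The pattern exactly as posted in the question.
with_lookahead = re.compile(
    r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))', flags)

# The same pattern with the leading negative lookahead removed.
without_lookahead = re.compile(
    r'(fox)(?!(([^<>]*?)>)|([^>]*?</a>))', flags)

spans_with = [m.span() for m in with_lookahead.finditer(content)]
spans_without = [m.span() for m in without_lookahead.finditer(content)]
words = [m.group(0) for m in with_lookahead.finditer(content)]

print(spans_with == spans_without)
print(words)
```

Both patterns find the same two occurrences, so the filtering is done entirely by the lookahead that follows `(fox)`.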

georg

"I don't know why your expressions don't work" is part of the reason why using regex for parsing HTML is a bad idea. They always turn into a maintenance nightmare that nobody can fix. – Tomalak Jun 10 '12 at 12:44
@thg435 The regular expression in my code works, but the backreference won't work. See here: http://regexr.com/?317bn my regular expression matches what I want, but the backreference won't work. – Susan Mayer Jun 10 '12 at 13:01
@Tomalak In this particular case, I am the only person who needs to maintain this tiny piece of code. The problem is I don't know why the backreference won't work. – Susan Mayer Jun 10 '12 at 13:03
@Susan: as seen in your `findall` output, "fox" is group 4, not 3. – georg Jun 10 '12 at 13:05
@thg435 Is it indexed from 1? I think it is indexed from 0. so 0,1,2,3 – Susan Mayer Jun 10 '12 at 13:07
@thg435 What is the second part of your regular expression? `(?! (.(?! – Susan Mayer Jun 10 '12 at 13:10
@thg435 Yours won't work. It matches two extra "fox". See the result here: http://regexr.com?317bt – Susan Mayer Jun 10 '12 at 13:13
@Susan: you've got one extra space in there ;) – georg Jun 10 '12 at 13:15
@thg435 Huh? Please post your code here: http://regexr.com/?317bt and post the link back. – Susan Mayer Jun 10 '12 at 13:25
@Susan: there's a space after '?!' that shouldn't be there. – georg Jun 10 '12 at 13:27
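
The group-numbering point raised in the comments can be checked directly. Capturing groups are numbered from 1 by the position of their opening parenthesis, including groups inside lookaheads, and group 0 is the whole match. The tuples printed by `findall` are ordinary 0-indexed Python tuples, so index 3 there corresponds to group 4. Group 3, `(<a.*?)`, sits inside a negative lookahead and therefore never participates in a successful match, which is why Python 2.5 reports "unmatched group" for `\3`. The following is a small sketch assuming Python 3, where (from 3.5 on) an unmatched group in the replacement expands to an empty string instead of raising; the sample string is made up for the example.

```python
import re

pattern = re.compile(
    r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))',
    re.IGNORECASE | re.MULTILINE)

# Hypothetical sample: one fox already linked, one plain.
content = "<p>A <a href='/wiki/Fox'>fox</a> and another fox.</p>"

match = pattern.search(content)
print(pattern.groups)   # number of capturing groups in the pattern
print(match.group(0))   # group 0 is the whole match
print(match.group(4))   # (fox) is the fourth opening parenthesis

# Refer to the word with \4 (or \g<0>, since the lookaheads consume nothing).
output = pattern.sub(r'<a href="/fox">\4</a>', content)
print(output)
```

Naming the group, for example `(?P<word>fox)` with `\g<word>` in the replacement, would avoid counting parentheses altogether.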
