Here is the Python 2.5 code (it replaces the word fox with a link, <a href="/fox">fox</a>, and avoids the replacement inside an existing link):

import re

content="""
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Fox'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See &quot;Dog chase Fox&quot; image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""

p=re.compile(r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))',re.IGNORECASE|re.MULTILINE)
print p.findall(content)

for match in p.finditer(content):
  print match.groups()

output=p.sub(r'<a href="/fox">\3</a>',content)
print output

The output is:

[('', '', '', 'fox', '', '.', ''), ('', '', '', 'Fox', '', '', '')]
('', '', None, 'fox', '', '.', '')
('', '', None, 'Fox', None, None, None)

Traceback (most recent call last):
  File "C:/example.py", line 18, in <module>
    output=p.sub(r'<a href="fox">\3</a>',content)
  File "C:\Python25\lib\re.py", line 274, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
error: unmatched group
  1. I am not sure why the backreference \3 won't work.

  2. (?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>)) works (see http://regexr.com?317bn), which is surprising. The first negative lookahead, (?!((<.*?)|(<a.*?))), puzzles me; in my opinion, it is not supposed to work. Take the first match it finds, the fox in "gave chase to the fox.</p>": earlier on that line there is <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, which matches ((<.*?)|(<a.*?)), so as a negative lookahead it should return FALSE. I am not sure whether I am expressing myself clearly.

Thanks a lot!

(Note: I hate using BeautifulSoup and enjoy writing my own regular expressions. I know many people here will say regular expressions are not for HTML processing, blah blah. But this is a small program, so I prefer regular expressions over BeautifulSoup.)

Susan Mayer
  • 7
    *Why* are you doing this with regex? [Python has such a nice HTML parser.](http://www.crummy.com/software/BeautifulSoup/) Hint: HTML cannot be parsed with regular expressions. You waste your time trying. – Tomalak Jun 10 '12 at 12:17
  • 1
    @Tomalak I added a note to the question. This is a tiny program, so it isn't worth the effort of importing and learning BeautifulSoup. I enjoy regular expressions. – Susan Mayer Jun 10 '12 at 12:26
  • 5
    Susan, it does not matter in the least if you enjoy regular expressions. This is like playing Golf with a hammer and saying *"I don't have the time to learn what's a 9-iron. I enjoy wielding a hammer."* It is completely the wrong tool for the job. The time it took you to a) try and fail to figure out a regex that works and b) write a question here would have been better invested learning BeautifulSoup. It's not that it would be *hard* or anything. You'd very probably be done already. – Tomalak Jun 10 '12 at 12:31
  • 4
    Tomalak is spot on... for a tongue-in-cheek discussion that is quite relevant, see [**this answer**](http://stackoverflow.com/a/1732454/667301) – Mike Pennington Jun 10 '12 at 12:35
  • 1
    Disclaimer: I enjoy regular expressions, too. [I really do.](http://stackoverflow.com/badges/133/regex?userid=18771) Take my word for it that you are wasting your time. *"I hate using the right tool for the job"* is not a valid argument when asking for help. For a programmer, doubly so. – Tomalak Jun 10 '12 at 12:37
  • 1
    @Tomalak: the statement "HTML cannot be parsed with regular expressions" is vague in many ways. At least, define your terms. What do you mean by "parse"? What do you mean by "regular expressions"? (hint: re's in programming languages are not "regular language"), etc... – georg Jun 10 '12 at 12:47
  • 1
    @Tomalak Thanks a lot! But I still want to know why the backreference won't work here and why the first negative lookahead works. – Susan Mayer Jun 10 '12 at 12:49
  • 2
    @thg435 The statement is pretty solid. I'm not sure what you mean by "define regular expressions"? How, exactly, is this term a matter of definition? My point is that dissecting a string of HTML into its meaningful parts (i.e. "parsing") is impossible with regular expressions because REs can match *regular* languages. HTML is not a regular language. It's a [context-free language](http://stackoverflow.com/a/5207677/18771), which is a language that REs are not fit to parse. – Tomalak Jun 10 '12 at 12:51
  • @Susan Don't say "thank you" if you don't mean it. ;) – Tomalak Jun 10 '12 at 12:52
  • @Tomalak: modern RE engines can match regular languages and [much more](http://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages) – georg Jun 10 '12 at 12:53
  • @thg435 Does the regex engine of Python? Besides, even Perl 6's regular expressions are unfit to parse HTML without errors. And besides *all that*, where is the point in not using a parser when you have one available? Everybody can use regex to handle HTML as much as they want, they just should be able to handle regex on a level that allows them not to tell anyone about it. – Tomalak Jun 10 '12 at 12:54
  • 2
    @Tomalak: Well, point one is that curiosity drives progress. Point two is that most "parsers", including BS, actually do use regular expressions to "parse html", whatever that means. – georg Jun 10 '12 at 13:09
  • I agree with thg435. I believe BeautifulSoup uses regular expressions heavily. – Susan Mayer Jun 10 '12 at 13:12
  • 1
    @thg435 Wow, I didn't notice we're on the "whatever that means" level already. What a shady way to lead an argument. Let's just hope you're a little ashamed of yourself for pulling such a trick. Anyway. *Of course* BeautifulSoup uses regular expressions. It's the *other code* it contains that makes the difference. If HTML could be parsed with regex alone, HTML parsers would not exist, or would they? I can't believe I must explain this. – Tomalak Jun 10 '12 at 13:16
  • 1
    @Tomalak: there's no trick. In order to have a senseless argument, you have to define terms. If by "parsing html" you mean _tokenizing_ (and one of your previous comments makes me think so), then RE's is actually a suitable and even preferred tool. If you mean something else, please explain yourself. – georg Jun 10 '12 at 13:22
  • @Tomalak Instead of arguing RE vs. BS, could you please take a look at my questions, since you have a badge for regex? The result is here: http://regexr.com/?317bn – Susan Mayer Jun 10 '12 at 13:23
  • @thg435 I mean correctly recognizing the individual parts of an HTML string. In order to make changes to them and save them back without breaking the markup. In order to not replace the word `"fox"` in comments. Or attribute values. Undisturbed by invalidly nested markup. I don't mean "tokenizing", which is a small part of a parsing process. I did not know that "parsing" had multiple meanings in CS. Sorry, but "explain your terms" might be an acceptable move in an area that's less well-discussed. Here, it's not. – Tomalak Jun 10 '12 at 13:32
  • 4
    @Susan I'm sorry this got out-of-hand a little. I understand your intentions and I'm sure you understand mine as well. Although it looks like I'm just being cocky about it, I'm actually trying to help you. I'm the guy that tries to hand you a 9-iron while you keep asking why you can't seem to go 100 yards with a hammer. I probably could fix your regex. You could also just use a parser. A valid argument *against* doing that has yet to be made. – Tomalak Jun 10 '12 at 13:39
  • @Tomalak: this is where you should have started - just answer the question (if you can). Isn't it a shame for Stackoverflow when a poster gets 100 tons of bullshit for one meaningful answer? – georg Jun 10 '12 at 14:12
  • 3
    @thg435 I challenge you to point out the bullshit. Not every question deserves the response the OP had in mind. This does not automatically make the response bullshit. I can answer the question. I won't. Because it would be the Wrong Thing to do. That's a good reason. The OP could use a parser. She won't. No reason given. That's an indefensible position. – Tomalak Jun 10 '12 at 14:19
  • @Tomalak: this is a programming board. Programming is about solving problems under given constraints, no matter how absurd they appear at the first glance. If you prefer debating about Right and Wrong Things, the philosophy class is right down the hall. – georg Jun 10 '12 at 14:28
  • 3
    @thg435 "But I *want* to use a hammer!" - "Sure thing, let me help you with this. This is a Golf training course, after all. We're here to teach you Golf, not philosophy." - Uhm, sorry, doesn't work. If I sense that there's an advanced but seemingly absurd question, I'm happy to help. I go out of my way, spending hours, even days thinking about it. "How can I make my regex work with HTML?" generally does not fall into the advanced questions category. – Tomalak Jun 10 '12 at 14:34

2 Answers

3

If you don't like BeautifulSoup, try one of these other (X)HTML parsers:

html5lib
ElementTree
lxml

If you ever plan to, or need to, parse HTML (or a variant), it is worth learning these tools.
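For a sense of what the parser route can look like, here is a minimal sketch. It uses the standard library's html.parser rather than any of the packages listed above, it is written for Python 3 rather than the question's Python 2.5, and the names FoxLinker and link_foxes are made up for illustration. It re-emits the markup as it arrives and wraps the word only in text that is not inside an <a> element. Matching on word boundaries (\b) is an assumption about what "the word fox" should mean.

```python
import re
from html.parser import HTMLParser

# Assumption: "the word fox" means a whole word, matched case-insensitively.
FOX = re.compile(r"\bfox\b", re.IGNORECASE)


class FoxLinker(HTMLParser):
    """Re-emit the markup, linking 'fox' in text that is outside <a> elements."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &quot; separately,
        # so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img .../> are copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.link_depth:
            data = FOX.sub(r'<a href="/fox">\g<0></a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def link_foxes(html):
    parser = FoxLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(link_foxes("<p>A <a href='/wiki/Fox'>fox</a> chased a Fox.</p><img title='fox'/>"))
```

Attribute values and comments never reach handle_data, so they are left alone without any lookahead tricks. End tags are written back in lowercase, which is a simplification of this sketch.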

Keith
1

I don't know why your expression doesn't work. The only thing I noticed is a lookahead group at the start, which doesn't make much sense to me. This one appears to work well:

import re

content="""fox
    <a>fox</a> fox <p fox> and <tag fox bar> 
    <a>small <b>fox</b> and</a>
fox"""

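rr = """
(fox)               # the word to replace, captured as group 1
(?! [^<>]*>)        # not inside a tag: no '>' ahead before the next '<'
(?!                 # not inside a link:
    (.(?!<a))*      #   scanning ahead without crossing another '<a'...
    </a             #   ...must not reach a closing '</a'
)
"""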

p = re.compile(rr, re.IGNORECASE | re.MULTILINE | re.VERBOSE)
print p.sub(r'((\g<1>))', content)
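On the two points raised in the question, a short check may help. Counting opening parentheses in the original pattern, (fox) is group 4, while group 3 is (<a.*?). That group sits inside a negative lookahead, so it can never take part in a successful match, which would explain the "unmatched group" error (Python 3.5 and later substitute an empty string instead of raising). The first lookahead is tested at the position where fox starts, so the next character is always f and never <; it therefore always succeeds and appears to be removable. The sketch below is written for Python 3, not the question's 2.5 session, and sample and simplified are names made up here.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# Pattern from the question, unchanged.
original = re.compile(
    r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))', FLAGS)

# Same pattern with the leading lookahead dropped; (fox) becomes group 1.
simplified = re.compile(r'(fox)(?!(([^<>]*?)>)|([^>]*?</a>))', FLAGS)

sample = "<p>A <a href='/wiki/Fox'>fox</a> and a fox.</p>"

# Reference group 4, the one that actually captured the word.
fixed = original.sub(r'<a href="/fox">\4</a>', sample)
print(fixed)

# Both patterns select the same occurrences on this sample.
print(fixed == simplified.sub(r'<a href="/fox">\1</a>', sample))
```

This only shows that the two patterns agree on one small sample; it is not a claim that either is safe on arbitrary HTML.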
georg