0

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

Let's say I'm trying to match the word "google" in a string. Sometimes that string contains a link (<a href="http://www.google.com">google</a>), and I only want to match the word when it's not part of a link.

How can I check whether there is an <a href="http://www.google.com"> before the word?

Undefined
  • Have some fun http://stackoverflow.com/a/1732454/876211 – Gabber Sep 21 '12 at 14:20
  • 3
    I suggest that we should not link to http://stackoverflow.com/a/1732454/876211 in cases like this. The only people who understand it are the people who *already understand* why parsing HTML with regexes is suboptimal. For a novice, it is meaningless. I could use some help adding content to http://htmlparsing.com where we can *explain* to novices in terms they understand why they shouldn't use regexes for HTML parsing. I've already got a lot of counterexamples: http://htmlparsing.com/regexes.html – Andy Lester Sep 21 '12 at 14:40
  • 1
    Agreed, @Andy, that The Answer is not helpful for a novice who wants to extract info from HTML, but there are several other explanatory and useful answers at the same question. We can count these questions as answered there, even if it's not the accepted answer that does it. – jscs Sep 21 '12 at 23:02
  • 1
    @AndyLester I think we need to open something on MSO about this issue. I will do it if need be, but you are more diplomatic than I am. – tchrist Sep 21 '12 at 23:54
  • @tchrist: While I do think that something on MSO would be worthwhile, I still want to get http://htmlparsing.com up for all the other novices who still need to know the right way to do it. – Andy Lester Sep 23 '12 at 20:23

1 Answer

7

The most accurate approach is to:

  • Parse the string as HTML
  • Search the text that is not inside a link (an <a> element) for the string "google".

You don't want to try parsing HTML with regular expressions. It will make you sad in the long run. Please take a look at http://htmlparsing.com/ for some pointers that could get you started.
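The two steps above can be sketched with Python's standard-library `html.parser`, as one possible illustration rather than the only way to do it. The class name `TextOutsideLinks` and the helper `count_outside_links` are made up for this example, and the sketch assumes that "not a link" means "not inside an <a> element".

```python
import re
from html.parser import HTMLParser


class TextOutsideLinks(HTMLParser):
    """Collect text nodes that are not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # greater than 0 while inside <a>...</a>
        self.chunks = []     # text nodes found outside links

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.link_depth == 0:
            self.chunks.append(data)


def count_outside_links(html, word):
    """Count whole-word, case-insensitive matches of `word` outside links."""
    parser = TextOutsideLinks()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    # Each text node is searched on its own, so a word split across
    # tags is deliberately not treated as a match.
    return sum(len(pattern.findall(chunk)) for chunk in parser.chunks)


sample = 'Try <a href="http://www.google.com">google</a> or just google it.'
print(count_outside_links(sample, "google"))
```

The regular expression here only ever sees plain text that the parser has already separated from the markup, which is the point of the approach: the parser decides what is a link, and the regex only decides what is a word.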

Andy Lester
  • 2
    @Undefined Right now someone is working on a regex that will pass your specific test cases but will fail for any number of reasons when you try to apply it in the real world. This answer is a much better approach. – chucksmash Sep 21 '12 at 14:23