0

I want to match a keyword only where it is not linked. In the example below, I want to match only the last google: the one that is neither between <a> and </a> nor inside an attribute value.

<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.

Brad Mace
James Tang
  • Can you be more specific? Unfortunately, I'm struggling to understand the question. Where are you making the comparison, database, programming language, etc. What have you currently tried which fails? – Dave Rix Jul 06 '10 at 10:25
  • Will you consider a non-regex solution, or do you insist on a hacked up regex? – polygenelubricants Jul 06 '10 at 10:25
  • @OP you might want to specify which language you are using too (so that a non-regex alternative can be suggested). regexs are not the best thing for parsing HTML as pointed out in the answers. if you really really need to use regex you might want to say why you'd prefer a regex solution. – potatopeelings Jul 06 '10 at 10:49
  • Thanks for your suggestion, I want to implement it with JavaScript. Due to my poor English, I cannot describe my question clearly, I am sorry for it! – James Tang Jul 08 '10 at 08:12

3 Answers

5

Do not parse HTML with regular expressions. HTML is not a regular language. Use an HTML parser.
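
As a rough illustration of the non-regex route: in a browser, the built-in DOMParser would be the natural tool. The sketch below is not a real HTML parser. It is a small hand-written scanner that runs without a DOM, and the name findUnlinked is made up for this example. It assumes simple markup: no comments, CDATA, or scripts, no ">" inside attribute values, and a case-sensitive keyword.

```javascript
// Hypothetical helper (not from any library): returns the start indices of
// `keyword` occurrences that are neither inside a tag (attributes included)
// nor between <a ...> and </a>.
// Assumption: simple markup, no comments/CDATA, no ">" inside attribute values.
function findUnlinked(html, keyword) {
  const hits = [];
  if (!keyword) return hits; // an empty keyword would never advance the scan
  let anchorDepth = 0; // > 0 while inside an <a> element
  let i = 0;
  while (i < html.length) {
    if (html[i] === "<") {
      const close = html.indexOf(">", i);
      if (close === -1) break; // unterminated tag: stop scanning
      const tag = html.slice(i + 1, close).trim().toLowerCase();
      if (/^a(\s|$)/.test(tag)) {
        anchorDepth++; // opening <a ...>
      } else if (/^\/a(\s|$)/.test(tag)) {
        anchorDepth = Math.max(0, anchorDepth - 1); // closing </a>
      }
      i = close + 1; // skip the whole tag, so attribute values are never searched
      continue;
    }
    if (anchorDepth === 0 && html.startsWith(keyword, i)) {
      hits.push(i);
      i += keyword.length;
      continue;
    }
    i++;
  }
  return hits;
}

const sample =
  '<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.';
const found = findUnlinked(sample, "google");
console.log(found.length); // 1
console.log(sample.slice(found[0])); // "google is not linked."
```

Because every tag is skipped as a unit, a keyword in an attribute of any tag, not only of an a tag, is ignored.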

Community
Ignacio Vazquez-Abrams
0

Provided you can be sure that your HTML is well behaved (and valid), and in particular that it contains no comments or nested a tags, you can try

google(?!((?!<a[\s>]).)*</a>)

That matches any "google" that is not followed by a closing a tag before the next opening a tag. But you might be better off using an HTML parser instead.
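
For reference, here is a minimal sketch of how that pattern might be applied in JavaScript to the sample string from the question. The variable names are illustrative, the "/" is escaped for the regex literal, and the g flag is an addition so that every match is reported.

```javascript
// The sample string from the question.
const sample =
  '<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.';

// The pattern from this answer, written as a JavaScript regex literal.
// The g flag (an addition here) makes matchAll report every match.
const unlinked = /google(?!((?!<a[\s>]).)*<\/a>)/g;

// Collect the start index of each match.
const positions = [...sample.matchAll(unlinked)].map((m) => m.index);

console.log(positions.length); // 1
console.log(sample.slice(positions[0])); // "google is not linked."
```

Note that "." does not match line breaks in JavaScript by default, so markup that spans several lines would need [\s\S] in its place, or the s flag where it is supported.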

Jens
  • @Jens, `(\s|>)` would be better written as a character class: `[\s>]`. A character class is much, much more efficient than an equivalent alternation. It probably doesn't matter in this case, but see this recent question for a demonstration: http://stackoverflow.com/questions/3176825/unicode-regular-expressions-fails-at-343-characters – Alan Moore Jul 06 '10 at 12:57
  • -1 for parsing HTML with a regular expression; this regex can mismatch with XHTML CDATA or HTML comments. – Borealid Jul 07 '10 at 07:44
  • @Borealid: That's why I said that the HTML should not contain comments. I agree that this is not the way the problem SHOULD be solved, but I don't think the standard "regex is evil" answer is going to help the OP with his problem in any way. – Jens Jul 07 '10 at 08:35
  • This pattern also matches the keyword (google) in the html attributes, such as XXX, which I do not want to be matched. Thanks all the same! – James Tang Jul 08 '10 at 08:17
0

This works for me (JavaScript):

var matches = str.match(/(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/);
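
A short sketch of how the result of that call might be read, using the sample string from the question. The variable names and the second input string are made up for illustration. The capture group holds the keyword, and its position has to be derived from the overall match.

```javascript
const pattern = /(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/;

// The sample string from the question.
const linkedThenPlain =
  '<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.';
const m = linkedThenPlain.match(pattern);

// m[0] is everything consumed up to and including the keyword; m[1] is the keyword.
const keywordStart = m.index + m[0].length - m[1].length;
console.log(linkedThenPlain.slice(keywordStart)); // "google is not linked."

// Hypothetical input in which the only occurrence is linked.
// The pattern still reports a match, because the optional group can match nothing.
const onlyLinked = 'see <a href="x">google</a>';
const m2 = onlyLinked.match(pattern);
console.log(m2 !== null); // true
```

As the second case suggests, the pattern seems to rely on an unlinked occurrence appearing after the last link, so the result is worth checking before use.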


gblazex