Regular expression to match word instances not in html attrs or link text

Question

I want to match a keyword only where it is not linked. In the following example, the keyword "google" should match only where it is neither between <a> and </a> nor inside an attribute value, so only the last "google" should match:

<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.

Can you be more specific? Unfortunately, I'm struggling to understand the question. Where are you making the comparison: in a database, in a programming language, etc.? What have you tried so far that fails? – Dave Rix, Jul 06 '10 at 10:25
Will you consider a non-regex solution, or do you insist on a hacked-up regex? – polygenelubricants, Jul 06 '10 at 10:25
@OP You might want to specify which language you are using too (so that a non-regex alternative can be suggested). Regexes are not the best tool for parsing HTML, as pointed out in the answers. If you really need to use a regex, you might want to say why you'd prefer a regex solution. – potatopeelings, Jul 06 '10 at 10:49
Thanks for your suggestion. I want to implement it in JavaScript. Due to my poor English I cannot describe my question clearly; I am sorry for that! – James Tang, Jul 08 '10 at 08:12

score 5 · Answer 1 · edited May 23 '17 at 10:24

Do not parse HTML with regular expressions. HTML is not a regular language. Use an HTML parser.
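
As a rough sketch of the parser-based approach in JavaScript (the language the asker mentions in the comments): once the markup is a DOM tree, attribute values are not child nodes, so only text nodes need checking, and any a element can be skipped whole. The function name findUnlinkedMatches is made up for this illustration, and it assumes root is a DOM node obtained elsewhere, for example document.body in a browser.

```javascript
// Sketch only: collect occurrences of `keyword` in text nodes that are
// not inside an <a> element. `root` is assumed to be a DOM node; only the
// standard nodeType, nodeName, nodeValue and childNodes properties are used.
function findUnlinkedMatches(root, keyword) {
  const ELEMENT_NODE = 1;
  const TEXT_NODE = 3;
  const hits = [];
  if (!keyword) return hits; // an empty keyword would never advance

  function walk(node) {
    if (node.nodeType === TEXT_NODE) {
      let from = 0;
      let index;
      // Record every occurrence inside this text node.
      while ((index = node.nodeValue.indexOf(keyword, from)) !== -1) {
        hits.push({ node, index });
        from = index + keyword.length;
      }
      return;
    }
    // Skip link elements entirely, so link text is never inspected.
    if (node.nodeType === ELEMENT_NODE && node.nodeName.toUpperCase() === 'A') {
      return;
    }
    // Attributes are not child nodes, so they are never visited.
    for (const child of Array.from(node.childNodes || [])) {
      walk(child);
    }
  }

  walk(root);
  return hits;
}
```

In a browser, a call such as findUnlinkedMatches(document.body, 'google') would return one entry per unlinked occurrence, each holding the text node and the offset inside it.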


answered Jul 06 '10 at 10:25 by Ignacio Vazquez-Abrams

Jens · Answer 2 · 2010-07-07T07:40:30.880

0

Provided you can be sure that your HTML is well behaved (and valid), and in particular that it does not contain comments or nested a tags, you can try:

google(?!((?!<a[\s>]).)*</a>)

That matches any "google" that is not followed by a closing a tag before the next opening a tag. But you might be better off using an HTML parser instead.
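
A quick way to check the pattern against the example from the question, written as a JavaScript sketch. The variable names are illustrative, and the second input is an invented case showing one way the stated assumptions can break: an attribute on a tag that has no closing a tag after it.

```javascript
// The pattern from above as a JavaScript literal: "/" has to be escaped,
// and the g flag is required by matchAll.
const unlinked = /google(?!((?!<a[\s>]).)*<\/a>)/g;

const sample = '<a href="http://www.google.com" title="google">google</a>' +
  ' is linked, google is not linked.';

// Start offsets of all matches in the question's example.
const offsets = Array.from(sample.matchAll(unlinked), (m) => m.index);

// Invented counter-example: the attribute value is not followed by </a>,
// so the negative lookahead does not reject it.
const imgSample = '<img alt="google"> plain text';
const imgOffsets = Array.from(imgSample.matchAll(unlinked), (m) => m.index);
```

For the question's example, offsets holds a single entry pointing at the final, unlinked occurrence. imgOffsets is not empty, which means the attribute value in the invented case is matched as well.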

edited Jul 07 '10 at 07:40

answered Jul 06 '10 at 11:52


@Jens, `(\s|>)` would be better written as a character class: `[\s>]`. A character class is much, much more efficient than an equivalent alternation. It probably doesn't matter in this case, but see this recent question for a demonstration: http://stackoverflow.com/questions/3176825/unicode-regular-expressions-fails-at-343-characters – Alan Moore Jul 06 '10 at 12:57
-1 for parsing HTML with a regular expression; this regex can mismatch with XHTML CDATA or HTML comments. – Borealid Jul 07 '10 at 07:44
@Borealid: That's why I said that the HTML should not contain comments. I agree that this is not the way the problem SHOULD be solved, but I don't think the standard "regex is evil" answer is going to help the OP with his problem in any way. – Jens Jul 07 '10 at 08:35
This pattern also matches the keyword (google) in HTML attributes, such as XXX, which I do not want to be matched. Thanks all the same! – James Tang Jul 08 '10 at 08:17

score 0 · Answer 3 · answered Jul 06 '10 at 15:56

This works for me (JavaScript):

var matches = str.match(/(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/);
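
For reference, a sketch of how the result of this call can be read. The idea appears to be that the optional non-capturing group consumes a leading link together with whatever follows it, and backtracking then leaves the keyword in capture group 1. The variable names and the second input are illustrative assumptions, not part of the answer.

```javascript
const pattern = /(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/;

const str = '<a href="http://www.google.com" title="google">google</a>' +
  ' is linked, google is not linked.';
const matches = str.match(pattern);

// Capture group 1 holds the keyword. matches.index is where the whole match
// starts, so the keyword's offset is the end of the match minus its length.
const keyword = matches[1];
const keywordOffset = matches.index + matches[0].length - keyword.length;

// The pattern is unanchored, so a string whose only occurrence is link text
// still produces a match; the caller has to rule that case out separately.
const linkedOnly = '<a href="x">google</a>'.match(pattern);
```

On the question's example, keywordOffset points at the final, unlinked occurrence. The linkedOnly case is not null, so the pattern alone does not guarantee that the captured keyword is outside a link.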

answered Jul 06 '10 at 15:56 by gblazex
