Regex replace string but not inside html tag

Question

I want to replace a string in an HTML page using JavaScript, but ignore it if it is inside an HTML tag. For example:

<a href="google.com">visit google search engine</a>
you can search on google tatatata...

I want to replace google with <b>google</b>, but not inside the <a> tag. The result should be:

<a href="google.com">visit google search engine</a>
you can search on <b>google</b> tatatata...

I tried with this one:

regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML =  el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');

but the problem is that I also get <b>google</b> inside the <a> tag:

<a href="google.com">visit <b>google</b> search engine</a>
you can search on <b>google</b> tatatata...

How can I fix this?

score 6 · Answer 1 · answered Jul 21 '09 at 11:35

You'd be better off using an HTML parser for this, rather than a regex. I'm not sure it can be done 100% reliably with a regex.

Draemon

score 5 · Answer 2 · answered Jul 21 '09 at 11:41

You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except if it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).

It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.

jhurshman

I agree. Find all the text nodes in the DOM that contain the string. Keep a blacklist of tags that you **don't** want to replace the string in. Check if the text node is inside one of these tags. If not, do your replacement, otherwise leave it as is. – tvanfosson Jul 21 '09 at 11:45
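The steps in this comment can be sketched roughly as follows. This is only one possible reading of it: the names `SKIP_TAGS`, `findTextNodes` and `boldWord` are made up for illustration, and the blacklist contents are an assumption. `root` is expected to be an element, and `boldWord` needs a browser `document`.

```javascript
// Numeric values of Node.ELEMENT_NODE and Node.TEXT_NODE.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical blacklist: text inside these elements is left untouched.
const SKIP_TAGS = new Set(['A', 'SCRIPT', 'STYLE', 'TEXTAREA']);

// Escape regex metacharacters so `word` is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Collect the text nodes under `root` that contain `word` and are not
// nested inside a blacklisted element.
function findTextNodes(root, word, skipTags = SKIP_TAGS) {
  const needle = word.toLowerCase();
  const found = [];
  (function walk(node) {
    if (node.nodeType === TEXT_NODE) {
      if (node.nodeValue.toLowerCase().includes(needle)) found.push(node);
    } else if (node.nodeType === ELEMENT_NODE &&
               !skipTags.has(node.nodeName.toUpperCase())) {
      Array.from(node.childNodes).forEach(walk);
    }
  })(root);
  return found;
}

// Browser-only step: wrap every occurrence of `word` in a <b> element.
function boldWord(root, word) {
  const splitter = new RegExp('(' + escapeRegExp(word) + ')', 'i');
  for (const textNode of findTextNodes(root, word)) {
    const fragment = document.createDocumentFragment();
    // split() with a capture group keeps the matches at odd indexes.
    textNode.nodeValue.split(splitter).forEach((part, i) => {
      if (i % 2 === 1) {
        const bold = document.createElement('b');
        bold.textContent = part;
        fragment.appendChild(bold);
      } else if (part) {
        fragment.appendChild(document.createTextNode(part));
      }
    });
    textNode.parentNode.replaceChild(fragment, textNode);
  }
}
```

In a page this would be called as `boldWord(el, 'google')`. Because the text nodes are collected before anything is modified, the newly inserted <b> elements are not visited again.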

score 1 · Answer 3 · edited May 23 '17 at 12:03

Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.

For more details see this Stack Overflow question (and answers).

answered Jul 21 '09 at 12:00

Brian Agnew

score 1 · Answer 4 · answered Sep 14 '09 at 11:26

I think you're all missing the question here...

When he says inside the tag, he means inside the opening tag, as in the <a href="google.com"> tag. This is quite different from text inside a tag pair such as <body> </body>. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, I'll come back and post.

naugtur · Answer 5 · 2010-01-29T12:58:37.783

WORKAROUND

If you can't use an HTML parser, or are quite confident about your HTML structure, try this:

1. Do the "bad" replacement.
2. Repeatedly replace (<[^>]*)(<[^>]+>) with $1, as many times as you need.

It's a simple workaround, but it works for me.

Cons? Well... you have to do the replace twice for the case ... ...> because each pass removes only the first unwanted tag from every tag on the page.
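A minimal sketch of the workaround above, assuming the "bad" replacement wraps the word in <b> tags. The function names are hypothetical, and the cleanup loop simply repeats the replace until nothing changes instead of guessing how many passes are needed.

```javascript
// Step 1: the "bad" replacement, which also hits attribute values.
function boldEverywhere(html, word) {
  return html.split(word).join('<b>' + word + '</b>');
}

// Step 2: strip any tag that ended up inside another tag. Each pass
// removes at most one nested tag per outer tag, so loop until stable.
function stripTagsInsideTags(html) {
  const nested = /(<[^>]*)(<[^>]+>)/g;
  let previous;
  do {
    previous = html;
    html = html.replace(nested, '$1');
  } while (html !== previous);
  return html;
}

const input = '<a href="google.com">visit google search engine</a>\n' +
  'you can search on google tatatata...';
const output = stripTagsInsideTags(boldEverywhere(input, 'google'));
console.log(output);
```

Note that this only protects the opening tag (the href value here); the link text is still wrapped in <b>. It will also misbehave if the text itself contains a bare < character.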

[edit:] SOLUTION

Why not use jQuery, put the html code into the page and do something like this:

$(containerOrSth).find('a').each(function () {
  if ($(this).children().length == 0) {
    // No child elements: the link contains plain text only.
    $(this).text($(this).text().replace('google', 'evil'));
  } else {
    // Here you have to care about child tags, but you have to know where
    // to expect them: before or after the text. Comment for more help.
  }
});

Another con is that it's not a parser. – BalusC Jan 29 '10 at 12:11
Hey, I said "if You can't use a parser" - so yes, it's not – naugtur Jan 29 '10 at 12:52

score 1 · Answer 6 · answered Mar 05 '21 at 12:42

I'm using regex = new RegExp("(?=[^>]*<)google", 'i');

George WB

This lookahead works for my case. Please note that the replacement only works if one opening tag follows the keyword 'google' (which should always be the case for valid HTML). I also added 'g' flag so that multiple occurrences of 'google' inside the same tag are correctly replaced. – walderich Jul 29 '21 at 17:06
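A short sketch of this lookahead with the 'g' flag added, as the comment suggests. The second function is a variation that is not from the answer: it is an assumption about how one might also accept the end of the string, for input where no tag follows the keyword. Both function names are made up for illustration.

```javascript
// The answer's pattern with the 'g' flag: match "google" only when the
// next angle bracket after it is "<", i.e. we are not inside a tag.
function boldBeforeNextTag(html) {
  return html.replace(/(?=[^>]*<)google/gi, '<b>$&</b>');
}

// Variation (assumption, not from the answer): also accept the end of
// the input, so text that is not followed by any tag is still matched.
function boldOutsideTags(html) {
  return html.replace(/google(?=[^<>]*(?:<|$))/gi, '<b>$&</b>');
}

const page = '<p><a href="google.com">visit google search engine</a> ' +
  'you can search on google tatatata...</p>';
console.log(boldBeforeNextTag(page));
console.log(boldOutsideTags('you can search on google'));
```

Both versions leave the href value alone but still replace the text inside the <a> element, so they only cover the "not inside the opening tag" reading of the question.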

score 0 · Answer 7 · answered Jul 21 '09 at 11:39

You can't really do that: your "google" is always in some tag. Either replace all or none.

skrat

score 0 · Answer 8 · answered Jul 21 '09 at 12:34

Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might just check for that part, mainly by making sure you don't have a trailing </a> tag before a fresh <a>.

Grubsnik

score 0 · Answer 9 · edited May 23 '17 at 10:33

You can do that using REGEX, but filtering blocks like STYLE, SCRIPT and CDATA will need more work, and that is not implemented in the following solution.

Most of the answers state that 'your data is always in some tags' but they are missing the point, the data is always 'between' some tags, and you want to filter where it is 'in' a tag.

Note that tag characters in inline scripts will likely break this, so if they exist, they should be processed separately with this method. Take a look here:
complex html string.replace function
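The linked solution is not reproduced here. As an illustration only, and not necessarily what that answer does, one common regex technique for the 'in' versus 'between' distinction is to match whole tags first and the keyword second, then return tags unchanged from a replace callback. The function names below are hypothetical.

```javascript
// Escape regex metacharacters so `word` is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Alternation: group 1 captures a complete tag, group 2 the keyword.
// Tags are consumed first, so the keyword can never match inside one.
function boldBetweenTags(html, word) {
  const pattern = new RegExp('(<[^>]*>)|(' + escapeRegExp(word) + ')', 'gi');
  return html.replace(pattern, (match, tag) =>
    tag !== undefined ? tag : '<b>' + match + '</b>');
}

const sample = '<a href="google.com">visit google search engine</a>\n' +
  'you can search on google tatatata...';
console.log(boldBetweenTags(sample, 'google'));
```

As the answer warns, STYLE, SCRIPT and CDATA content is not handled, and a > inside an attribute value would end the tag match too early.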

score 0 · Answer 10 · answered Jul 27 '22 at 23:06

I can give you a hacky solution: pick a non-printable character that's not in your string, duplicate your buffer, and overwrite the tags in the duplicate with that character. Run the regex on the duplicate to find the position and length of each match. Now you know where to perform the replacement in the original buffer.
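A rough sketch of this masking idea, assuming NUL (\u0000) never appears in the input and that a tag is anything matching <[^>]*>. The names `replaceOutsideTags` and `wrap` are made up for illustration.

```javascript
// Escape regex metacharacters so `word` is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function replaceOutsideTags(html, word, wrap) {
  if (!word) return html;
  const FILLER = '\u0000'; // assumed not to occur in the input

  // Duplicate buffer: every tag is overwritten with filler of the same
  // length, so match positions line up with the original string.
  const masked = html.replace(/<[^>]*>/g, (tag) => FILLER.repeat(tag.length));

  // Find position and length of each match in the masked copy.
  const pattern = new RegExp(escapeRegExp(word), 'gi');
  const hits = [];
  let match;
  while ((match = pattern.exec(masked)) !== null) {
    hits.push({ start: match.index, end: match.index + match[0].length });
  }

  // Apply replacements to the original, right to left, so that earlier
  // positions stay valid while the string grows.
  let result = html;
  for (const { start, end } of hits.reverse()) {
    result = result.slice(0, start) + wrap(result.slice(start, end)) +
      result.slice(end);
  }
  return result;
}

const source = '<a href="google.com">visit google search engine</a>\n' +
  'you can search on google tatatata...';
console.log(replaceOutsideTags(source, 'google', (s) => '<b>' + s + '</b>'));
```

Keeping the mask the same length as the original is what lets the match offsets be reused directly.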
