7

I want to replace a string in an HTML page using JavaScript, but leave it alone if it is inside an HTML tag. For example:

<a href="google.com">visit google search engine</a>
you can search on google tatatata...

I want to replace google with <b>google</b>, but not inside the link, so that the result is:

<a href="google.com">visit google search engine</a>
you can search on <b>google</b> tatatata...

I tried with this one:

regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML =  el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');

The problem is that I get <b>google</b> inside the <a> tag:

<a href="google.com">visit <b>google</b> search engine</a>
you can search on <b>google</b> tatatata...

How can I fix this?

tvanfosson

10 Answers

6

You'd be better off using an HTML parser for this rather than a regex. I'm not sure a regex can do it 100% reliably.

Draemon
5

You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except when it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).

It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.

jhurshman
    I agree. Find all the text nodes in the DOM that contain the string. Keep a blacklist of tags that you **don't** want to replace the string in. Check if the text node is inside one of these tags. If not, do your replacement, otherwise leave it as is. – tvanfosson Jul 21 '09 at 11:45
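The approach described in this comment could be sketched roughly as follows. This is only an illustration, not code from the thread: the names SKIP_TAGS, collectTextNodes and boldWord are hypothetical, and the list of skipped tags (including A, to match the question) is an assumption you would adjust. The collecting step only reads nodeType, tagName, childNodes and nodeValue; the wrapping step needs a real browser DOM.

```javascript
// Tags whose text should be left alone (an assumption; adjust to your needs).
const SKIP_TAGS = new Set(['A', 'SCRIPT', 'STYLE', 'TEXTAREA']);

// Collect the text nodes under `root` that contain `word` (case-insensitive)
// and are not inside one of the skipped tags.
function collectTextNodes(root, word, out = []) {
  const needle = word.toLowerCase();
  for (const node of root.childNodes) {
    if (node.nodeType === 3) { // 3 === Node.TEXT_NODE
      if (node.nodeValue.toLowerCase().includes(needle)) out.push(node);
    } else if (node.nodeType === 1 && !SKIP_TAGS.has(node.tagName)) { // 1 === Node.ELEMENT_NODE
      collectTextNodes(node, word, out);
    }
  }
  return out;
}

// Wrap every occurrence of `word` in a <b> element.
// Browser only: relies on `document` and Text.splitText().
// Assumes lowercasing does not change string length (true for ASCII words).
function boldWord(root, word) {
  const needle = word.toLowerCase();
  for (const textNode of collectTextNodes(root, word)) {
    let node = textNode;
    let idx;
    while ((idx = node.nodeValue.toLowerCase().indexOf(needle)) !== -1) {
      const match = node.splitText(idx);    // `match` now starts at the word
      node = match.splitText(word.length);  // `node` is the text after the word
      const b = document.createElement('b');
      match.parentNode.replaceChild(b, match);
      b.appendChild(match);
    }
  }
}

// Usage in a page (hypothetical): boldWord(document.body, 'google');
```

Because the text is never re-parsed as HTML, attributes such as href cannot be touched by this kind of replacement.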
1

Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.

For more details, see this Stack Overflow question (and its answers).

Brian Agnew
1

I think you're all missing the question here...

When he says inside the tag, he means inside the opening tag itself, as in the <a href="google.com"> tag. This is quite different from text inside, say, a <p> </p> or <body> </body> tag pair. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, I'll come back and post.

1

WORKAROUND

If you can't use an HTML parser, or are quite confident about your HTML structure, try this:

  1. do the "bad" replacement
  2. repeatedly replace (<[^>]*)(<[^>]+>) with $1, as many times as you need

It's a simple workaround, but works for me.

Cons? Well... you have to do the replace twice for the case ... ...>, as each pass removes only the first unwanted tag from every tag on the page
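A minimal sketch of this workaround, assuming the word is google and the wrapper is <b> as in the question. The function name boldThenRepair is made up for illustration. Instead of repeating the cleanup a fixed number of times, the loop runs until a pass changes nothing. Text that contains a literal < character would likely confuse it.

```javascript
function boldThenRepair(html) {
  // Step 1: the "bad" replacement, which also hits attribute values.
  let out = html.replace(/google/gi, '<b>$&</b>');

  // Step 2: a tag that starts inside another, still unclosed tag is unwanted.
  // Each pass strips at most one such tag per outer tag, so repeat until stable.
  const nestedTag = /(<[^>]*)(<[^>]+>)/g;
  let previous;
  do {
    previous = out;
    out = out.replace(nestedTag, '$1');
  } while (out !== previous);
  return out;
}

const input = '<a href="google.com">visit google search engine</a>\n' +
              'you can search on google tatatata...';
console.log(boldThenRepair(input));
// <a href="google.com">visit <b>google</b> search engine</a>
// you can search on <b>google</b> tatatata...
```

Note that the link text is still wrapped; only the damage inside the opening tag is undone.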

[edit:] SOLUTION

Why not use jQuery, put the html code into the page and do something like this:

$(containerOrSth).find('a').each(function () {
  if ($(this).children().length == 0) {
    // No child elements: safe to rewrite the link text directly.
    $(this).text($(this).text().replace('google', 'evil'));
  } else {
    // Here you have to take care of child tags, but you need to know where to expect them: before or after the text. Comment for more help.
  }
});
naugtur
1

I'm using regex = new RegExp("(?=[^>]*<)google", 'i');

George WB
  • This lookahead works for my case. Please note that the replacement only works if one opening tag follows the keyword 'google' (which should always be the case for valid HTML). I also added 'g' flag so that multiple occurrences of 'google' inside the same tag are correctly replaced. – walderich Jul 29 '21 at 17:06
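To illustrate the lookahead together with the 'g' flag mentioned in the comment, here is a small sketch. The function name boldOutsideTags and the sample markup are assumptions for the example. The lookahead only accepts an occurrence if a < is reached before any >, which is why an occurrence inside an opening tag is skipped, and why trailing text with no tag after it would not be matched at all.

```javascript
function boldOutsideTags(html) {
  // (?=[^>]*<) : from here, a '<' must come before any '>'.
  return html.replace(/(?=[^>]*<)google/gi, '<b>$&</b>');
}

const sample = '<p><a href="google.com">visit google search engine</a> ' +
               'you can search on google tatatata...</p>';
console.log(boldOutsideTags(sample));
// <p><a href="google.com">visit <b>google</b> search engine</a> you can search on <b>google</b> tatatata...</p>
```

The href is left alone, but the text inside the link is still wrapped, so this addresses the "inside the opening tag" reading of the question rather than skipping the whole link.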
0

You can't really do that: your "google" is always in some tag, so either replace all occurrences or none.

skrat
0

Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might check for that case only, mainly by making sure you don't hit a trailing </a> tag before a fresh <a>.

Grubsnik
0

You can do that using a regex, but filtering blocks like STYLE, SCRIPT and CDATA needs more work and is not implemented in the following solution.

Most of the answers state that 'your data is always in some tags', but they are missing the point: the data is always 'between' some tags, and you want to filter out the cases where it is 'in' a tag.

Note that tag characters in inline scripts will likely break this, so if they exist, they should be processed separately with this method. Take a look here:
complex html string.replace function

DarkWingDuck
0

I can give you a hacky solution. Pick a non-printable character that's not in your string. Duplicate your buffer, then overwrite the tags in the duplicate with that character. Run the regex on the duplicate to find the position and length of each match. Now you know where to perform the replacement in the original buffer.
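One way this masking idea might look in code, as a sketch only. The function name boldUsingMask is hypothetical, NUL (\u0000) is assumed to be a character that never occurs in the input, and a tag is taken to be anything matching <[^>]*>, which will not hold for scripts or attribute values containing >.

```javascript
function boldUsingMask(html, word) {
  // Same-length copy with every tag overwritten, so match indexes line up
  // with the original string.
  const masked = html.replace(/<[^>]*>/g, (tag) => '\u0000'.repeat(tag.length));

  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(escaped, 'gi');

  let result = '';
  let last = 0;
  for (const m of masked.matchAll(pattern)) {
    const end = m.index + m[0].length;
    // Copy from the ORIGINAL buffer, wrapping the matched slice.
    result += html.slice(last, m.index) + '<b>' + html.slice(m.index, end) + '</b>';
    last = end;
  }
  return result + html.slice(last);
}

const page = '<a href="google.com">visit google search engine</a>\n' +
             'you can search on google tatatata...';
console.log(boldUsingMask(page, 'google'));
// <a href="google.com">visit <b>google</b> search engine</a>
// you can search on <b>google</b> tatatata...
```

Unlike a lookahead for a following <, this also handles trailing text that has no tag after it.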

pico