3

I am trying to replace text in HTML content using a regular expression.

from

<A HREF="ZZZ">test test ZZZ<SPAN>ZZZ test test</SPAN></A>

to

<A HREF="ZZZ">test test AAA<SPAN>AAA test test</SPAN></A>

Note that only the words outside HTML tags are replaced from ZZZ to AAA; the ZZZ inside the HREF attribute stays unchanged.

Any idea? Thanks a lot in advance.

iwan
  • 2
    please read the first answer to this question: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mat May 18 '11 at 07:31
  • 1
    Thanks Mat for the referral. After reading the link, I've simplified the question, since I know the HTML will be "regular" type of HTML. – iwan May 18 '11 at 07:40
  • 1
    Then you misread that link. Don't use regex to parse HTML, it's too complex. Use an (X)HTML parser. – Mat May 18 '11 at 07:42

5 Answers

7

You could walk all the nodes and replace text only in the text nodes (nodeType === 3).

With jQuery, something like:

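// Note: find() only returns descendants, so text nodes that are direct
// children of `element` itself are not visited here.
element.find('*:contains(ZZZ)').contents().each(function () {
    if (this.nodeType === 3)  // text nodes only
        this.nodeValue = this.nodeValue.replace(/ZZZ/g, 'AAA');
});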

Or the same without jQuery:

function replaceText(element, from, to) {
    for (var child = element.firstChild; child !== null; child = child.nextSibling) {
        if (child.nodeType === 3)        // text node: replace in place
            child.nodeValue = child.nodeValue.replace(from, to);
        else if (child.nodeType === 1)   // element node: recurse into its children
            replaceText(child, from, to);
    }
}

replaceText(element, /ZZZ/g, 'AAA');
Suor
  • `textContent` isn't universally supported for text nodes. Use either `nodeValue` or `data` properties instead. Also, if the first parameter passed to a string's `replace()` method is a string, only the first occurrence of that string will be replaced. Use a regular expression with the global flag (e.g. `/ZZZ/g`) instead to replace all occurrences. – Tim Down May 18 '11 at 08:31
  • Hi Suor, your JS function answers the question I described, but I just realized that I oversimplified the actual issue. I need to change from ZZZ to something like – iwan May 18 '11 at 14:25
  • Then you should create that span node and insert it there. You can do `var html = this.nodeValue.replace(from, to); $(this).replaceWith(html)` instead of simple nodeValue assignment. And it would be trickier but possible without jQuery. – Suor May 21 '11 at 05:59
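
Tim Down's point about string patterns versus regular expressions can be checked with a short snippet. This is only an illustration; the variable names are made up for the example.

```javascript
var text = 'ZZZ test ZZZ';

// A string pattern replaces only the first occurrence.
var firstOnly = text.replace('ZZZ', 'AAA');
console.log(firstOnly);   // AAA test ZZZ

// A regular expression with the global flag replaces every occurrence.
var all = text.replace(/ZZZ/g, 'AAA');
console.log(all);         // AAA test AAA
```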
2

The best idea in this case is most certainly not to use regular expressions for this, at least not on their own. JavaScript surely has an HTML parser somewhere?

If you really must use regular expressions, you could try to look for every instance of ZZZ that is followed by a "<" before any ">". That would look like

ZZZ(?=[^>]*<)

This might break horribly if the code contains HTML comments or script blocks, or is not well formed.
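
A rough sketch of that lookahead in JavaScript, run against the sample markup from the question. The helper name `replaceOutsideTags` is made up for illustration, and it assumes the search word contains no regex metacharacters.

```javascript
// Hypothetical helper: replace `word` only where a "<" follows it before
// any ">", i.e. where the match sits in text rather than inside a tag.
// Assumes `word` contains no regex metacharacters.
function replaceOutsideTags(html, word, replacement) {
    var pattern = new RegExp(word + '(?=[^>]*<)', 'g');
    return html.replace(pattern, replacement);
}

var input = '<A HREF="ZZZ">test test ZZZ<SPAN>ZZZ test test</SPAN></A>';
console.log(replaceOutsideTags(input, 'ZZZ', 'AAA'));
// <A HREF="ZZZ">test test AAA<SPAN>AAA test test</SPAN></A>

// The caveat in practice: script contents look like ordinary text to the regex.
console.log(replaceOutsideTags('<script>var ZZZ = 1;</script>', 'ZZZ', 'AAA'));
// <script>var AAA = 1;</script>
```

The second call shows why this is fragile: the pattern cannot tell a script block from regular text, so the variable name inside the script is rewritten as well.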

Jens
  • yep, that makes sense - HTMLcomments, script blocks, or any other xhtml CDATA would mess things up no matter what regexp you come up with – Tao May 18 '11 at 07:40
0

Assuming a well-formed html document with outer/enclosing tags like <html>, I would think the easiest way would be to look for the > and < signs:

/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/$1AAA$2/

If you're dealing with HTML fragments that may not have enclosing tags, it gets a little more complicated: you'd have to allow for the start and the end of the string as well.

Example JS (sorry, missed the tag):

alert('<A HREF="ZZZ">test test ZZZ<SPAN>ZZZ test test</SPAN></A>'.replace(/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/g, "$1AAA$2"));

Explanation: for each match that

  • starts with >: \>
  • follows with any number of characters that are neither > nor <: [^\>\<]*
  • then has "ZZZ"
  • follows with any number of characters that are neither > nor <: [^\>\<]*
  • and ends with <: \<

Replace with

  • everything before the ZZZ, marked with the first capture group (parentheses): $1
  • AAA
  • everything after the ZZZ, marked with the second capture group (parentheses): $2

Using the "g" (global) option to ensure that all possible matches are replaced.
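
For the fragment case mentioned earlier, one possible variant is sketched below. Note that it swaps the two capture groups for a lookahead, so it is a different pattern from the one explained above, and the function name `replaceInFragment` is made up for illustration. It treats ZZZ as text when only non-bracket characters separate it from the next "<" or from the end of the string.

```javascript
// Hypothetical variant for fragments without enclosing tags.
// "ZZZ" is replaced when only non-bracket characters lie between it
// and the next "<" or the end of the string.
function replaceInFragment(html) {
    return html.replace(/ZZZ(?=[^<>]*(?:<|$))/g, 'AAA');
}

console.log(replaceInFragment('ZZZ and <B>ZZZ</B> and ZZZ ZZZ'));
// AAA and <B>AAA</B> and AAA AAA

console.log(replaceInFragment('<A HREF="ZZZ">ZZZ</A>'));
// <A HREF="ZZZ">AAA</A>
```

Because a lookahead does not consume the characters it inspects, several occurrences inside the same text run are all replaced. The same caveats about comments, script blocks, and malformed markup still apply.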

Tao
  • Thanks TAO, you're great.. if you have brief explanation about the regex it will be helpful, thanks again... – iwan May 18 '11 at 15:21
  • Done, hope that helps. I would recommend you use a DOM-traversal method if possible though, as outlined in @Suor's answer and @Tim Down's comment; this type of solution will always be more reliable. As @Jens noted, the regexp solution in this answer will probably break under some circumstances. – Tao May 18 '11 at 15:34
0

Try this:

var str = '<DIV>ZZZ test test</DIV><A HREF="ZZZ">test test ZZZ</A>';
var rpl = str.match(/href=\"(\w*)\"/i)[1];
console.log(str.replace(new RegExp(rpl + "(?=[^>]*<)", "gi"), "XXX"));
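
When the search term is built into a RegExp from a string like this, any regex metacharacters in it change the meaning of the pattern. A minimal sketch of one way to guard against that follows; `escapeRegExp` is a hypothetical helper written for this example, not a built-in, and the sample term `a.b` is made up.

```javascript
// Hypothetical helper: escape regex metacharacters so the string matches literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var str = '<DIV>a.b test</DIV><A HREF="a.b">test axb a.b</A>';
var term = 'a.b';  // without escaping, "." would also match the "x" in "axb"
var pattern = new RegExp(escapeRegExp(term) + '(?=[^>]*<)', 'g');
var result = str.replace(pattern, 'XXX');
console.log(result);
// <DIV>XXX test</DIV><A HREF="a.b">test axb XXX</A>
```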
jerone
0

Have you tried this?

replace:

>([^<>]*)(ZZZ)([^<>]*)<

with:

>$1AAA$3<

but beware all the savvy suggestions in the post linked in the first comment to your question!

sergio
  • beware!? i did not mean it, sorry... take into account all the savvy suggestions.... – sergio May 18 '11 at 07:51
  • hi sergio, thank you for your suggestion, i like your idea.. it almost works, but not perfect yet , it gave me -- < A HREF="ZZZ"/>test test AAAZZZ test test< /A > – iwan May 18 '11 at 15:09
  • try this link: http://regexr.com?2tpq1, it is flash-based regex engine initialized with my suggestion... it seems to work ok... are you using the "g" flag (for global replace)? – sergio May 18 '11 at 15:21