Regular expression does not work

Question

I am using the following regular expression in JavaScript:

comment_body_content = comment_body_content.replace(
  /(<span id="sc_start_commenttext-.*<\/span>)((.|\s)*)(<span id="sc_end_commenttext-.*<\/span>)/,
  "$1$4"
);

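I want to find the tag `<span id="sc_start_commenttext-330"></span>` (the number is always different) and the tag `<span id="sc_end_commenttext-330"></span>` in my HTML code. The text and HTML code between those two tags should then be deleted, and the result returned.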

Example before replacing:

Some text and code
<span id="sc_start_commenttext-330"></span>Some text and code<span id="sc_end_commenttext-330"></span>
Some Text and code

Example after replacing:

Some text and code
<span id="sc_start_commenttext-330"></span><span id="sc_end_commenttext-330"></span>
Some text and code

Sometimes my regular expression works and replaces the text correctly, and sometimes it does not. Is there a mistake? Thank you for your help!

Alex

Why do you use [regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) when you are using JavaScript, which is arguably the language in which proper DOM manipulation is easiest? – Martin Ender, Dec 09 '12 at 16:13
You are right, but the HTML is generated by a CMS and it is not easy to change it that way (the code is not valid), so I decided to do it like this... – user1711384, Dec 09 '12 at 16:15
@user1711384 If the HTML is invalid, that makes it even harder for a regex to deal with it (while the DOM parser might be able to handle it anyway). Can you be 100% sure that the tags are always `<span id="sc_start_commenttext-330"></span>` without any extraneous spaces? – Martin Ender, Dec 09 '12 at 16:21
I do not feel that an upvote of the comment by @m.buettner is strong enough so I will repeat it: [DO NOT PARSE HTML WITH A REGEX. EVER.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – jbabey, Dec 09 '12 at 16:40
@jbabey I do not agree with the "ever". Some problems are not tied to the full complexity of HTML, but just happen to be done on an HTML file, in which case the actual problem might become regular. Then again, it's probably not actually "parsing HTML". But if you know what you are doing, you can run a regex on an HTML file. Just wanted to point that out. – Martin Ender, Dec 09 '12 at 16:44
Run a regular expression over the DOM and be afraid of all the unbinding you did. You can constrain it all you want, but you can't deny this fact. – Alexander, Dec 09 '12 at 16:45

score 2 · Accepted Answer · answered Dec 09 '12 at 16:14

You should use a pattern that matches the start with its corresponding end, for example:

/(<span id="sc_start_commenttext-(\d+)"><\/span>)[^]*?(<span id="sc_end_commenttext-\2"><\/span>)/

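Here \2 in the end tag is a backreference to the string matched by (\d+), which is the digits 330 in the start tag, so a start tag is only paired with the end tag that carries the same number. In JavaScript, [^] is a short expression for any character, including line breaks, and the lazy *? stops at the first matching end tag instead of the last one. Because the groups are numbered differently here, the replacement string becomes "$1$3".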
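
To make the difference visible, here is a minimal sketch that runs the pattern from the question and the pattern above over an input with two comment blocks. The input string and the variable names are made up for illustration, and the `g` flag is an addition (not part of the answer) so that every block is handled rather than only the first.

```javascript
// Hypothetical input with two comment blocks (ids 330 and 331).
var html = [
  'Intro',
  '<span id="sc_start_commenttext-330"></span>first<span id="sc_end_commenttext-330"></span>',
  'Keep me',
  '<span id="sc_start_commenttext-331"></span>second<span id="sc_end_commenttext-331"></span>',
  'Outro'
].join("\n");

// Pattern from the question: the greedy groups pair the first start tag
// with the LAST end tag, so text outside the blocks can be removed too.
var original = /(<span id="sc_start_commenttext-.*<\/span>)((.|\s)*)(<span id="sc_end_commenttext-.*<\/span>)/;
var greedyResult = html.replace(original, "$1$4");

// Pattern from the answer, with the g flag added as an assumption.
var paired = /(<span id="sc_start_commenttext-(\d+)"><\/span>)[^]*?(<span id="sc_end_commenttext-\2"><\/span>)/g;
var pairedResult = html.replace(paired, "$1$3");

console.log(greedyResult.indexOf("Keep me") === -1); // "Keep me" was lost
console.log(pairedResult);
```

With the question's pattern, "Keep me" disappears while "first" survives, which would explain why it only works some of the time. The paired pattern empties both blocks and leaves the surrounding lines alone.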

Gumbo


`[^]` ... does that really work? I think most engines would not treat `]` as the closing bracket if there are no characters inside, thus throwing a compilation error. "Any character" is usually matched with `[\s\S]` – Martin Ender Dec 09 '12 at 16:23
Wow, it seems to work in Chrome, too. JavaScript, you never cease to surprise me. – Martin Ender Dec 09 '12 at 16:26
@m.buettner From the ECMAScript 5 specification: “The production *CharacterClass* :: `[` `^` *ClassRanges* `]` evaluates by evaluating *ClassRanges* to obtain a CharSet and returning that CharSet and the Boolean true.” (15.10.2.13) and “The production *ClassRanges* :: [empty] evaluates by returning the empty CharSet.” (15.10.2.14) – Gumbo Dec 09 '12 at 16:32
@Gumbo cheers. For most other regex engines, empty character classes are not allowed, which enables you to leave `]` unescaped inside the character class if it's the first character. (I just tested this with PCRE, .NET and Java) ... good to know that JavaScript has this quirkiness, because it's also the only major engine that doesn't have a `dotall` option. – Martin Ender Dec 09 '12 at 16:35
Gumbo, I just found an error with your code in IE 8. This is my current code: `comment_body_content.replace(/(<span id="sc_start_commenttext-(\d+)"><\/span>)[^]*?(<span id="sc_end_commenttext-\2"><\/span>)/, "$1$3");` IE 8 seems to have a problem with the `\2`, and now the whole script is not working there. Do you have any idea how to solve that? Thanks a lot! – user1711384 Apr 25 '13 at 17:22
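
The IE 8 report is not resolved in this thread. One possibility, which is an assumption and not something the thread confirms, is that IE 8 trips over `[^]` rather than `\2`, since older engines did not all treat the empty negated class the way the specification quoted above describes. A minimal sketch of a variant that avoids `[^]` by using `[\s\S]`, the idiom mentioned in the first comment, could look like this. The function name `stripCommentText` and the sample string are hypothetical.

```javascript
// Same pairing idea as the accepted answer, with [\s\S] in place of [^].
function stripCommentText(html) {
  var re = /(<span id="sc_start_commenttext-(\d+)"><\/span>)[\s\S]*?(<span id="sc_end_commenttext-\2"><\/span>)/;
  // replace() returns a new string; the caller has to keep the result.
  return html.replace(re, "$1$3");
}

// Hypothetical input whose removable text spans two lines.
var sample =
  'A\n<span id="sc_start_commenttext-7"></span>line one\n' +
  'line two<span id="sc_end_commenttext-7"></span>\nB';

var cleaned = stripCommentText(sample);
console.log(cleaned);
```

Whether this cures the IE 8 problem would need to be checked in that browser. In a current engine the two spellings behave the same.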

score 1 · Answer 2 · answered Dec 09 '12 at 16:42

Using DOM.

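var $spans = document.getElementsByTagName("span");
var str = "";

for (var i = 0, $span, $sibling; i < $spans.length; ++i) {
    $span = $spans[i];
    // Only act on the start markers.
    if (/^sc_start_commenttext/i.test($span.id)) {
        // Remove every following sibling until the end marker is reached.
        while (($sibling = $span.nextSibling)) {
            if (/^sc_end_commenttext/i.test($sibling.id)) {
                break;
            }
            // Elements have no .data, so collect their markup instead.
            str += $sibling.nodeType === 1 ? $sibling.outerHTML : $sibling.data;
            $span.parentNode.removeChild($sibling);
        }
    }
}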

console.log("The enclosed string was: ", str);

Here you have it.

score 0 · Answer 3 · answered Dec 09 '12 at 16:14


I would start by replacing .* with [0-9]+"> in both tags, if I understand your intention correctly.
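
As a rough sketch of that suggestion, the pattern from the question with the substitution applied could be tried like this. The input line is made up, and the second pattern (lazy quantifier plus the `g` flag) is an addition for comparison that is not part of this answer.

```javascript
// The question's pattern with `.*` replaced by `[0-9]+">` in both tags.
var tightened = /(<span id="sc_start_commenttext-[0-9]+"><\/span>)((.|\s)*)(<span id="sc_end_commenttext-[0-9]+"><\/span>)/;

// Same pattern, but with a lazy middle group and the g flag (an assumption).
var tightenedLazy = /(<span id="sc_start_commenttext-[0-9]+"><\/span>)((.|\s)*?)(<span id="sc_end_commenttext-[0-9]+"><\/span>)/g;

// Hypothetical single-line input with two comment blocks.
var line =
  '<span id="sc_start_commenttext-1"></span>A<span id="sc_end_commenttext-1"></span>' +
  '<span id="sc_start_commenttext-2"></span>B<span id="sc_end_commenttext-2"></span>';

var greedyOut = line.replace(tightened, "$1$4");
var lazyOut = line.replace(tightenedLazy, "$1$4");

console.log(greedyOut);
console.log(lazyOut);
```

The tightened tags keep the first group from running past the start tag, but the greedy middle group still reaches the last end tag when several blocks are present. The lazy variant keeps each block's own pair of tags.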


Grzegorz Gierlik


score 0 · Answer 4 · answered Dec 11 '12 at 19:15

I agree that it's normally a bad idea to use regexp to parse HTML, but it can be used effectively on non-nested HTML.

Using RegExp:

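var str = 'First text and code<span id="sc_start_commenttext-330"></span>Remove text<span id="sc_end_commenttext-330"></span>Last Text and code';
var re = /(.*<span id="sc_start_commenttext-\d+"><\/span>).*(<span id="sc_end_commenttext-\d+"><\/span>.*)/;
str.replace(re, "$1$2");

Result:

First text and code<span id="sc_start_commenttext-330"></span><span id="sc_end_commenttext-330"></span>Last Text and code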

. (dot) matches any character except a newline, so it might be a good idea to remove any whitespace first... – Terje Rosenlund, Dec 11 '12 at 19:21
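
The point about the dot can be checked with a short sketch. The pattern below is an assumed dot-based pattern in the style of this answer, and the input is made up. Line breaks are removed first, in the spirit of the comment.

```javascript
// Assumed dot-based pattern in the style of the answer above.
var re = /(<span id="sc_start_commenttext-\d+"><\/span>).*(<span id="sc_end_commenttext-\d+"><\/span>)/;

// Hypothetical input where the text to remove spans two lines.
var multiLine =
  'First<span id="sc_start_commenttext-330"></span>Remove\n' +
  'this too<span id="sc_end_commenttext-330"></span>Last';

// `.` stops at the line break, so nothing matches and the string is unchanged.
var untouched = multiLine.replace(re, "$1$2");

// After the line breaks are replaced with spaces, the same pattern matches.
var flattened = multiLine.replace(/[\r\n]+/g, " ").replace(re, "$1$2");

console.log(untouched === multiLine);
console.log(flattened);
```

Flattening changes every line break in the input, not only those between the two tags, so it is only suitable when the layout of the surrounding markup does not matter.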
