0

I am using the following regular expression in JavaScript:

comment_body_content = comment_body_content.replace(
  /(<span id="sc_start_commenttext-.*<\/span>)((.|\s)*)(<span id="sc_end_commenttext-.*<\/span>)/,
  "$1$4"
);

I want to find the tag <span id="sc_start_commenttext-330"></span> (the number is different every time) and the matching tag <span id="sc_end_commenttext-330"></span> in my HTML code. The text and HTML code between those two tags should be deleted, and the remaining string returned.

Example before replacing:

Some text and code
<span id="sc_start_commenttext-330"></span>Some text and code<span id="sc_end_commenttext-330"></span>
Some text and code

Example after replacing:

Some text and code
<span id="sc_start_commenttext-330"></span><span id="sc_end_commenttext-330"></span>
Some text and code

Sometimes my regular expression works and replaces the text correctly, and sometimes it does not. Is there a mistake? Thank you for your help!

Alex

dda
user1711384
  • 4
    Why do you use [regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) when you are using JavaScript - which is arguably the language in which proper DOM manipulation is easiest. – Martin Ender Dec 09 '12 at 16:13
  • These `spans` have no content? – Grzegorz Gierlik Dec 09 '12 at 16:14
  • You are right, but the HTML is generated by a CMS and it is not easy to change it that way (the code is not valid), so I decided to do it like this... – user1711384 Dec 09 '12 at 16:15
  • yes, the spans are empty, they are only some "markers" – user1711384 Dec 09 '12 at 16:16
  • @user1711384 if the HTML is invalid that makes it even harder for regex to deal with it (while the DOM parser might be able to handle it anyway). Can you be 100% sure that the tags always have exactly the form shown in the question, without any extraneous spaces? – Martin Ender Dec 09 '12 at 16:21
  • 3
    I do not feel that an upvote of the comment by @m.buettner is strong enough so I will repeat it: [DO NOT PARSE HTML WITH A REGEX. EVER.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – jbabey Dec 09 '12 at 16:40
  • @jbabey I do not agree with the "ever". Some problems are not tied to the full complexity of HTML, but just happen to be done on an HTML file, in which case the actual problem might become regular. Then again, it's probably not actually "parsing HTML". But if you know what you are doing, you can run a regex on an HTML file. Just wanted to point that out. – Martin Ender Dec 09 '12 at 16:44
  • Run a regular expression over the DOM and be afraid of all the unbinding you did. You can constrain it all you want, but you can't deny this fact. – Alexander Dec 09 '12 at 16:45

4 Answers

2

You should use a pattern that matches the start with its corresponding end, for example:

/(<span id="sc_start_commenttext-(\d+)"><\/span>)[^]*?(<span id="sc_end_commenttext-\2"><\/span>)/

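Here \2 in the end tag refers back to whatever (\d+) matched, for example the digits 330 in the start tag, so a start marker is only paired with the end marker that carries the same number. [^] is a JavaScript-specific expression for any character, including line breaks, and *? makes the match lazy so that it stops at the first matching end marker. Because the end marker is now the third group, the replacement string becomes "$1$3".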
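As a rough sketch of how this pattern behaves on the example from the question: the version below swaps [^] for [\s\S], which the comments mention as the usual spelling of "any character" in other engines. The function name stripMarkedContent and the g flag (to handle more than one marker pair) are assumptions added for illustration and are not part of the original answer.

```javascript
// Hypothetical helper: empties every start/end marker pair whose numbers match.
function stripMarkedContent(html) {
  // \2 must repeat the digits captured by (\d+); [\s\S]*? lazily matches any
  // characters, including line breaks, up to the first matching end marker.
  var re = /(<span id="sc_start_commenttext-(\d+)"><\/span>)[\s\S]*?(<span id="sc_end_commenttext-\2"><\/span>)/g;
  return html.replace(re, "$1$3");
}

var input = 'Some text and code\n' +
  '<span id="sc_start_commenttext-330"></span>Some text\nand code<span id="sc_end_commenttext-330"></span>\n' +
  'Some text and code';

var output = stripMarkedContent(input);
console.log(output);
```

Markers whose numbers differ are left untouched, because the backreference cannot match.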

Gumbo
  • `[^]` ... does that really work? I think most engines, would not treat `]` as the closing bracket if there are no characters inside, thus throwing a compilation error. "any character" is usually matched with `[\s\S]` – Martin Ender Dec 09 '12 at 16:23
  • Wow, it seems to work in Chrome, too. JavaScript, you never cease to surprise me. – Martin Ender Dec 09 '12 at 16:26
  • @m.buettner From the ECMAScript 5 specification: “The production *CharacterClass* :: `[` `^` *ClassRanges* `]` evaluates by evaluating *ClassRanges* to obtain a CharSet and returning that CharSet and the Boolean true.” (15.10.2.13) and “The production *ClassRanges* :: [empty] evaluates by returning the empty CharSet.” (15.10.2.14) – Gumbo Dec 09 '12 at 16:32
  • @Gumbo cheers. For most other regex engines empty character classes are not allowed, which enables you to leave `]` unescaped inside the character class if it's the first character. (I just tested this with PCRE, .NET and Java) ... good to know that JavaScript has this quirkiness, because it's also the only major engine that doesn't have a `dotall` option. – Martin Ender Dec 09 '12 at 16:35
  • Gumbo, I just identified an error with your code in IE 8. This is my current code: `comment_body_content.replace(/(<span id="sc_start_commenttext-(\d+)"><\/span>)[^]*?(<span id="sc_end_commenttext-\2"><\/span>)/, "$1$3");` IE 8 seems to have a problem with the \2, and now the whole JS is not working there. Do you have any idea how to solve that? Thanks a lot! – user1711384 Apr 25 '13 at 17:22
1

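Using the DOM:

var $spans = document.getElementsByTagName("span");
var str = "";

for (var i = 0, $span, $sibling; i < $spans.length; ++i) {
    $span = $spans[i];
    // Only the start markers are of interest.
    if (/^sc_start_commenttext/i.test($span.id)) {
        // Remove every following sibling until the end marker is reached.
        while (($sibling = $span.nextSibling)) {
            // Text nodes have no id, so check that one exists before testing it.
            if ($sibling.id && /^sc_end_commenttext/i.test($sibling.id)) {
                break;
            }
            // textContent works for text nodes and elements alike
            // (.data would be undefined for elements).
            str += $sibling.textContent;
            $span.parentNode.removeChild($sibling);
        }
    }
}

console.log("The enclosed string was: ", str);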

Here you have it.

Alexander
0

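I would start by replacing .* with [0-9]+"> in both tags, if I understand your intention correctly. The greedy .* can run past the end of the marker tag, whereas [0-9]+"> only matches the number and the rest of the opening tag.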
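To illustrate why the change matters, here is a small sketch comparing the pattern from the question with the adjusted one. The sample input (marked content that itself contains a span on the same line) and the variable names are assumptions made up for this demonstration.

```javascript
// Assumed sample input: the marked content contains its own <span> element.
var input = '<span id="sc_start_commenttext-330"></span>Some <span>text</span> and code<span id="sc_end_commenttext-330"></span>';

// Pattern from the question: the greedy .* in the first group runs on to the
// last </span> that still leaves room for an end marker.
var original = /(<span id="sc_start_commenttext-.*<\/span>)((.|\s)*)(<span id="sc_end_commenttext-.*<\/span>)/;

// Same pattern with .* replaced by [0-9]+"> as suggested above.
var adjusted = /(<span id="sc_start_commenttext-[0-9]+"><\/span>)((.|\s)*)(<span id="sc_end_commenttext-[0-9]+"><\/span>)/;

var withOriginal = input.replace(original, "$1$4");
var withAdjusted = input.replace(adjusted, "$1$4");

console.log(withOriginal); // part of the content survives
console.log(withAdjusted); // only the two markers remain
```

With the original pattern, "Some <span>text</span>" ends up inside the first group and is kept. Note that the middle group in the adjusted pattern is still greedy, so with several marker pairs in one string it would run to the last end marker; a lazy quantifier avoids that.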

Grzegorz Gierlik
0

I agree that it's normally a bad idea to use regular expressions to parse HTML, but they can be used effectively on non-nested HTML.

Using RegExp:

var str = 'First text and code<span id="sc_start_commenttext-330"></span>Remove text<span id="sc_end_commenttext-330"></span>Last Text and code';
var re = /(.*<span id="sc_start_commenttext-\d+"><\/span>).*(<span id="sc_end_commenttext-\d+"><\/span>.*)/;
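// replace() returns a new string, so keep the result. Note that . does not match
// line breaks, so this pattern only works when everything is on one line.
var result = str.replace(re, "$1$2");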

Result:

First text and code<span id="sc_start_commenttext-330"></span><span id="sc_end_commenttext-330"></span>Last Text and code

Terje Rosenlund