To the last tag (already in a string) RegEx

Question

I do not know what I am doing wrong. I have this string that I want to replace

<?xml version="1.0" encoding="utf-8" ?>
 <Sections>
  <Section>

I am using regex to replace everything up to and including <Section>, and leave the rest untouched.

arrayValues[index].replace("/[([.,\n,\s])*<Section>]/", "---");

What is wrong with my regex? Doesn't this mean replace every character, including newlines and spaces, up to and including <Section> with ---?

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Kristian, Aug 06 '13 at 18:23
@kristian http://meta.stackexchange.com/questions/182189/please-stop-linking-to-the-zalgo-anti-cthulhu-regex-rant, besides, regex CAN parse html just fine, despite popular belief. — user428517, Aug 06 '13 at 18:26
I understand that regex can't be used with tags, but I already made the XML doc into a string. I also checked arrayValues[index] with typeof to make sure each value is a string. — Jack Thor, Aug 06 '13 at 18:26
You should be processing this using [DOM](http://en.wikipedia.org/wiki/Document_Object_Model), not regex. You are collapsing all whitespace, which is creating this: `---------------`..etc — Alex W, Aug 06 '13 at 18:27
@Kristian You'd be better off linking to that answer, I've seen people think the Q is different so the answer doesn't apply... — Basic, Aug 06 '13 at 18:30
Well, I am uploading the XML doc on the client side using a reader and am using readAsText, so it returns a string. — Jack Thor, Aug 06 '13 at 18:31
@JackThor The problem isn't that it can't be used with tags - all html/xml docs with tags are just specially formatted strings - the problem is that the layout and formatting of those tags in the string has meaning more complex than a regex can cope with (at least, unless you're talking about thousands of characters in your regex). It's just the wrong tool for the job — Basic, Aug 06 '13 at 18:31
regex is a **perfectly fine** tool for this job. he's got a bit of text, and wants to replace some of it. regex will work perfectly well for this; in fact, that's what it's designed to do. — user428517, Aug 06 '13 at 18:35
@sgroves, regexes can help *lex* HTML but they can't be used to correctly *parse* HTML, because [whether `<![CDATA[` starts a CDATA section](http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#cdata-sections) depends on whether you are in a foreign XML context and determining that requires matching end tags with start tags which is not doable solely with regular expressions, not even when extended with back-references. Yes, you can hack something together to solve a particular problem on a subset of HTML, but that's not the same as parsing HTML. — Mike Samuel, Aug 06 '13 at 18:52
@MikeSamuel correct, i technically meant lex not parse. the vast majority of regex/html questions here talk about lexing, and 99% of the time someone links to that cthulu post it's entirely inapplicable to the problem. — user428517, Aug 06 '13 at 18:55

score 2 · Accepted Answer · edited May 23 '17 at 12:28

First of all, you need to remove the quotes around your regex. With the quotes in place, the argument is not processed as a regex: JavaScript sees a string (because it is a string) and tries to match it literally, slashes included.
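To see the difference, here is a minimal sketch. The `xml` value is a made-up stand-in for `arrayValues[index]`, and the pattern is deliberately reduced to just the tag so that only the quoting changes between the two calls.

```javascript
// Hypothetical sample input standing in for arrayValues[index].
const xml = '<?xml version="1.0" encoding="utf-8" ?>\n <Sections>\n  <Section>';

// Quoted: replace() looks for the literal text "/<Section>/" (slashes included).
// That text does not occur in the input, so the string comes back unchanged.
const withString = xml.replace("/<Section>/", "---");

// Unquoted: a regex literal, so the tag itself is matched and replaced.
const withRegex = xml.replace(/<Section>/, "---");

console.log(withString === xml);       // true
console.log(withRegex.endsWith("---")); // true
```

Note that replace returns a new string and leaves the original untouched, so the result has to be assigned somewhere (for example back into `arrayValues[index]`).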

Now that that's taken care of, we can simplify your regex a bit:

arrayValues[index].replace(/[\s\S]*?<Section>/, "---");

[\s\S] gets around the lack of an s flag in older JavaScript engines (a handy option, supported by most languages, that enables . to match newlines; JavaScript only gained it in ES2018). \s does match newlines (even without an s flag specified), so the character class [\s\S] tells the regex engine to match:

\s - a whitespace character, which could be a newline

OR

\S - a non-whitespace character

So you can think of [\s\S] as matching . (any character except a newline) or the literal \n (a newline). See "Javascript regex multiline flag doesn't work" for more.
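A quick way to convince yourself, as a sketch with a made-up two-line input (`text` is not from the question):

```javascript
// Hypothetical two-line input; the tag sits on the second line.
const text = "line one\nline two <Section>";

// . cannot cross the newline, so the match can only start on the second line.
const dotMatch = text.match(/.*<Section>/)[0];

// [\s\S] matches any character, newline included, so the match starts at index 0.
const anyMatch = text.match(/[\s\S]*<Section>/)[0];

console.log(JSON.stringify(dotMatch)); // "line two <Section>"
console.log(JSON.stringify(anyMatch)); // "line one\nline two <Section>"
```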

? is used to make the initial [\s\S]* match non-greedy, so the regex engine will stop once it hits the first occurrence of <Section>.
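Whether you want the ? depends on which <Section> you are aiming for. Here is a rough comparison on a made-up input with two <Section> tags (`doc` is hypothetical, not the asker's file):

```javascript
// Hypothetical input containing two <Section> opening tags.
const doc = "<Sections>\n <Section>a</Section>\n <Section>b</Section>";

// Lazy: stops at the FIRST <Section>.
const lazy = doc.replace(/[\s\S]*?<Section>/, "---");

// Greedy: consumes as much as possible, so it ends at the LAST <Section>.
const greedy = doc.replace(/[\s\S]*<Section>/, "---");

console.log(JSON.stringify(lazy));   // "---a</Section>\n <Section>b</Section>"
console.log(JSON.stringify(greedy)); // "---b</Section>"
```

So if the goal is to cut everything up to the last opening tag, drop the ?; if it is the first one, keep it.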

Reading the answer posted at the link, trying to wrap my head around the [\s\S]. How does this match a newline? Maybe I should post this as a different question. — Jack Thor, Aug 06 '13 at 18:43
@JackThor short answer: because `\s` will match a newline (whereas `.` will not). see my updated answer for more. — user428517, Aug 06 '13 at 18:54

score 0 · Answer 2 · answered Aug 06 '13 at 18:34

arrayValues[index].replace("/[([.,\n,\s])*<Section>]/", "---");

What is wrong with my regex?

It's not a regex, it's a string literal. replace does not convert a string pattern into a regex; it searches for that exact text, slashes included, so nothing matches. Use a regex literal instead:

arrayValues[index].replace(/[\S\s]*<Section>/, "---");

Also, your pattern has several unnecessary characters. The [] around the whole thing builds a character class, which is not what you want. The capturing group () just wraps a character class that can be repeated on its own. And a dot . inside a character class matches a literal dot, not any character.
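The character-class points are easy to check in a console. A small sketch, using made-up inputs rather than the asker's data:

```javascript
// Inside a character class, . is just a literal dot.
const dotInClass = "a.b".match(/[.]/g);

// Outside a class, . matches any character except a line terminator.
const dotOutside = "a.b".match(/./g);

// Wrapping a tag name in [] turns it into a set of single characters,
// so this matches one character from the set, not the whole tag.
const oneChar = "<Section>".match(/[<Section>]/)[0];

console.log(dotInClass); // [ '.' ]
console.log(dotOutside); // [ 'a', '.', 'b' ]
console.log(oneChar);    // <
```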
