-1

I do not know what I am doing wrong. I have this string, part of which I want to replace:

<?xml version="1.0" encoding="utf-8" ?>
 <Sections>
  <Section>

I am using regex to replace everything up to and including <Section>, and leave the rest untouched.

arrayValues[index].replace("/[([.,\n,\s])*<Section>]/", "---");

What is wrong with my regex? Doesn't this mean replace every character, including newlines and spaces, up to and including <Section> with ---?

Jack Thor
  • 6
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Kristian Aug 06 '13 at 18:23
  • Tony? Tony? ... Tony? – Dave Newton Aug 06 '13 at 18:25
  • 1
    @kristian http://meta.stackexchange.com/questions/182189/please-stop-linking-to-the-zalgo-anti-cthulhu-regex-rant, besides, regex CAN parse html just fine, despite popular belief. – user428517 Aug 06 '13 at 18:26
  • I understand that regex can't be used with tags, but I already made the XML doc into a string. I also checked using typeof on arrayValues[index] to make sure that it is a string. – Jack Thor Aug 06 '13 at 18:26
  • You should be processing this using [DOM](http://en.wikipedia.org/wiki/Document_Object_Model), not regex. You are collapsing all whitespace, which is creating this: `---------------`..etc – Alex W Aug 06 '13 at 18:27
  • @Kristian You'd be better off linking to that answer, I've seen people think the Q is different so the answer doesn't apply... – Basic Aug 06 '13 at 18:30
  • Well, I am uploading the XML doc on the client side using a reader with readAsText, so it returns a string. – Jack Thor Aug 06 '13 at 18:31
  • 1
    @JackThor The problem isn't that it can't be used with tags - all html/xml docs with tags are just specially formatted strings - the problem is that the layout and formatting of those tags in the string has meaning more complex than a regex can cope with (at least, unless you're talking about thousands of characters in your regex). It's just the wrong tool for the job – Basic Aug 06 '13 at 18:31
  • 1
    regex is a **perfectly fine** tool for this job. he's got a bit of text, and wants to replace some of it. regex will work perfectly well for this; in fact, that's what it's designed to do. – user428517 Aug 06 '13 at 18:35
  • 1
    @sgroves, regexs can help *lex* HTML but they can't be used to correctly *parse* HTML, because [whether `<![CDATA[` starts a CDATA section](http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#cdata-sections) depends on whether you are in a foreign XML context and determining that requires matching end tags with start tags which is not doable solely with regular expressions, not even when extended with back-references. Yes, you can hack something together to solve a particular problem on a subset of HTML, but that's not the same as parsing HTML. – Mike Samuel Aug 06 '13 at 18:52
  • @MikeSamuel correct, i technically meant lex not parse. the vast majority of regex/html questions here talk about lexing, and 99% of the time someone links to that cthulu post it's entirely inapplicable to the problem. – user428517 Aug 06 '13 at 18:55

2 Answers

2

First of all, you need to remove the quotes around your regex—if they're there, the argument won't be processed as a regex. JavaScript will see it as a string (because it is a string) and try to match it literally.
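To make the difference concrete, here is a minimal sketch. The sample string `xml` and the variable names are made up for illustration and stand in for arrayValues[index]:

```javascript
// Hypothetical sample input standing in for arrayValues[index].
const xml = '<?xml version="1.0" encoding="utf-8" ?>\n <Sections>\n  <Section>\n   <Item/>';

// String pattern: replace() looks for these exact characters, slashes included.
const withString = xml.replace("/[\\s\\S]*?<Section>/", "---");

// Regex literal: the pattern is interpreted as a regular expression.
const withRegex = xml.replace(/[\s\S]*?<Section>/, "---");

console.log(withString === xml);           // true: nothing matched, input unchanged
console.log(JSON.stringify(withRegex));    // "---\n   <Item/>"
```

Note that replace() returns a new string and leaves the original untouched, so the result has to be assigned somewhere (for example back to arrayValues[index]) to have any effect.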

Now that that's taken care of, we can simplify your regex a bit:

arrayValues[index].replace(/[\s\S]*?<Section>/, "---");

[\s\S] gets around the lack of an s flag in JavaScript at the time of writing (a handy option supported by most languages that enables . to match newlines; ES2018 later added it to JavaScript). \s does match newlines (even without an s flag specified), so the character class [\s\S] tells the regex engine to match:

  • \s - a whitespace character, which could be a newline

OR

  • \S - a non-whitespace character

So you can think of [\s\S] as matching . (any character except a newline) or the literal \n (a newline). See the question "Javascript regex multiline flag doesn't work" for more.

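A small sketch of how . and [\s\S] behave differently across a line break. The sample `text` is invented for illustration:

```javascript
// Hypothetical two-line input.
const text = "line one\nline two<Section>";

// Without an s flag, . cannot cross the newline, so the match
// can only begin on the second line.
const dotMatch = text.match(/.*<Section>/);

// [\s\S] matches any character, newline included, so the match
// begins at the very start of the string.
const anyMatch = text.match(/[\s\S]*<Section>/);

console.log(dotMatch[0]);          // line two<Section>
console.log(anyMatch[0] === text); // true
```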
? is used to make the initial [\s\S]* match non-greedy, so the regex engine will stop once it hits the first occurrence of <Section>.
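The effect of the ? only shows up when <Section> occurs more than once. A sketch, assuming a made-up document `doc` with two sections:

```javascript
// Hypothetical input containing two <Section> elements.
const doc = "<Sections>\n <Section>\n  a\n </Section>\n <Section>\n  b\n </Section>";

// Non-greedy: stops at the first <Section>.
const lazy = doc.replace(/[\s\S]*?<Section>/, "---");

// Greedy: consumes as much as possible, so it stops at the last <Section>.
const greedy = doc.replace(/[\s\S]*<Section>/, "---");

console.log(JSON.stringify(lazy));   // "---\n  a\n </Section>\n <Section>\n  b\n </Section>"
console.log(JSON.stringify(greedy)); // "---\n  b\n </Section>"
```

With a single <Section> in the input, both forms give the same result.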

user428517
  • I'd make the class non-greedy, just in case. – georg Aug 06 '13 at 18:36
  • reading the answer posted on the link trying to wrap my head around the [\s\S]. How does this match a new line? Maybe I should post this as a different question. – Jack Thor Aug 06 '13 at 18:43
  • @JackThor short answer: because `\s` will match a newline (whereas `.` will not). see my updated answer for more. – user428517 Aug 06 '13 at 18:54
0
arrayValues[index].replace("/[([.,\n,\s])*<Section>]/", "---");

What is wrong with my regex?

It's not a regex, it's a string literal. When replace is given a string, it searches for that exact text, slashes and brackets included, so nothing matches. Use a regex literal instead:

arrayValues[index].replace(/[\S\s]*<Section>/, "---");

Also, your pattern has many unnecessary characters. The [] around the whole thing builds a character class, which is not what you want. The capturing group () just wraps a character class, which can be repeated by itself. And a dot . inside a character class matches a literal dot instead of any character.
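A short sketch of the character-class points. The variable names are illustrative, and the last check feeds the question's pattern (without the quotes and slashes) to the RegExp constructor to see how the engine reads it:

```javascript
// Inside a character class, a dot is just a literal dot.
const dotInClassMatchesLetter = /[.]/.test("a"); // false
const dotInClassMatchesDot = /[.]/.test(".");    // true
const bareDotMatchesLetter = /./.test("a");      // true

// The original pattern, taken as a regex: the class ends at the first ],
// which leaves the ) without a matching (, so the pattern is rejected.
let patternError = null;
try {
  new RegExp("[([.,\\n,\\s])*<Section>]");
} catch (e) {
  patternError = e;
}

console.log(dotInClassMatchesLetter, dotInClassMatchesDot, bareDotMatchesLetter);
console.log(patternError instanceof SyntaxError); // true
```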

Bergi