I have this string in JavaScript:

s = "</p><ol><li>First\n</li><li>Second\n</li></ol><p>"

Then I do this (to remove the outer "</p>...<p>"):

s = s.replace(/^<\/([^> ]+)[^>]*>(.*)<\1>$/,"$2");

Nothing happens (s is unchanged, and using match() returns null), but if I go try it at http://www.regular-expressions.info/javascriptexample.html, it works!

I've tried all sorts of things (creating a separate regExp object, using //g, taking out the ^$, replacing [^> ]+ with [a-z0-9]*...) but nothing makes any difference.

It's driving me nuts. Can anyone tell me what I'm doing wrong?

user1636349
  • What you're doing wrong is using Regular Expressions to parse HTML. HTML is not a Regular Language. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags More than that, it's invalid HTML – Benjamin Gruenbaum Mar 14 '13 at 11:50
  • @BenjaminGruenbaum: While that's a fun answer, it's completely incorrect in this case. OP is just trying to remove a starting tag, and matching ending tag, for this a regex will do perfectly fine. – Evert Mar 14 '13 at 11:52
  • It's not about being fun, it's about being completely wrong in the approach. Using a RegExp to parse HTML is wrong, HTML is not a regular language, it's _very_ easy to show that it's impossible to build a DFA to parse it. Even if it's possible to build such a DFA (or Regex) for this specific case, it's the wrong tool to approach this problem. Just because it's possible doesn't mean one should do it. – Benjamin Gruenbaum Mar 14 '13 at 11:54
  • Note that the problem here is that the string starts with a closing tag, and ends with an opening tag. That's why I'm using an RE: it isn't well-formed HTML. – user1636349 Mar 14 '13 at 12:33
  • Problem now solved as per MikeM below. And meanwhile if anyone took BenjaminGruenbaum's advice they'd still be crafting an LL(k) sledgehammer for a very very small RE nut. – user1636349 Mar 17 '13 at 16:05

1 Answer

The problem is simply that . does not match newline characters (\n), and your string contains two of them.

If you replace the .* with [\s\S]*, your regex should work.

[\s\S] means "match any whitespace or non-whitespace character", which amounts to matching any character, newlines included.
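To make the difference concrete, here is a small sketch using the string and pattern from the question. The variable names (original, fixed, dotAll and so on) are only illustrative. The last pattern uses the s (dotAll) flag, which is not part of the answer above; it was added to the language in ES2018, so treat it as an alternative that assumes a reasonably modern engine.

```javascript
// The string from the question: it starts with a closing tag
// and ends with an opening tag, with "\n" characters in between.
const s = "</p><ol><li>First\n</li><li>Second\n</li></ol><p>";

// Pattern from the question: `.` cannot cross "\n", so the anchored
// match fails and replace() returns the string unchanged.
const original = /^<\/([^> ]+)[^>]*>(.*)<\1>$/;

// Suggested fix: [\s\S] matches any character, newlines included.
const fixed = /^<\/([^> ]+)[^>]*>([\s\S]*)<\1>$/;

// Assumed alternative (ES2018+): the `s` flag makes `.` match newlines too.
const dotAll = /^<\/([^> ]+)[^>]*>(.*)<\1>$/s;

const unchanged = s.replace(original, "$2");
const stripped = s.replace(fixed, "$2");
const strippedDotAll = s.replace(dotAll, "$2");

console.log(unchanged === s);             // the original pattern changes nothing
console.log(JSON.stringify(stripped));    // inner markup, outer tags removed
console.log(stripped === strippedDotAll); // both fixes give the same result
```

[\s\S] works in older engines as well, which is presumably why it is the usual recommendation; the s flag reads more naturally if you can rely on ES2018 support.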

MikeM