I have some HTML with inline JavaScript in a <script> tag. The script contains a regular expression that removes superfluous whitespace between a > and a < character, as in

<script>
[...]
output = output.replace(/>\s*</g, '><');
[...]
</script> 

This is invalid HTML (e.g., according to PHP's DOMDocument->loadHTML()), as the character sequence </ ends processing and is expected to be followed by the rest of the closing tag script>.

I have tried to escape the < as &lt; but then the expression doesn't match anymore (tested in jsfiddle).

A workaround is to insert something in the regular expression that doesn't actually do anything but separates the < from the /, such as

output = output.replace(/>\s*[<]/g, '><');

This works and has the expected behavior, but looks like a terrible hack.

What is the right way to escape < before / in a js regular expression?

cgogolin
  • One of the many reasons not to put code of any significance directly in `script` tags. – T.J. Crowder Jun 23 '19 at 17:07
  • Have a try with: `output = output.replace(/>\s*\<');` – Toto Jun 23 '19 at 17:12
  • @Toto - That won't help – T.J. Crowder Jun 23 '19 at 17:13
  • You don't need to escape anything. You read something wrong. What delimits script is script and nothing else. https://regex101.com/r/t4HQKV/1 –  Jun 23 '19 at 17:16
  • You can create the regexp without using a regexp literal, e.g.: `const re = new RegExp('>\\s*<', 'g'); output = output.replace(re, '><');` -- as you can see, this way it does not need slashes at the start and end. – alx Jun 23 '19 at 17:18
  • `I have tried to escape the < as &lt;` This should fix the problem, as entities are substituted during the HTML parse process. That's their only purpose. –  Jun 23 '19 at 17:21
  • @sln - But the content of a `script` element **isn't** HTML, so HTML entities don't work there. (Contrast with, for instance, the content of an `onclick` attribute, which *is* HTML and so `onclick="if (foo &lt; bar) alert(&quot;Hi there&quot;)"` works just fine. :-) ) – T.J. Crowder Jun 23 '19 at 17:25
  • If you insist on modifying the regex, use the expanded modifier and just move the characters away from each other `(/ > \s* < /xg` Because `<:space:/` is not a legal start of a closing tag. –  Jun 23 '19 at 17:27
  • As I said, `&lt;` doesn't work, I tried. Creating the regexp separately is a good suggestion. The `x` modifier is also a neat trick. Thanks! – cgogolin Jun 24 '19 at 00:36
  • @sln: there's no x flag in Javascript. (but this flag is available in XRegExp). – Casimir et Hippolyte Jun 24 '19 at 02:50
  • Just a couple more options for ya. Compile the regex as an object? `rx = new RegExp(">\\s*<","g");` or, if you're hung up on the // operators, change the regex to something more reasonable: `output = output.replace(/>\s*(?=<)/g, '>');` but remember that `/>` is also the ending of a self-contained tag. –  Jun 25 '19 at 08:21

1 Answer

If PHP's DOMDocument->loadHTML() thinks the script element ends there, I'm fairly sure it's a bug in DOMDocument->loadHTML(). Script elements end with </script>, and the content of script elements is not HTML. script elements have a much more...interesting...content model than that, which the spec takes several paragraphs to explain.

Regarding issues with </, the spec only mentions dealing with <!-- and </script>, not </ in general.
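For those two sequences, one common approach is to write the `<` as `\x3C` inside string literals, so the raw characters never appear in the inline script. A minimal sketch (the variable names are made up for illustration):

```javascript
// Inside a string literal, \x3C is just another way to write "<".
// The HTML parser never sees the risky sequence, but the JavaScript
// value is unchanged.
const closeTag = '\x3C/script>';   // same value as "<" + "/script>"
const commentOpen = '\x3C!--';     // same value as "<" + "!--"

console.log(closeTag === '<' + '/script>');  // true
console.log(commentOpen === '<' + '!--');    // true
```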

But if you have to have inline script (you wouldn't have this problem if the code were in a .js file), and you have to load it with something that apparently has a bug, your hack with the character class ([<] rather than <) isn't bad at all. (I doubt performance is your concern, but if it were, I think we can probably say with a fair bit of assurance that the JavaScript engine's regular expression handler is going to be able to optimize that single-character character class away.)
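To illustrate, here is a small sketch comparing the original literal with a few spellings that keep `<` and `/` apart in the source. The sample input and the names `variants` and `results` are assumptions for the demo, not part of the question's code:

```javascript
// Made-up sample input with whitespace between tags.
const input = '<ul>\n  <li>a</li>   <li>b</li>\n</ul>';

const variants = {
  // The original literal: its source text contains "<" followed by "/".
  original: />\s*</g,
  // The character-class workaround from the question.
  charClass: />\s*[<]/g,
  // Hex escape for "<" inside the regex literal.
  hexEscape: />\s*\x3C/g,
  // RegExp constructor: no slash delimiters at all (note the "g" flag).
  viaConstructor: new RegExp('>\\s*<', 'g'),
};

// Apply each variant to the same input and collect the results by name.
const results = Object.fromEntries(
  Object.entries(variants).map(([name, re]) => [name, input.replace(re, '><')])
);

console.log(results);
// Every variant yields '<ul><li>a</li><li>b</li></ul>'
```

Whichever spelling you pick, a short comment next to it explaining why the `<` is written that way should keep a later reader from "simplifying" it back.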

T.J. Crowder
  • Thanks for this useful answer! I discovered the issue with `DOMDocument->loadHTML()`, but what made me think that `</` is an actual problem and not just a bug in that function was this Stack Overflow question: https://stackoverflow.com/questions/14780858/escape-in-script-tag-contents . I am not concerned about performance at all, rather that someone going through the code in the future will think "Ah, I am going to remove this superfluous character class..." :-) – cgogolin Jun 24 '19 at 00:33
  • @cgogolin - Yeah, definitely would need commenting. :-) – T.J. Crowder Jun 24 '19 at 06:40