I'm generating XML for a SOAP request, and the data that goes into it comes from a third party. The code runs on a server and has no access to DOM functions. Sometimes the data arrives with XML entities already encoded, and other times it does not.

For example sometimes I will receive this: Billy & Joe's Garage

And other times I will receive this: Billy &amp; Joe&apos;s Garage

I know there are solutions for handling the first example, like those in this post: how to escape xml entities in javascript?

But if I apply those solutions to the second example I will get something like:

function escapeXml(unsafe) {
    return unsafe.replace(/[<>&'"]/g, function (c) {
        switch (c) {
            case '<': return '&lt;';
            case '>': return '&gt;';
            case '&': return '&amp;';
            case '\'': return '&apos;';
            case '"': return '&quot;';
        }
    });
}

escapeXml("Billy &amp; Joe&apos;s Garage")
// Returns "Billy &amp;amp; Joe&amp;apos;s Garage"

So for the second example the desired output would be the same as the input.

Jon Lamb
  • You realise, I suppose, that the input is ambiguous and that there's no rule you can apply that will give the right answer 100% of the time? – Michael Kay Mar 14 '20 at 09:40

1 Answer

Of course, the real fix is to refuse corrupt XML and kick it back to the supplier. In the meantime...

Using a negative lookahead assertion, you can exclude any occurrences of & that are followed by amp;, quot; etc.

&(?!(amp|apos|lt|gt|quot);)

will do just this.
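As a quick check of what that pattern matches on its own, here is a minimal sketch. The sample string and the names `bareAmpersand`, `sample` and `escaped` are made up for illustration, and the group is written as non-capturing, which does not change what is matched:

```javascript
// Matches '&' only when it is NOT the start of one of the
// five predefined XML entities (amp, apos, lt, gt, quot).
const bareAmpersand = /&(?!(?:amp|apos|lt|gt|quot);)/g;

// Hypothetical input mixing bare and already-encoded ampersands.
const sample = "Tom & Jerry &amp; friends &lt;3 &copy;";

// Only the bare '&' and the '&' of the unrecognised '&copy;' get escaped.
const escaped = sample.replace(bareAmpersand, "&amp;");
console.log(escaped);
// Tom &amp; Jerry &amp; friends &lt;3 &amp;copy;
```

Note that `&copy;` is escaped because it is not one of the entities XML predefines, which is probably what you want for XML output.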

Combine this with the regex from your question and you should be able to skirt around those ampersands that are already part of a recognised entity while replacing those that are not:

const re = /&(?!(amp|apos|lt|gt|quot);)|[<>'"]/g

function escapeXml(unsafe) {
  return unsafe.replace(re, function(c) {
    switch (c) {
      case '<':
        return '&lt;';
      case '>':
        return '&gt;';
      case '&':
        return '&amp;';
      case '\'':
        return '&apos;';
      case '"':
        return '&quot;';
    }
  });
}
console.log(escapeXml("'Billy &amp; Joe&apos;s Garage & something else'"))
// Logs "&apos;Billy &amp; Joe&apos;s Garage &amp; something else&apos;"
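One gap worth noting: the lookahead only recognises the five named entities, so an already-encoded numeric reference such as `&#39;` or `&#x27;` would still be double-escaped. Here is a sketch that also skips those, assuming your supplier might send numeric references. The names `unescapedChars`, `replacements` and `escapeXmlOnce` are invented for this example:

```javascript
// Skip '&' when it starts a predefined entity or a decimal/hex
// character reference; still match bare < > ' " as before.
const unescapedChars =
  /&(?!(?:amp|apos|lt|gt|quot|#\d+|#x[0-9a-fA-F]+);)|[<>'"]/g;

// Lookup table instead of a switch; every possible match has an entry.
const replacements = {
  '<': '&lt;',
  '>': '&gt;',
  '&': '&amp;',
  "'": '&apos;',
  '"': '&quot;',
};

function escapeXmlOnce(text) {
  return text.replace(unescapedChars, (c) => replacements[c]);
}

console.log(escapeXmlOnce("Billy &#38; Joe&#x27;s <Garage> & more"));
// Billy &#38; Joe&#x27;s &lt;Garage&gt; &amp; more
```

Running the function on its own output leaves it unchanged. It is still a heuristic: if the supplier really meant the literal text `&amp;`, there is no way to tell from the input.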
spender
  • Thank you, I forgot about the lookahead expressions. And yes, this is an unusual circumstance that should not happen in the first place. – Jon Lamb Mar 14 '20 at 19:00