92

In JavaScript (server-side Node.js), I'm writing a program which generates XML as output.

I am building the XML by concatenating a string:

str += '<' + key + '>';
str += value;
str += '</' + key + '>';

The problem is: What if value contains characters like '&', '>' or '<'? What's the best way to escape those characters?

Or is there a JavaScript library that can escape XML entities?
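
For illustration, here is a minimal sketch of how the concatenation above could be made safe. The names `escapeXml` and `element` are hypothetical (they are not from the question), and `key` is assumed to already be a valid element name:

```javascript
// Hypothetical helper: escapes the five characters XML treats specially.
function escapeXml(value) {
    const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;' };
    return String(value).replace(/[&<>"']/g, (c) => entities[c]);
}

// Hypothetical wrapper around the concatenation from the question.
// Only the value is escaped; `key` is assumed to be a valid element name.
function element(key, value) {
    return '<' + key + '>' + escapeXml(value) + '</' + key + '>';
}

console.log(element('title', 'Tom & Jerry <3'));
// <title>Tom &amp; Jerry &lt;3</title>
```

Escaping at the single point where the value enters the markup keeps the rest of the string building unchanged.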

zzzzBov
Zo72

12 Answers

140

This might be a bit more efficient with the same outcome:

function escapeXml(unsafe) {
    return unsafe.replace(/[<>&'"]/g, function (c) {
        switch (c) {
            case '<': return '&lt;';
            case '>': return '&gt;';
            case '&': return '&amp;';
            case '\'': return '&apos;';
            case '"': return '&quot;';
        }
    });
}
hgoebl
  • This seems like a better solution. Why no upticks? – Victor Grazi Mar 18 '15 at 13:59
  • 2
    @VictorGrazi: you're right, it's the faster solution in 49 of 50 tests. Maybe it's because it's nearly 5 years younger than the accepted answer. – Sebastian Aug 12 '15 at 13:56
  • 1
    @Sebastian ahh, that would explain it, thanks. Look here folks ^ ^ ^ ^ this is the solution you want!!! – Victor Grazi Aug 12 '15 at 17:42
  • This strikes me as a better solution than the accepted answer, which traverses the whole string five times (serially, reducing the scope for JS engine optimisation) looking for a match against a single character; *hgoebl*'s solution traverses the input string only once, trying to match each character to one of five conditions. The question is what is more costly: **1)** traversing the string; or: **2)** matching each character against 5 possible characters. My intuition is that **1)** would be the more costly. – Jamie Birch Aug 11 '17 at 11:30
  • The problem with accepted answer: it creates ~5 copies of the string. When the string is long, it's a lot of work to allocate memory and later garbage collect the interim strings not really used anywhere. (Note: JavaScript strings are immutable.) – hgoebl Aug 11 '17 at 15:25
  • Check out my answer - it leverages native code as much as possible and squeezes out even more performance, at the expense of making IE9 the minimum supported version. – jordancpaul Nov 09 '17 at 06:57
  • What is the version of this to "decode" the string, similar to the accepted answer's `decodehtml`? – Ran Lottem Dec 02 '18 at 11:58
  • 1
    @RanLottem decoding is much more complicated if input is HTML, see [Wikipedia](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML). It's better to use a parser (XML or document). – hgoebl Dec 02 '18 at 15:51
135

HTML encoding is simply replacing &, ", ', < and > chars with their entity equivalents. Order matters: if you don't replace the & chars first, you'll double-encode some of the entities:

if (!String.prototype.encodeHTML) {
  String.prototype.encodeHTML = function () {
    return this.replace(/&/g, '&amp;')
               .replace(/</g, '&lt;')
               .replace(/>/g, '&gt;')
               .replace(/"/g, '&quot;')
               .replace(/'/g, '&apos;');
  };
}
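
To see why the order matters, here is a small sketch with two hypothetical functions (`ampFirst` and `ampLast`, names chosen only for this illustration) that differ solely in when the ampersand is replaced:

```javascript
// Illustrative only: the same two replacements, applied in different orders.
const ampFirst = (s) => s.replace(/&/g, '&amp;').replace(/</g, '&lt;');
const ampLast = (s) => s.replace(/</g, '&lt;').replace(/&/g, '&amp;');

console.log(ampFirst('a < b')); // a &lt; b
console.log(ampLast('a < b'));  // a &amp;lt; b  (the & of &lt; was encoded again)
```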

As @Johan B.W. de Vries pointed out, this will have issues with the tag names. To clarify, I assumed that this was being used for the value only.

Conversely, if you want to decode HTML entities [1], make sure you decode &amp; to & after everything else so that you don't double decode any entities:

if (!String.prototype.decodeHTML) {
  String.prototype.decodeHTML = function () {
    return this.replace(/&apos;/g, "'")
               .replace(/&quot;/g, '"')
               .replace(/&gt;/g, '>')
               .replace(/&lt;/g, '<')
               .replace(/&amp;/g, '&');
  };
}

[1] Just the basics, not including &copy; to © or other such things.
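
The decoding order can be checked the same way. In this sketch (function names are hypothetical), the literal text `&lt;` has been encoded as `&amp;lt;`, and only the variant that decodes `&amp;` last recovers it:

```javascript
// Illustrative only: two decoders that differ in when &amp; is handled.
const ampLastDecode = (s) => s.replace(/&lt;/g, '<').replace(/&amp;/g, '&');
const ampFirstDecode = (s) => s.replace(/&amp;/g, '&').replace(/&lt;/g, '<');

const encoded = '&amp;lt;'; // encoded form of the literal text "&lt;"

console.log(ampLastDecode(encoded));  // &lt;  (correct)
console.log(ampFirstDecode(encoded)); // <     (decoded twice)
```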


As far as libraries are concerned, Underscore.js (or Lodash, if you prefer) provides an _.escape method to perform this functionality.

zzzzBov
  • 3
    This almost covers the 5 XML entities. Just need &apos; – Ryan Jul 09 '13 at 19:59
  • I've seen some people also replace newline (\n), tab (\t), and return (\r) – Ryan Kara Aug 06 '13 at 14:44
  • 2
    This looks like it is replacing the same string over and over again which could be performance heavy when handling lots of data. Any faster alternative? – Jonny Feb 13 '14 at 02:16
  • 2
    @Jonny, The regular expression is going to provide worse performance than the multiple calls to `.replace()`. In either case, you'd have to have a seriously huge amount of data to notice any significant issues. A faster alternative would be to benchmark your app and find the *actual* choke point (usually nested loops), rather than worry about something as negligible as this. – zzzzBov Feb 13 '14 at 03:19
  • 1
    I had 100-200 lines of data in a Google Spreadsheet. I was converting that to plists (xml) and had to replace those xml entities. I wrote a custom javascript function using the above code for that. It worked, but was very slow. The spreadsheet kind of choked at times but as it is just a "do once" step the speed didn't matter in the end. – Jonny Feb 13 '14 at 03:22
  • Please see my answer below for a much higher performance alternative – jordancpaul Nov 09 '17 at 01:35
  • Note: older browsers like IE don't understand &apos;. Also, there are multiple ways to represent it, including the decimal and hexadecimal escapes (&#39; and &#x27;). There are multiple ways to represent the other entities as well. – Ryan Nov 21 '17 at 19:45
  • @Jonny it may be that you modify the spreadsheet itself on each iteration. Make sure you read the data from the spreadsheet into a JS variable, make the updates there, and then update the spreadsheet with one call. It will be fast. And make sure you update the whole range at once, not cell by cell. – Lukas Liesis May 08 '18 at 08:22
  • 4
    I know this answer is old, but just to make clear for newcomers to JS: attaching random functions, that are not polyfills for some standardized proposal, to global prototypes is a bad idea. – austin_ce Nov 14 '18 at 17:51
24

If you have jQuery, here's a simple solution:

  String.prototype.htmlEscape = function() {
    return $('<div/>').text(this.toString()).html();
  };

Use it like this:

"<foo&bar>".htmlEscape(); -> "&lt;foo&amp;bar&gt;"

lambshaanxy
8

You can use the method below. I have added it to the String prototype for easier access. I have also used a negative look-ahead so that it won't mess things up if you call the method twice or more.

Usage:

 var original = "Hi&there";
 var escaped = original.EncodeXMLEscapeChars();  //Hi&amp;there

Decoding is handled automatically by the XML parser.

Method :

//String extension to format a string for XML content.
//Replaces XML escape characters with their equivalent entity notation.
String.prototype.EncodeXMLEscapeChars = function () {
    var OutPut = this;
    if ($.trim(OutPut) != "") {
        OutPut = OutPut.replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;").replace(/'/g, "&#39;");
        OutPut = OutPut.replace(/&(?!(amp;)|(lt;)|(gt;)|(quot;)|(#39;)|(apos;))/g, "&amp;");
        OutPut = OutPut.replace(/([^\\])((\\\\)*)\\(?![\\/{])/g, "$1\\\\$2");  //replaces odd backslash(\\) with even.
    }
    else {
        OutPut = "";
    }
    return OutPut;
};
sudhansu63
  • 3
    Underappreciated excellent solution. Ensuring you won't wind up with the infamous &amp; string in your output is beautiful. – Steve Westbrook Dec 09 '16 at 14:08
  • With this code, you just edited *all* instances of String across the whole application, e.g. `let a = 'foo'` will be affected by this code. Better to create a helper function instead of extending the prototype. – Lukas Liesis May 08 '18 at 08:26
  • Please do not mutate builtin objects because it leads to conflicts and so is a very poor practice. – slikts Jul 10 '19 at 06:32
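
Following the comments above, the same idea can be sketched as a standalone helper with no prototype extension and no jQuery dependency. The name `encodeXmlEscapeChars` is hypothetical, and the backslash handling from the answer is deliberately left out of this sketch:

```javascript
// Hypothetical standalone helper: nothing is added to String.prototype.
function encodeXmlEscapeChars(input) {
    return String(input)
        // Escape only ampersands that do not already start one of the listed entities.
        .replace(/&(?!(?:amp|lt|gt|quot|apos|#39);)/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

console.log(encodeXmlEscapeChars('Hi&there'));     // Hi&amp;there
console.log(encodeXmlEscapeChars('Hi&amp;there')); // Hi&amp;there (unchanged on a second pass)
```

The trade-off of the look-ahead is that text which is literally meant to read `&amp;` can no longer be represented, because it is assumed to be already escaped.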
3

Caution: chaining several regex replacements isn't good if you have XML inside XML.
Instead loop over the string once, and substitute all escape characters.
That way, you can't run over the same character twice.

function _xmlAttributeEscape(inputString)
{
    var output = [];

    for (var i = 0; i < inputString.length; ++i)
    {
        switch (inputString[i])
        {
            case '&':
                output.push("&amp;");
                break;
            case '"':
                output.push("&quot;");
                break;
            case "<":
                output.push("&lt;");
                break;
            case ">":
                output.push("&gt;");
                break;
            default:
                output.push(inputString[i]);
        }


    }

    return output.join("");
}
Stefan Steiger
  • Your observation about XML inside XML seems right to me. Being rigorous, you would want to re-escape ampersands of existing entities (e.g. `&amp;`) if you don't want them to break up when decoded. – Bigue Nique Apr 26 '21 at 10:23
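
A minimal sketch of the XML-inside-XML case discussed above, assuming hypothetical single-pass helpers `escapeXml` and `unescapeXml`: the existing `&amp;` in the inner document is escaped again, so one round of decoding restores the inner document exactly.

```javascript
const TO_ENTITY = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;' };
const FROM_ENTITY = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Hypothetical helpers: each visits every character of the input only once.
const escapeXml = (s) => s.replace(/[&<>"']/g, (c) => TO_ENTITY[c]);
const unescapeXml = (s) => s.replace(/&(amp|lt|gt|quot|apos);/g, (m, name) => FROM_ENTITY[name]);

const inner = '<note>Tom &amp; Jerry</note>';
const outer = '<payload>' + escapeXml(inner) + '</payload>';

console.log(outer);
// <payload>&lt;note&gt;Tom &amp;amp; Jerry&lt;/note&gt;</payload>
console.log(unescapeXml(escapeXml(inner)) === inner); // true
```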
1

I originally used the accepted answer in production code and found that it was actually really slow when used heavily. Here is a much faster solution (runs at over twice the speed):

var escapeXml = (function () {
    var doc = document.implementation.createDocument("", "", null);
    var el = doc.createElement("temp");
    el.textContent = "temp";
    el = el.firstChild; // keep a single text node and reuse it for every call
    var ser = new XMLSerializer();
    return function (text) {
        el.nodeValue = text;
        return ser.serializeToString(el);
    };
})();

console.log(escapeXml("<>&")); // &lt;&gt;&amp;
jordancpaul
1

Maybe you can try this:

function encodeXML(s) {
  const dom = document.createElement('div')
  dom.textContent = s
  return dom.innerHTML
}

reference

crown
1

Adding on to zzzzBov's answer, I find this a bit cleaner and easier to read:

const encodeXML = (str) =>
    str
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&apos;');

Additionally, all five characters can be found here for example: https://www.sitemaps.org/protocol.html

Note that this only encodes values (as others have stated).

Justin
1

It feels like time for an update, now that we have string interpolation and a few other modernisations. This version also uses an object lookup rather than a switch.

const escapeXml = (unsafe) =>
    unsafe.replace(/[<>&'"]/g, (c) => `&${({
        '<': 'lt',
        '>': 'gt',
        '&': 'amp',
        '\'': 'apos',
        '"': 'quot'
    })[c]};`);
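
Building on the same modern syntax, one possible extension is a tagged template that escapes every interpolated value automatically. This is only a sketch: the tag name `xml` and the `ENTITIES` table are hypothetical, and interpolated element names are escaped but not validated.

```javascript
const ENTITIES = { '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&apos;', '"': '&quot;' };
const escapeXml = (unsafe) => String(unsafe).replace(/[<>&'"]/g, (c) => ENTITIES[c]);

// Hypothetical tag function: literal parts pass through untouched,
// interpolated values are escaped before being spliced in.
const xml = (strings, ...values) =>
    strings.reduce((out, part, i) => out + escapeXml(values[i - 1]) + part);

const key = 'name';
const value = 'R&D "lab"';
console.log(xml`<${key}>${value}</${key}>`);
// <name>R&amp;D &quot;lab&quot;</name>
```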
Jim Holmes
0

If some of the text has already been escaped, you could try this; unlike many of the other answers, it will not double-escape existing entities:

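function escape(text) {
    // Match either an ampersand (optionally followed by the rest of an
    // entity or character reference) or one of the other special characters.
    return String(text).replace(/&(#?\w+;)?|['"<>]/g, (match, escaped) => {
        // An ampersand that already starts an entity is left untouched.
        if (escaped)
            return match

        switch (match) {
            case '\'': return '&apos;'
            case '"': return '&quot;'
            case '<': return '&lt;'
            case '>': return '&gt;'
            case '&': return '&amp;'
        }
    })
}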
Lostfields
0

Technically, &, < and > aren't valid characters in an XML element name. If you can't trust the key variable, you should filter them out.

If you want them escaped as HTML entities, you could use something like http://www.strictly-software.com/htmlencode .
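
As a rough sketch of the key check mentioned above: the pattern below is a simplified, ASCII-only approximation of what XML allows in a name (real XML names also permit colons and many non-ASCII characters), and `assertValidKey` is a hypothetical helper name.

```javascript
// Simplified assumption: a letter or underscore, then letters, digits, '.', '_' or '-'.
const SIMPLE_XML_NAME = /^[A-Za-z_][A-Za-z0-9._-]*$/;

// Hypothetical helper: rejects keys that cannot be used as an element name.
function assertValidKey(key) {
    if (!SIMPLE_XML_NAME.test(key)) {
        throw new Error('Invalid XML element name: ' + key);
    }
    return key;
}

console.log(assertValidKey('item'));       // item
console.log(SIMPLE_XML_NAME.test('a<b'));  // false
```

Rejecting a bad key outright is usually safer than silently stripping characters, since stripping can turn two different keys into the same element name.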

-2

This is simple:

sText = ("" + sText).split("&").join("&amp;").split("<").join("&lt;").split(">").join("&gt;").split('"').join("&#34;").split("'").join("&#39;");
Remi Guan
Per Ghosh
  • 1
    in what world is this 'simple' compared to the replace methods above? – Sam Holder Sep 22 '16 at 11:39
  • Simple to write, I didn't say that it was simpler compared to those above. It is just different – Per Ghosh Oct 06 '16 at 22:48
  • 5
    I'm stumped trying to think of a worse solution – brettwhiteman Nov 29 '16 at 04:50
  • @developerbmw if you don't want to add a method and don't use jquery, this is one of the best solutions – Per Ghosh Mar 15 '17 at 14:01
  • 1
    @developerbmw for readability, it's sometimes better to write code in a way that lets you read the functionality from top to bottom. Code in dynamic languages can become unreadable very quickly. It depends on the situation, for example when components are used in different scenarios and need their own logic. A small component may not need functions if the logic is only used in one method. I am a C++ developer and have written a lot of C. C is very easy to read, and if you know why, then you know my argument on this – Per Ghosh Mar 23 '17 at 21:59