166

I am working in a CMS which allows users to enter content. The problem is that when they add symbols such as ®, they may not display well in all browsers. I would like to set up a list of symbols to search for and convert to the corresponding HTML entities. For example:

® => &reg;
& => &amp;
© => &copy;
™ => &trade;

After the conversion, it needs to be wrapped in a <sup> tag, resulting in this:

® => <sup>&reg;</sup>

Because a particular font size and padding style is necessary:

sup { font-size: 0.6em; padding-top: 0.2em; }

Would the JavaScript be something like this?

var regs = document.querySelectorAll('®');
  for ( var i = 0, l = imgs.length; i < l; ++i ) {
  var [?] = regs[i];
  var [?] = document.createElement('sup');
  img.parentNode.insertBefore([?]);
  div.appendChild([?]);
}

Where "[?]" means that there is something that I am not sure about.

Additional Details:

  • I would like to do this with pure JavaScript, not something that requires a library like jQuery, thanks.
  • Backend is Ruby
  • Using RefineryCMS which is built with Ruby on Rails
Yves M.
JGallardo
  • What is your backend? If it is php, there are functions to take care of this for you, and I'm sure other languages have them as well. Also, Google: http://developwithstyle.com/articles/2010/06/29/converting-html-entities-to-characters/ – Chris Baker Sep 11 '13 at 19:30
  • 6
    A better solution might be to accept and output UTF-8-encoded text. Every browser in use today supports UTF-8. On the HTML side, you’d want to add `accept-charset="UTF-8"` to your `<form>` tag. On the server, you’d want to make sure your output is UTF-8 encoded, and that your web server tells the browser that it is (via the `Content-Type` header). See http://rentzsch.tumblr.com/post/9133498042/howto-use-utf-8-throughout-your-web-stack If you do all that, and a browser doesn’t display the character correctly, then replacing the character with an entity wouldn’t make any difference.
    – Paul D. Waite Sep 11 '13 at 19:30
  • @Chris working in a CMS built with Ruby on Rails. – JGallardo Sep 11 '13 at 19:33
  • It is wrong to change a character to an HTML entity reference in client-side JavaScript, since client-side JavaScript operates on the DOM, where entities do not exist. Wrapping “®” into `sup` elements tends to cause more problems than it could possibly solve, since in many fonts, “®” is small and in subscript position, so you would reduce it to unrecognizable. – Jukka K. Korpela Sep 11 '13 at 22:05
  • @JukkaK.Korpela, so considering that I need to address that some html entities will not display properly, how would you address it? And wrapping in `<sup>` is not an issue since I have tested the specific fonts used for the blog posts, but that is a good point to consider. – JGallardo Sep 11 '13 at 23:56
  • Entities are not rendered; characters are. If you have `&reg;` and `®` in HTML source, the result is exactly the same, since `&reg;` gets turned to `®` before rendering starts. If the character does not look good, it’s a font problem, and that’s what you should address (possibly using a different font for `®` than for text around it, though primarily you should select one font that suits your text, including special symbols inside it). – Jukka K. Korpela Sep 12 '13 at 05:09

19 Answers19

264

You can use regex to replace any character in a given unicode range with its html entity equivalent. The code would look something like this:

var encodedStr = rawStr.replace(/[\u00A0-\u9999<>\&]/g, function(i) {
   return '&#'+i.charCodeAt(0)+';';
});

Or in ES6 (same implementation, but one line):

const encodedStr = rawStr.replace(/[\u00A0-\u9999<>\&]/g, i => '&#'+i.charCodeAt(0)+';')

This code will replace all characters in the given range (unicode 00A0 - 9999, as well as ampersand, greater & less than) with their html entity equivalents, which is simply &#nnn; where nnn is the unicode value we get from charCodeAt.

See it in action here: http://jsfiddle.net/E3EqX/13/ (this example uses jQuery for element selectors used in the example. The base code itself, above, does not use jQuery)

Making these conversions does not solve all the problems: make sure you're using UTF-8 character encoding and that your database stores the strings in UTF-8. You still may see instances where the characters do not display correctly, depending on system font configuration and other issues out of your control.
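
For the `<sup>` wrapping asked about in the question, a minimal string-only sketch might look like the following. The names `SYMBOL_ENTITIES` and `encodeAndWrapSymbols` are made up for illustration (this is not the code from the jsfiddle), and it assumes the input is text whose other characters are already safe to place in HTML.

```javascript
// Hypothetical map of the symbols to convert, taken from the question's list.
const SYMBOL_ENTITIES = {
  '®': '&reg;',
  '©': '&copy;',
  '™': '&trade;',
};

// Replace each listed symbol with its named entity, wrapped in the given tag.
function encodeAndWrapSymbols(text, tag = 'sup') {
  // Build a character class such as /[®©™]/g from the map keys.
  const pattern = new RegExp('[' + Object.keys(SYMBOL_ENTITIES).join('') + ']', 'g');
  return text.replace(pattern, (ch) => `<${tag}>${SYMBOL_ENTITIES[ch]}</${tag}>`);
}

console.log(encodeAndWrapSymbols('Acme® Widgets™'));
// → Acme<sup>&reg;</sup> Widgets<sup>&trade;</sup>
```

Adding another symbol is a matter of adding one entry to the map, as long as the key is a single character with no special meaning inside a regex character class.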


eskwayrd
Chris Baker
  • Thank you so much for the jsfiddle. So to implement this. I can just add this to my `.js` file and add the other things to wrap with a `<sup>`? – JGallardo Sep 11 '13 at 21:28
  • 2
    @JGallardo I re-wrote the example a little so it adds the `sup` tag (or any other tag), and it is contained in a function: http://jsfiddle.net/E3EqX/4/ . To use this, you need to copy the "encodeAndWrap" function to your project. – Chris Baker Sep 11 '13 at 21:49
  • Just a note: some execution environment (i.e. rhino) don't like unicode escapes in a case insensitive regex – Lorenzo Boccaccia Jul 02 '14 at 14:19
  • @mathias Bynens' answer below is a more complete answer. – RavenHursT Nov 04 '14 at 19:21
  • 1
    Although I agree that @mathias Bynens answer is more complete, his solution is 84KB, and that has made me to continue looking for an alternative one. This seems OK-ish, however could one also include charCodes < 65, and between >90 && <97 ? – Florian Mertens Dec 08 '14 at 15:55
  • 2
    `const encodeHTMLEntities = s => s.replace(/[\u00A0-\u9999<>\&]/g, i => '&#'+i.charCodeAt(0)+';')` – Ray Foss Jun 04 '21 at 16:28
  • Double quote characters (") need to be encoded as well for when working with attribute values. – snesin Sep 29 '22 at 03:52
78

The currently accepted answer has several issues. This post explains them, and offers a more robust solution. The solution suggested in that answer previously had:

var encodedStr = rawStr.replace(/[\u00A0-\u9999<>\&]/gim, function(i) {
  return '&#' + i.charCodeAt(0) + ';';
});

The i flag is redundant since no Unicode symbol in the range from U+00A0 to U+9999 has an uppercase/lowercase variant that is outside of that same range.

The m flag is redundant because ^ or $ are not used in the regular expression.

Why the range U+00A0 to U+9999? It seems arbitrary.

Anyway, for a solution that correctly encodes all except safe & printable ASCII symbols in the input (including astral symbols!), and implements all named character references (not just those in HTML4), use the he library (disclaimer: This library is mine). From its README:

he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and — contrary to many other JavaScript solutions — he handles astral Unicode symbols just fine. An online demo is available.

Also see this relevant Stack Overflow answer.
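
If pulling in a library is not an option, the astral-symbol problem on its own can be sketched with the `u` regex flag and `codePointAt`. This is only an illustration, not the he library's implementation: the function name `encodeNonAscii` is hypothetical, it emits numeric references only (no named ones), and it does no decoding.

```javascript
// Encode &, <, >, " and every non-ASCII code point as a decimal numeric reference.
// With the `u` flag, an astral symbol (a surrogate pair) is matched as one unit,
// and codePointAt(0) returns its full code point instead of half of the pair.
function encodeNonAscii(str) {
  return str.replace(/[&<>"]|[^\x00-\x7F]/gu, (ch) => `&#${ch.codePointAt(0)};`);
}

// '\u{1D306}' is the astral symbol U+1D306.
console.log(encodeNonAscii('© & \u{1D306}'));
// → &#169; &#38; &#119558;
```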

mickmackusa
Mathias Bynens
  • 26
    Also, the HE library is... 84KB! Autch... Try downloading that on a mobile phone over a lesser connection. A compromise has to be made somewhere.. – Florian Mertens Dec 08 '14 at 15:52
  • 1
    @FlorianMertens After minifying + gzip _he_ is ~24 KB. That’s still big, but at the end of the day if you want to decode HTML entities correctly, you’re gonna need all the data on them — there’s no way around it. If you can find a way to make the library smaller without affecting performance, please submit a pull request. – Mathias Bynens Dec 09 '14 at 17:17
  • @FlorianMertens I know this is a late comment, but to put it into perspective [even in 2014 most sites have images way bigger than 84KB](https://gigaom.com/2014/12/29/the-overweight-web-average-web-page-size-is-up-15-in-2014/), so download isn't the problem. If anything it'd more likely be trouble with the JS engine handling that much script for an "underpowered" phone. – drzaus Jan 19 '17 at 18:46
  • The main problem with this answer is that its code block is the same as the first answer's, so other readers cannot tell what the text under the code is saying. Coders usually look at the code block first, and here it is the same. – xkeshav Aug 31 '17 at 17:09
  • 4
    @drzaus Images can get away with being big because they store a lot of data, and less compressed data is faster to decode. However program code is different, very often a entire library is added and little use is made of it. The code of the libraries sometimes contain more lines than your own code! Plus few will bother to find/debug lib issues and send bug reports (or even update the lib), so memory leaks or other issues may persist in software with many libs with unchecked code. If someone just wants to encode/escape html-unsafe chars, only a few lines are needed, not 80kb. – bryc Sep 15 '17 at 16:50
  • A thing to keep in mind about this answer is that an emoji is split amongst 2 chars, actually! so emojis wont be converted properly. – Marco Klein Jan 04 '19 at 15:59
  • 1
    @MarcoKlein Yeah, I explain that in my post. It’s indeed a problem that the buggy code snippet suffers from. The solution I point to doesn’t have that problem. (see “including astral symbols!”) – Mathias Bynens Jan 07 '19 at 12:20
  • I’m sure it’s a nice library, but it’s not winning any points calling it “he” and I shudder to think of competition offering a “me” (Markup Entities) library and further SHE (super HTML entities) or other countless variants that pollute the common pronoun namespace – vol7ron Aug 29 '19 at 03:12
39

I had the same problem and created 2 functions to create entities and translate them back to normal characters. The following methods translate any string to HTML entities and back on String prototype

/**
 * Convert a string to HTML entities
 */
String.prototype.toHtmlEntities = function() {
    return this.replace(/./gm, function(s) {
        // return "&#" + s.charCodeAt(0) + ";";
        return (s.match(/[a-z0-9\s]+/i)) ? s : "&#" + s.charCodeAt(0) + ";";
    });
};

/**
 * Create string from HTML entities
 */
String.fromHtmlEntities = function(string) {
    return (string+"").replace(/&#\d+;/gm,function(s) {
        return String.fromCharCode(s.match(/\d+/gm)[0]);
    })
};

You can then use it as following:

var str = "Test´†®¥¨©˙∫ø…ˆƒ∆÷∑™ƒ∆æøπ£¨ ƒ™en tést".toHtmlEntities();
console.log("Entities:", str);
console.log("String:", String.fromHtmlEntities(str));

Output in console:

Entities: &#68;&#105;&#116;&#32;&#105;&#115;&#32;&#101;&#180;&#8224;&#174;&#165;&#168;&#169;&#729;&#8747;&#248;&#8230;&#710;&#402;&#8710;&#247;&#8721;&#8482;&#402;&#8710;&#230;&#248;&#960;&#163;&#168;&#160;&#402;&#8482;&#101;&#110;&#32;&#116;&#163;&#101;&#233;&#115;&#116;
String: Dit is e´†®¥¨©˙∫ø…ˆƒ∆÷∑™ƒ∆æøπ£¨ ƒ™en t£eést 
Community
ar34z
  • 1
    This solution works on tvOS too, so it can solve well encoding issues in all cases. – loretoparisi Oct 12 '15 at 15:21
  • 6
    Isn't that a bit extreme? You're converting *everything* to HTML entities, even "safe" characters such as "abc", "123"... even the whitespaces – AJPerez May 12 '17 at 08:44
  • 3
    This is a bad answer. 50%+ of documents on the web contain mostly latin1 with some utf-8. Your encoding of safe characters will increase its size by 500% to 600%, without any advantage. – HoldOffHunger Jul 19 '18 at 16:44
  • Please explain the purpose of the `m` pattern modifier in a pattern that has no anchors. So you mean to use `s` for the pattern containing a dot? – mickmackusa Nov 10 '20 at 01:14
  • I like that bigger hammer approach, it is radical, funny, clever and also useless. Please don't use stuff like that in production code – Slion Jun 29 '23 at 12:28
32

This is an answer for people googling how to encode html entities, since it does not really address the question regarding the <sup> wrapper and symbols entities.

For HTML tag entities (&, <, and >), without any library, if you do not need to support IE < 9, you could create an HTML element and set its content with Node.textContent:

var str = "<this is not a tag>";
var p = document.createElement("p");
p.textContent = str;
var converted = p.innerHTML;

Here is an example: https://jsfiddle.net/1erdhehv/
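
Where no DOM is available (for example in a server-side script), roughly the same result can be sketched with plain string replacement. The helper name `escapeHtml` is an assumption for illustration; unlike the `textContent` approach it also escapes the double quote, so the result can be used inside a double-quoted attribute value.

```javascript
// String-only sketch: escape the characters that are significant in HTML markup.
// The ampersand must be replaced first so the entities added afterwards
// are not escaped a second time.
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

console.log(escapeHtml('<this is not a tag>'));
// → &lt;this is not a tag&gt;
```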

antoineMoPa
  • 2
    Why not use innerText instead of textContent? – Rick Sep 30 '19 at 16:54
  • 1
    @Rick, give the MDN document for textContent linked in the answer a shot. Quoting it " textContent and HTMLElement.innerText are easily confused, but the two properties are [different in important ways](https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent#Differences_from_innerText)." – Adarsha Oct 16 '20 at 05:10
  • 1
    This would be a great solution, but it does not encode the " character properly. – Andreas Dec 23 '20 at 14:22
  • 1
    You are right. It looks like this solution may only work for html tag characters (<,>,/). I am tempted to remove it. – antoineMoPa Dec 23 '20 at 15:29
  • @Andreas, I'm curious - in what situation do you need to encode the " character? – kibibu May 10 '22 at 11:43
  • Hi @kibibu, actually I can't remember anymore. – Andreas May 10 '22 at 13:37
  • I should have checked the date!! – kibibu May 16 '22 at 03:50
  • @kibibu You need to escape `"` into `&quot;` in things like this: ``document.write(`click here`)``. Bad style but still handy. (Does encodeURIComponent escape the `"` already? I can't remember.) – Tino Sep 27 '22 at 14:47
  • @Tino yes, with %22. HTML escaping and attribute escaping are two different things and should be treated differently. – kibibu Sep 28 '22 at 00:42
  • @Andreas checkout this one https://stackoverflow.com/a/65592593/14344959 – Harsh Patel Mar 24 '23 at 05:15
29

You can use this.

var escapeChars = {
  '¢' : 'cent',
  '£' : 'pound',
  '¥' : 'yen',
  '€': 'euro',
  '©' :'copy',
  '®' : 'reg',
  '<' : 'lt',
  '>' : 'gt',
  '"' : 'quot',
  '&' : 'amp',
  '\'' : '#39'
};

var regexString = '[';
for(var key in escapeChars) {
  regexString += key;
}
regexString += ']';

var regex = new RegExp( regexString, 'g');

function escapeHTML(str) {
  return str.replace(regex, function(m) {
    return '&' + escapeChars[m] + ';';
  });
};

https://github.com/epeli/underscore.string/blob/master/escapeHTML.js

var htmlEntities = {
    nbsp: ' ',
    cent: '¢',
    pound: '£',
    yen: '¥',
    euro: '€',
    copy: '©',
    reg: '®',
    lt: '<',
    gt: '>',
    quot: '"',
    amp: '&',
    apos: '\''
};

function unescapeHTML(str) {
    return str.replace(/\&([^;]+);/g, function (entity, entityCode) {
        var match;

        if (entityCode in htmlEntities) {
            return htmlEntities[entityCode];
            /*eslint no-cond-assign: 0*/
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            return String.fromCharCode(parseInt(match[1], 16));
            /*eslint no-cond-assign: 0*/
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            return String.fromCharCode(~~match[1]);
        } else {
            return entity;
        }
    });
};
takdeniz
  • 9
    Manually adding a random subset of encodable characters is likely storing up trouble for yourself and your colleagues down the line. There should be a single authority for which characters should be encoded, probably the browser or failing that a specific library that's likely to be comprehensive and maintained. – user234461 Jun 27 '19 at 11:01
  • 5
    Great point, @user234461. If anyone finds that single authority, inquiring minds (like me) would love to know! – idungotnosn Jan 23 '20 at 18:37
  • 3
    This will miss a lot of html-entities, such as `”` `ü` `š` etc. The comprehensive list of all html-entities is quite long: https://www.freeformatter.com/html-entities.html – lofihelsinki Dec 01 '20 at 11:59
  • This worked great for our use case of limited HTML entities – Lisa Schumann Feb 07 '23 at 17:15
  • 1
    The truly comprehensive list is at , or in JSON at . – SamB Feb 24 '23 at 22:58
  • Tried to use that quick as copy/paste but I don't think the implementation is very solid. I was getting very inconsistent results with it. Went for the solution suggesting to use a Node.js module instead: https://stackoverflow.com/a/74034242/3969362 – Slion Jun 29 '23 at 12:24
27

One easy way to encode or decode HTML entities:
just call a function with one argument.

Decode HTML-entities

function decodeHTMLEntities(text) {
  var textArea = document.createElement('textarea');
  textArea.innerHTML = text;
  return textArea.value;
}

Decode HTML-entities (JQuery)

function decodeHTMLEntities(text) {
  return $("<textarea/>").html(text).text();
}

Encode HTML-entities

function encodeHTMLEntities(text) {
  var textArea = document.createElement('textarea');
  textArea.innerText = text;
  return textArea.innerHTML;
}

Encode HTML-entities (JQuery)

function encodeHTMLEntities(text) {
  return $("<textarea/>").text(text).html();
}
Harsh Patel
  • 1
    I love the simplicity of this solution. Do you know if it is safe to do it like this? Would text with encoded < and > entities around a script create potential security issues? For example: `<script>` – Rob Knight Mar 08 '22 at 01:04
  • @RobKnight In my test case, encoded < and > entities work perfectly. I have been using this solution in my project since January 2021, and I still haven't found any issue with this trick. Please check this example https://jsfiddle.net/dutf5kqz/ I hope this helps. – Harsh Patel Mar 08 '22 at 12:08
  • 2
    In 2022 this should be the accepted answer. – Justin Vincent Oct 31 '22 at 03:23
  • 1
    Great solution!! I would rather rename `textArea` to `encoder` – Jaime Jul 06 '23 at 14:42
  • 1
    Thanks, does this solution cover the cases he library solves like astral symbols? https://stackoverflow.com/a/23834738/21386264 – Cemstrian Aug 28 '23 at 13:21
  • 1
    @Cemstrian happy to hear, that helps to you :) – Harsh Patel Aug 28 '23 at 14:34
  • @HarshPatel does it fix the all problems out there in the wild or is it a generic solution? I will encode the HTML before sending to Express/MongooDB, is it ok to use for this setup? – Cemstrian Aug 29 '23 at 02:51
  • 1
    @Cemstrian The provided functions for HTML entity encoding and decoding are helpful but have limitations. They aren't a one-size-fits-all solution due to browser dependency, potential performance issues, and contextual constraints. While encoding HTML before sending to Express and MongoDB can add a layer of security, remember that these functions should be part of a comprehensive security strategy that includes input validation and proper output encoding. – Harsh Patel Aug 29 '23 at 04:33
  • 1
    @Cemstrian The provided functions use the browser's Document Object Model (DOM) API to work with HTML entities. This means they rely on the browser environment to function correctly. If you were to use these functions in a non-browser environment (like a server-side script), they might not work as intended or might not work at all. – Harsh Patel Aug 29 '23 at 04:34
7

If you're already using jQuery, try html().

$('<div>').text('<script>alert("gotcha!")</script>').html()
// "&lt;script&gt;alert("gotcha!")&lt;/script&gt;"

An in-memory div element is created, its text content is set with text(), and html() is called on it.

It's ugly, it wastes a bit of memory, and I have no idea if it's as thorough as something like the he library but if you're already using jQuery, maybe this is an option for you.

Taken from blog post Encode HTML entities with jQuery by Felix Geisendörfer.

Jared Beck
  • 5
    To avoid instantiating a node every time, you can save the node: `var converter=$("<div>");` and later reuse it: `html1=converter.text(text1).html(); html2=converter.text(text2).html();`...
    – FrancescoMM Mar 17 '15 at 17:22
7

If you want to avoid encoding HTML entities more than once:

function encodeHTML(str){
    return str.replace(/([\u00A0-\u9999<>&])(.|$)/g, function(full, char, next) {
      if(char !== '&' || next !== '#'){
        if(/[\u00A0-\u9999<>&]/.test(next))
          next = '&#' + next.charCodeAt(0) + ';';

        return '&#' + char.charCodeAt(0) + ';' + next;
      }

      return full;
    });
}

function decodeHTML(str){
    return str.replace(/&#([0-9]+);/g, function(full, int) {
        return String.fromCharCode(parseInt(int));
    });
}

# Example

var text = "<a>Content &#169; <#>&<&#># </a>";

text = encodeHTML(text);
console.log("Encode 1 times: " + text);

// &#60;a&#62;Content &#169; &#60;#&#62;&#38;&#60;&#38;#&#62;# &#60;/a&#62;

text = encodeHTML(text);
console.log("Encode 2 times: " + text);

// &#60;a&#62;Content &#169; &#60;#&#62;&#38;&#60;&#38;#&#62;# &#60;/a&#62;

text = decodeHTML(text);
console.log("Decoded: " + text);

// <a>Content © <#>&<&#># </a>
StefansArya
  • This is only useful if you have a mixed partially escaped text to start with, and it introduces bugs as it can't properly encode all strings: `<#>` would come out as `<#>` – Rick Oct 16 '20 at 19:32
  • @Rick Thanks for letting me know about that; I have updated my answer to improve it. – StefansArya Oct 17 '20 at 03:44
6

HTML Special Characters & its ESCAPE CODES

Reserved Characters must be escaped by HTML: We can use a character escape to represent any Unicode character [Ex: & - U+00026] in HTML, XHTML or XML using only ASCII characters. Numeric character references [Ex: ampersand(&) - &#38;] & Named character references [Ex: &amp;] are types of character escape used in markup.


Predefined Entities

    Original character    XML entity replacement    XML numeric replacement
    <                     &lt;                      &#60;
    >                     &gt;                      &#62;
    "                     &quot;                    &#34;
    &                     &amp;                     &#38;
    '                     &apos;                    &#39;

To display HTML tags as plain text in a web page, we use the <pre> and <code> tags, or we can escape them. Escape the string by replacing every occurrence of the "&" character with the string "&amp;" and every occurrence of the ">" character with the string "&gt;". Ex: stackoverflow post

function escapeCharEntities() {
    var map = {
        "&": "&amp;",
        "<": "&lt;",
        ">": "&gt;",
        "\"": "&quot;",
        "'": "&apos;"
    };
    return map;
}

var mapkeys = '', mapvalues = '';
var html = {
    encodeRex : function () {
        return  new RegExp(mapkeys, 'g'); // "[&<>"']"
    }, 
    decodeRex : function () {
        return  new RegExp(mapvalues, 'g'); // "(&amp;|&lt;|&gt;|&quot;|&apos;)"
    },
    encodeMap : JSON.parse( JSON.stringify( escapeCharEntities () ) ), // json = {&: "&amp;", <: "&lt;", >: "&gt;", ": "&quot;", ': "&apos;"}
    decodeMap : JSON.parse( JSON.stringify( swapJsonKeyValues( escapeCharEntities () ) ) ),
    encode : function ( str ) {
        var encodeRexs = html.encodeRex();
        console.log('Encode Rex: ', encodeRexs); // /[&<>"']/gm
        return str.replace(encodeRexs, function(m) { console.log('Encode M: ', m); return html.encodeMap[m]; }); // m = < " > SpecialChars
    },
    decode : function ( str ) {
        var decodeRexs = html.decodeRex();
        console.log('Decode Rex: ', decodeRexs); // /(&amp;|&lt;|&gt;|&quot;|&apos;)/g
        return str.replace(decodeRexs, function(m) { console.log('Decode M: ', m); return html.decodeMap[m]; }); // m = &lt; &quot; &gt;
    }
};

function swapJsonKeyValues ( json ) {
    var count = Object.keys( json ).length;
    var obj = {};
    var keys = '[', val = '(', keysCount = 1;
    for(var key in json) {
        if ( json.hasOwnProperty( key ) ) {
            obj[ json[ key ] ] = key;
            keys += key;
            if( keysCount < count ) {
                val += json[ key ]+'|';
            } else {
                val += json[ key ];
            }
            keysCount++;
        }
    }
    keys += ']';    val  += ')';
    console.log( keys, ' == ', val);
    mapkeys = keys;
    mapvalues = val;
    return obj;
}

console.log('Encode: ', html.encode('<input type="password" name="password" value=""/>') ); 
console.log('Decode: ', html.decode(html.encode('<input type="password" name="password" value=""/>')) );

O/P:
Encode:  &lt;input type=&quot;password&quot; name=&quot;password&quot; value=&quot;&quot;/&gt;
Decode:  <input type="password" name="password" value=""/>
mickmackusa
Yash
  • This is great for adding html source code in Json format into iframe srcdoc string. – Nime Cloud May 03 '20 at 15:00
  • 1
    This doesn't include ®, so it won't help the OP. Additionally, this JS is so much more complicated than many of the other solutions, even the ones that only use a short mapping like this. swapJsonKeyValues is poorly named as it has required side effects (defining mapkeys and mapvalues) – Rick Oct 16 '20 at 19:22
  • @mickmackusa I have updated the post with the debug values. `m` holds the special characters of an input String. – Yash Nov 11 '20 at 06:49
  • If there is any mistake in the post. So, please try to correct the post and provide the comments. – Yash Nov 11 '20 at 07:08
5
var htmlEntities = [
            {regex:/&/g,entity:'&amp;'},
            {regex:/>/g,entity:'&gt;'},
            {regex:/</g,entity:'&lt;'},
            {regex:/"/g,entity:'&quot;'},
            {regex:/á/g,entity:'&aacute;'},
            {regex:/é/g,entity:'&eacute;'},
            {regex:/í/g,entity:'&iacute;'},
            {regex:/ó/g,entity:'&oacute;'},
            {regex:/ú/g,entity:'&uacute;'}
        ];

total = <some string value>

for(v in htmlEntities){
    total = total.replace(htmlEntities[v].regex, htmlEntities[v].entity);
}

An array solution

  • 3
    Please explain how this solves the problem in a unique better way than above. At glance, it would appear that this solution is slower because it modifies the string in multiple passes instead of all at one. However, I may be incorrect. Either way, you must back up you post with an explanation. – Jack G May 21 '18 at 23:59
  • Its an alternative, you can use regex directly from the array... : D – Cesar De la Cruz Jun 25 '18 at 19:10
  • This is one regex for each character (for v in ....). If you wanted to cover all of UTF-8, this would be 65,000 regex's and 65,000 lines of execution. – HoldOffHunger Jul 19 '18 at 16:51
  • 3
    I'm only interested in converting three characters to entities so this answer is better in my case and i'm glad it was here – Drew Sep 10 '18 at 20:35
2

Sometimes you just want to encode every character... This function replaces "everything but nothing" in the regexp.

function encode(e){return e.replace(/[^]/g,function(e){return"&#"+e.charCodeAt(0)+";"})}

function encode(w) {
  return w.replace(/[^]/g, function(w) {
    return "&#" + w.charCodeAt(0) + ";";
  });
}

test.value=encode(document.body.innerHTML.trim());
<textarea id=test rows=11 cols=55>www.WHAK.com</textarea>
Dave Brown
  • 1
    Replace the `^` by a `.` to conserve emojis: `function encode(e){return e.replace(/[.]/g,function(e){return"&#"+e.charCodeAt(0)+";"})}`. – Swiss Mister Mar 08 '17 at 14:22
2

On Node.js, install html-entities

then:

import {encode} from "html-entities";
encode(str);
Daniel De León
1

Check out the tutorial from Ourcodeworld: "Encode and decode html entities with javascript"

Most importantly, the he library example

he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly encode all symbols:
he.encode('foo © bar ≠ baz 𝌆 qux', {
 'encodeEverything': true
});

he.decode('foo &copy; bar &ne; baz &#x1D306; qux');
// → 'foo © bar ≠ baz 𝌆 qux'

This library would probably make your coding easier and better managed. It is popular, regularly updated and follows the HTML spec. It itself has no dependencies, as can be seen in the package.json

Tobias Mühl
jking
  • OP asked for vanilla JS and vanilla JS offers element.innerText. If there's an advantage to the library please add it to your answer. – Rick Oct 16 '20 at 12:43
1

Here is how I implemented the encoding. I took inspiration from the answers given above.

function encodeHTML(str) {
  const code = {
      ' ' : '&nbsp;',
      '¢' : '&cent;',
      '£' : '&pound;',
      '¥' : '&yen;',
      '€' : '&euro;', 
      '©' : '&copy;',
      '®' : '&reg;',
      '<' : '&lt;', 
      '>' : '&gt;',  
      '"' : '&quot;', 
      '&' : '&amp;',
      '\'' : '&apos;'
  };
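  // Fall back to a numeric reference for matched characters that are not in the map
  // (otherwise they would be replaced with the string "undefined").
  return str.replace(/[\u00A0-\u9999<>&'"]/g, (i) => code[i] || '&#' + i.charCodeAt(0) + ';');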
}

// TEST
console.log(encodeHTML("Dolce & Gabbana"));
console.log(encodeHTML("Hamburgers < Pizza < Tacos"));
console.log(encodeHTML("Sixty > twelve"));
console.log(encodeHTML('Stuff in "quotation marks"'));
console.log(encodeHTML("Schindler's List"));
console.log(encodeHTML("<>"));
Dforrunner
0

htmlentities() converts HTML Entities

So we build a constant that will contain our html tags we want to convert.

const htmlEntities = [ 
    {regex:'&',entity:'&amp;'},
    {regex:'>',entity:'&gt;'},
    {regex:'<',entity:'&lt;'} 
  ];

We build a function that will convert all corresponding html characters to string : Html ==> String

 function htmlentities (s){
    var reg; 
    for (var v in htmlEntities) { // declare v so it does not leak as a global
      reg = new RegExp(htmlEntities[v].regex, 'g');
      s = s.replace(reg, htmlEntities[v].entity);
    }
    return s;
  }

To decode, we build a reverse function that will convert all string to their equivalent html . String ==> html

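 function html_entities_decode (s){
    var reg;
    // Decode in reverse order so that '&amp;' is handled last;
    // otherwise '&amp;lt;' would be decoded twice and end up as '<'.
    for (var v = htmlEntities.length - 1; v >= 0; v--) {
      reg = new RegExp(htmlEntities[v].entity, 'g');
      s = s.replace(reg, htmlEntities[v].regex);
    }
    return s;
   }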

Afterwards, we can encode all other special characters (é è ...) with encodeURIComponent(), which percent-encodes them rather than producing HTML entities.

Use Case

 var s  = '<div> God bless you guy   </div> '
 var h = encodeURIComponent(htmlentities(s));         /** To encode */
 h =  html_entities_decode(decodeURIComponent(h));     /** To decode */
Rehum
0

I wanted to share my solution here for other readers that stumble upon this thread.

I deliberately escape quotes here so that the encoded value is "attribute safe". < and > are also deliberately encoded to safely escape any HTML tags.

This works by using the u flag on the RegExp, which will match any full Unicode code point. I also use codePointAt instead of charCodeAt so I can generate any full Unicode code point. I then ensure any character matched is not a specific set of ASCII characters, and finally, encode each Unicode character found.

This also "ignores" any currently escaped character sequences, by matching against those first.

function encodeHTMLEntities(str) {
  if (!str)
    return str;

  // First:
  //   Match any currently encoded characters first
  //   (i.e. `&#67;`, `&#x43;`, or '&amp;')
  //   Finally, match on any character with `.`, using
  //   the `u` RegExp flag to match full Unicode code points.
  // Second:
  //   1) Already encoded characters must be at least four
  //      characters (i.e. `&#1;`), and must start with an
  //      '&' character. If this is true, then the match
  //      is an already encoded character sequence, so just
  //      return it.
  //   2) Otherwise, see if the character is a single UTF-16
  //      character, and is in our whitelist of allowed
  //      characters (common ASCII, without quotes or < or >).
  //      If this is the case, then don't encode the character,
  //      and simply return it.
  //   3) Finally, use codePointAt to encode the Unicode character.
  return str.replace(/&#[0-9]+;|&#x[0-9a-fA-F]+;|&[0-9a-zA-Z]{2,};|./gu, (m) => {
    // #1, is this an already encoded character sequence?
    // If so, just return it.
    if (m.length >= 4 && m[0] === '&')
      return m;

    // #2, is this one of our whitelisted ASCII characters
    // (not including quotes or < or >)
    if (m.length === 1 && m.match(/[a-zA-Z0-9\s\t\n\r~`!@#$%^&*_+=(){}[\]/\\,?:;|.-]/))
      return m;

    // #3 Otherwise, encode it as unicode
    return `&#${m.codePointAt(0)};`;
  });
}

Example:

console.log(encodeHTMLEntities('&amp; 😊 🙂 🎁 testing 🤣 &#x43; <stuff> "things" wow! &#67;'))

Outputs:

&amp; &#128522; &#128578; &#127873; testing &#129315; &#x43; &#60;stuff&#62; &#34;things&#34; wow! &#67;
th317erd
  • 304
  • 3
  • 11
-1

<!DOCTYPE html>
<html>
<style>
button {
background: #ccc;
padding: 14px;
width: 400px;
font-size: 32px;
}
#demo {
font-size: 20px;
font-family: Arial;
font-weight: bold;
}
</style>
<body>

<p>Click the button to decode.</p>

<button onclick="entitycode()">Html Code</button>

<p id="demo"></p>


<script>
function entitycode() {
  var uri = "&quot; = quotation mark __ &apos; = apostrophe  __ &amp; = ampersand __ &lt; = less-than __ &gt; = greater-than __ &nbsp; = non-breaking space __ &iexcl; = inverted exclamation mark __ &cent; = cent __ &pound; = pound __ &curren; = currency __ &yen; = yen __ &brvbar; = broken vertical bar __ &sect; = section __ &uml; = spacing diaeresis __ &copy; = copyright __ &ordf; = feminine ordinal indicator __ &laquo; = angle quotation mark (left) __ &not; = negation __ &shy; = soft hyphen __ &reg; = registered trademark __ &macr; = spacing macron __ &deg; = degree __ &plusmn; = plus-or-minus  __ &sup2; = superscript 2 __ &sup3; = superscript 3 __ &acute; = spacing acute __ &micro; = micro __ &para; = paragraph __ &middot; = middle dot __ &cedil; = spacing cedilla __ &sup1; = superscript 1 __ &ordm; = masculine ordinal indicator __ &raquo; = angle quotation mark (right) __ &frac14; = fraction 1/4 __ &frac12; = fraction 1/2 __ &frac34; = fraction 3/4 __ &iquest; = inverted question mark __ &times; = multiplication __ &divide; = division __ &Agrave; = capital a, grave accent __ &Aacute; = capital a, acute accent __ &Acirc; = capital a, circumflex accent __ &Atilde; = capital a, tilde __ &Auml; = capital a, umlaut mark __ &Aring; = capital a, ring __ &AElig; = capital ae __ &Ccedil; = capital c, cedilla __ &Egrave; = capital e, grave accent __ &Eacute; = capital e, acute accent __ &Ecirc; = capital e, circumflex accent __ &Euml; = capital e, umlaut mark __ &Igrave; = capital i, grave accent __ &Iacute; = capital i, acute accent __ &Icirc; = capital i, circumflex accent __ &Iuml; = capital i, umlaut mark __ &ETH; = capital eth, Icelandic __ &Ntilde; = capital n, tilde __ &Ograve; = capital o, grave accent __ &Oacute; = capital o, acute accent __ &Ocirc; = capital o, circumflex accent __ &Otilde; = capital o, tilde __ &Ouml; = capital o, umlaut mark __ &Oslash; = capital o, slash __ &Ugrave; = capital u, grave accent __ &Uacute; = capital u, acute accent __ &Ucirc; = capital u, " +
    "circumflex accent __ &Uuml; = capital u, umlaut mark __ &Yacute; = capital y, acute accent __ &THORN; = capital THORN, Icelandic __ &szlig; = small sharp s, German __ &agrave; = small a, grave accent __ &aacute; = small a, acute accent __ &acirc; = small a, circumflex accent __ &atilde; = small a, tilde __ &auml; = small a, umlaut mark __ &aring; = small a, ring __ &aelig; = small ae __ &ccedil; = small c, cedilla __ &egrave; = small e, grave accent __ &eacute; = small e, acute accent __ &ecirc; = small e, circumflex accent __ &euml; = small e, umlaut mark __ &igrave; = small i, grave accent __ &iacute; = small i, acute accent __ &icirc; = small i, circumflex accent __ &iuml; = small i, umlaut mark __ &eth; = small eth, Icelandic __ &ntilde; = small n, tilde __ &ograve; = small o, grave accent __ &oacute; = small o, acute accent __ &ocirc; = small o, circumflex accent __ &otilde; = small o, tilde __ &ouml; = small o, umlaut mark __ &oslash; = small o, slash __ &ugrave; = small u, grave accent __ &uacute; = small u, acute accent __ &ucirc; = small u, circumflex accent __ &uuml; = small u, umlaut mark __ &yacute; = small y, acute accent __ &thorn; = small thorn, Icelandic __ &yuml; = small y, umlaut mark";
  var enc = encodeURI(uri);
  var dec = decodeURI(enc);
  var res = dec;
  document.getElementById("demo").innerHTML = res;
}
</script>

</body>
</html>
Vinod Kumar
  • 1,191
  • 14
  • 12
  • This doesn't appear to answer the question, and it's a code only answer. Please provide a description of what the code is doing and how it relates to the question. – Rick Oct 16 '20 at 12:47
-1
function htmlEntityReplacer(encoded_text) {
    var decoded_text = encoded_text;

    const all_entities = [{ /* source: https://www.w3schools.com/html/html_entities.asp */
        encoded: `&nbsp;`,
        decoded: ` `
    }, {
        encoded: `&lt;`,
        decoded: `<`
    }, {
        encoded: `&gt;`,
        decoded: `>`
    }, {
        encoded: `&amp;`,
        decoded: `&`
    }, {
        encoded: `&quot;`,
        decoded: `"`
    }, {
        encoded: `&apos;`,
        decoded: `'`
    }, {
        encoded: `&cent;`,
        decoded: `¢`
    }, {
        encoded: `&pound;`,
        decoded: `£`
    }, {
        encoded: `&yen;`,
        decoded: `¥`
    }, {
        encoded: `&euro;`,
        decoded: `€`
    }, {
        encoded: `&copy;`,
        decoded: `©`
    }, {
        encoded: `&reg;`,
        decoded: `®`
    }]
    for (var i = 0; i < all_entities.length; i++) {
        decoded_text = decoded_text.replace(new RegExp(all_entities[i].encoded, 'g'), all_entities[i].decoded);
    }
    return decoded_text;
}

// For node or vanilla

estemendoza
  • 3,023
  • 5
  • 31
  • 51
  • The question is about encoding, not decoding, so this is backwards. Also, it misses out a *lot* of entities (or it includes a lot of characters that don't need encoding. Also the very first value is wrong (the decoded value is a regular space instead of a non-breaking space). – Quentin Jan 11 '22 at 15:58
-2

You can use the charCodeAt() method to check if the specified character has a value higher than 127 and convert it to a numeric character reference using toString(16).
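
A minimal sketch of that idea follows. The function name `encodeNonAscii` is my own, not from any library. It assumes the input only contains characters from the Basic Multilingual Plane: charCodeAt() returns UTF-16 code units, so characters outside that range (most emoji, for example) would come out as two surrogate references, and codePointAt() would be needed instead.

```javascript
// Sketch: turn every character with a code above 127 into a hexadecimal
// numeric character reference (e.g. "®" becomes "&#xae;").
// Assumption: input is limited to the Basic Multilingual Plane.
function encodeNonAscii(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code > 127) {
      out += '&#x' + code.toString(16) + ';';
    } else {
      out += text.charAt(i);   // plain ASCII is copied through unchanged
    }
  }
  return out;
}

console.log(encodeNonAscii('Acme® © 2013 Brand™'));
// Acme&#xae; &#xa9; 2013 Brand&#x2122;
```

Note that this leaves ASCII characters such as `&` and `<` untouched, so it complements HTML escaping rather than replacing it.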

bolistene
  • 55
  • 6