More efficiently replace escaped Unicode characters on a page?

Question

I have a page which includes escaped Unicode characters. (For example the characters 漢字 are escaped as \u6F22\u5B57). This page shows how you can use the unescape() method to convert the escaped \u6F22\u5B57 to 漢字. I have a method that converts all of the Unicode escaped characters, but it is not very fast.

function DecodeAllUnicodeCharacters (strID)
{
    var strstr = $(strID).innerHTML;
    var arrEncodeChars = strstr.match(/\\u[0-9A-Z]{4,6}/g);
    for (var ii = 0; ii < arrEncodeChars.length; ii++) {
        var sUnescaped = eval("unescape('"+arrEncodeChars[ii]+"')");
        strstr = strstr.replace(arrEncodeChars[ii], sUnescaped);
    }
    $(strID).innerHTML = strstr;
}

The part that takes longest is setting the innerHTML here: $(strID).innerHTML = strstr; Is there a good way to replace the characters without redoing the innerHTML of the whole page?

I don't think this should need to be done on the client side. — Bergi, Oct 02 '14 at 23:40
I really wish I could do this on the server side. I really do. But if I want to send the characters unencoded, then I need to replace all of the strings in the code with more modern strings. And that re-write will probably take a year. My boss is not going to accept that, so I have to add a javascript hack. — Holtorf, Oct 03 '14 at 00:27
What do you mean, "modern strings"? Shouldn't it be possible to do that rewrite programmatically? — Bergi, Oct 03 '14 at 00:36
By modern strings I mean something like std::string which can handle 32 bit unicode characters. I may be exaggerating by saying a year, but our codebase is a 10+ year mess of string casts into and out of string-like objects that only our company uses. This is not the sort of thing that I could just ctrl+R replace. — Holtorf, Oct 03 '14 at 01:21
Another reason is that I will probably have to change a bunch of company libraries, and I am going to have a hard time telling everyone else that they need to change their code just because I want to make my job easier. I think I'm stuck with making the change in JavaScript. — Holtorf, Oct 03 '14 at 01:26
OK, I see. But shouldn't it still be possible to do the escaping in the output methods (where you generate web responses, i.e. HTML/JS/CSS/JSON etc.) of your server, so that the pages are served with the correct content-type? — Bergi, Oct 03 '14 at 10:13
Sometimes I can, so I already do that, but sometimes I can't due to the templates. — Holtorf, Oct 03 '14 at 15:21

score 2 · Accepted Answer · edited May 23 '17 at 10:33

Setting innerHTML is slow because the browser has to parse the string as HTML, and any child elements are destroyed and recreated, which is slower still. Instead, find just the text nodes and rewrite only those that contain escaped content. The following is based on a previous question and is demonstrated in a fiddle.

Element.addMethods({
    // element is Prototype-extended HTMLElement
    // nodeType is a Node.* constant
    // callback is a function where first argument is a Node
    forEachDescendant: function (element, nodeType, callback)
    {
        element = $(element);
        if (!element) return;
        var node = element.firstChild;
        while (node != null) {
            if (node.nodeType == nodeType) {
                callback(node);
            }

            if(node.hasChildNodes()) {
                node = node.firstChild;
            }
            else {
                while(node.nextSibling == null && node.parentNode != element) {
                    node = node.parentNode;
                }
                node = node.nextSibling;
            }
        }
    },
    decodeUnicode: function (element)
    {
        var regex = /\\u([0-9A-Fa-f]{4,6})/g;
        Element.forEachDescendant(element, Node.TEXT_NODE, function(node) {
            // regex.test fails faster than regex.exec for non-matching nodes
            if (regex.test(node.data)) {
                // only update when necessary
                node.data = node.data.replace(regex, function(_, code) {
                    // code is the hexadecimal value captured by the regex
                    // note: String.fromCharCode only handles values up to 0xFFFF
                    return String.fromCharCode(parseInt(code, 16));
                });
            }
        });
    }
});

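The benefit of Element.addMethods, aside from aesthetics, is that each method is available both on extended elements and as a static function on Element, so it can be passed around as a callback. You can use decodeUnicode in several ways: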

// single element
$('element_id').decodeUnicode();
// or
Element.decodeUnicode('element_id');

// multiple elements
$$('p').each(Element.decodeUnicode);
// or
$$('p').invoke('decodeUnicode');
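
For pages that do not load Prototype, the same idea can be sketched with nothing but standard node properties (firstChild, nextSibling, parentNode, nodeType, data). The names forEachTextNode and decodeTextNodes are hypothetical, and this version assumes plain four-digit escapes only; treat it as a rough sketch rather than a drop-in replacement for the code above.

```javascript
// Hypothetical helper names; not part of Prototype or of the answer above.
var TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Visit every text node below `root` in document order, without recursion.
function forEachTextNode(root, callback) {
    var node = root.firstChild;
    while (node) {
        if (node.nodeType === TEXT_NODE) {
            callback(node);
        }
        if (node.firstChild) {
            node = node.firstChild;
        } else {
            // climb until a next sibling exists or we are back directly under root
            while (!node.nextSibling && node.parentNode !== root) {
                node = node.parentNode;
            }
            node = node.nextSibling;
        }
    }
}

// Rewrite only the text nodes that contain a \uXXXX escape (assumes 4 hex digits).
function decodeTextNodes(root) {
    var escapePattern = /\\u([0-9A-Fa-f]{4})/g;
    forEachTextNode(root, function (node) {
        if (node.data.indexOf("\\u") === -1) return; // cheap pre-check, skip untouched nodes
        node.data = node.data.replace(escapePattern, function (_, hex) {
            return String.fromCharCode(parseInt(hex, 16));
        });
    });
}
```

In a browser you would call something like decodeTextNodes(document.body). Browsers also offer document.createTreeWalker(root, NodeFilter.SHOW_TEXT), which can replace the hand-written walk.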

I deliberately wrote my own decode method instead of `unescape` because (1) `unescape` is deprecated (2) it only works in `eval` and `eval` is slow and dangerous (3) it wasn't complicated. — clockworkgeek, Oct 03 '14 at 15:10
I just realised that [`unescape`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/unescape) has no effect in the original question since it only works for URL encodings. It appeared to work because the use of `eval` put data into a [JS string](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String) which uses \uXXXX style encoding. You could get the same effect with just `eval("'"+arrEncodeChars[ii]+"'")` — clockworkgeek, Oct 04 '14 at 11:11
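
To illustrate the point about avoiding both `unescape` and `eval`, here is a minimal sketch of a standalone decoder. The name decodeUnicodeEscapes is hypothetical. It assumes that a 5 or 6 digit escape denotes a single code point (my reading of the {4,6} in the regex, not something the question confirms), and it builds a UTF-16 surrogate pair by hand, since String.fromCharCode only handles values up to 0xFFFF.

```javascript
// Hypothetical helper; decodes \uXXXX escapes (4 to 6 hex digits) without eval.
function decodeUnicodeEscapes(text) {
    return text.replace(/\\u([0-9A-Fa-f]{4,6})/g, function (match, hex) {
        var codePoint = parseInt(hex, 16);
        if (codePoint > 0x10FFFF) {
            return match; // not a valid code point: leave the escape untouched
        }
        if (codePoint <= 0xFFFF) {
            return String.fromCharCode(codePoint);
        }
        // Above the Basic Multilingual Plane: split into a surrogate pair.
        var offset = codePoint - 0x10000;
        return String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    });
}
```

One caveat with the greedy {4,6}: a four-digit escape followed directly by literal hex letters, such as \u6F22AB, is read as one six-digit value. If the server only ever emits four digits, {4} would be the safer pattern.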

score 0 · Answer 2 · answered Oct 03 '14 at 00:09

Is this what you wanted?

function DecodeAllUnicodeCharacters(id) {
    $(id).innerHTML = decodeURI($(id).innerHTML);
}

answered Oct 03 '14 at 00:09 · denim2x

No. I only want to decode the \uXXXX characters. If you run decodeURI() over HTML elements, it mangles them, so it is essential that I only run unescape() on the characters I want to change rather than on the entire page. Also, $(id).innerHTML = decodeURI($(id).innerHTML); runs at about the same speed as the code above, because most of the time is taken by $(id).innerHTML = "the html for the page". – Holtorf Oct 03 '14 at 00:22
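
A small sketch of why decodeURI does not help here, using made-up sample strings: it only understands percent-encoded sequences, so \uXXXX escapes pass through unchanged, and a stray % in markup makes it throw.

```javascript
var escaped = "\\u6F22\\u5B57";             // backslash-u escapes, as on the page
var percentEncoded = "%E6%BC%A2%E5%AD%97";  // the same two characters as percent-encoded UTF-8

var untouched = decodeURI(escaped);         // no percent sequences, so nothing changes
var decoded = decodeURI(percentEncoded);    // this is the format decodeURI is meant for

// Markup containing a bare percent sign is not a valid URI sequence.
var threw = false;
try {
    decodeURI("<td width=\"100%\">");
} catch (e) {
    threw = e instanceof URIError;
}
```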
