How can I remove non-printable unicode characters in a multi-language input?
When users with different localizations paste strings they will sometimes unintentionally embed non-printing characters. For example:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC"
var displaysAs = decodeURI(weird); // Users see only "Test"
But I can't figure out how to strip the non-printing characters in a way that doesn't impact other languages like these:
encodeURI("شنط") = "%D8%B4%D9%86%D8%B7"
encodeURI("戦艦帝国") = "%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
For example, the following attempt to repair the weird example above doesn't work:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC";
var displaysAs = decodeURI(weird);
var stillWeird = encodeURI(displaysAs.replace(/\s/g, ""));
// value is again "%E2%80%AA%E2%80%8ETest%E2%80%AC"
console.log('before:', weird);
console.log('after:', displaysAs);
console.log('again:', stillWeird);
.as-console-wrapper{min-height:100%}
As noted in the comments, this is largely a specification problem. I don't have an enumeration of non-printing unicode expressions. I can only observe that one can paste a unicode string into a browser Input and not be aware that it has undisplayed characters in it. I assume that some logic in the browser determines whether each unicode character will display something. This problem would be solved if I can apply that same logic to the underlying string in order to get the "display string."
Put another way: For any two unicode strings that look identical on the browser, I need a transformation that guarantees that their values are identical.