How can I start with an ã and figure out all the various ways it can be represented in Unicode?
You're looking for Unicode equivalence.
The two forms you mentioned are the composed form (U+00E3) and the decomposed form (U+0061 followed by the combining tilde U+0303). To get canonically equivalent Unicode forms, you could use String.prototype.normalize().
- Important: Check the link for Browser Compatibility.
str.normalize([form])
accepts the following forms:
- NFC — Normalization Form Canonical Composition.
- NFD — Normalization Form Canonical Decomposition.
- NFKC — Normalization Form Compatibility Composition.
- NFKD — Normalization Form Compatibility Decomposition.
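As a small sketch of how the form argument behaves (the variable names here are just for illustration): when the argument is omitted, NFC is used, and a string that isn't one of the four names above should throw a RangeError.

```javascript
// Decomposed input: "a" followed by the combining tilde U+0303
var sample = "a\u0303";

// No argument: the form defaults to NFC
var defaultForm = sample.normalize();
var sameAsNFC = defaultForm === sample.normalize("NFC");

// An unknown form name is rejected with a RangeError
var errorName = null;
try {
    sample.normalize("NFX"); // "NFX" is deliberately invalid
} catch (e) {
    errorName = e.name;
}

console.log(sameAsNFC, errorName);
```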
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.
Quote from Wikipedia
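To make the difference concrete, here is a minimal sketch. The ligature ﬁ (U+FB01) is my own choice of example, not something from the question: it is only compatibility-equivalent to "fi", so the canonical forms leave it alone while the compatibility forms replace it.

```javascript
// Canonical equivalence: precomposed ã vs. "a" + combining tilde
var composed = "\u00E3";
var decomposed = "a\u0303";
var rawEqual = composed === decomposed; // different code unit sequences
var nfcEqual = composed.normalize("NFC") === decomposed.normalize("NFC");

// Compatibility equivalence: the ligature U+FB01 vs. the two letters "fi"
var ligature = "\uFB01";
var ligatureNFC = ligature.normalize("NFC");   // stays U+FB01
var ligatureNFKC = ligature.normalize("NFKC"); // becomes "fi"

console.log(rawEqual, nfcEqual, ligatureNFC === ligature, ligatureNFKC);
```

So if you only care about ã, the canonical and compatibility forms should give you the same results; the distinction matters for characters like ligatures or full-width letters.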
Choose the equivalence form you like.
For example, using the Latin small letter a with tilde (ã) in compatibility form:
var char = "ã";
var nfkc = char.normalize('NFKC');
var nfkd = char.normalize('NFKD');
// Returns the UTF-16 code units of str as Unicode escape sequences
function escapeUnicode(str) {
    var i;
    var result = "";
    for (i = 0; i < str.length; ++i) {
        // Code unit as upper-case hex, zero-padded to four digits
        var c = str.charCodeAt(i).toString(16).toUpperCase();
        while (c.length < 4) {
            c = "0" + c;
        }
        result += "\\u" + c;
    }
    return result;
}
document.write('<br />NFKC: ' + escapeUnicode(nfkc));
document.write('<br />NFKD: ' + escapeUnicode(nfkd));
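If you want to see all four forms side by side, something like the following sketch could work. Note that toCodePoints and listNormalizedForms are hypothetical helper names I made up for this example, and normalize() only returns one representative sequence per form, not every equivalent sequence that might exist.

```javascript
// Hypothetical helper: formats each code point of str as U+XXXX
function toCodePoints(str) {
    return Array.from(str).map(function (ch) {
        var hex = ch.codePointAt(0).toString(16).toUpperCase();
        while (hex.length < 4) {
            hex = "0" + hex;
        }
        return "U+" + hex;
    }).join(" ");
}

// Hypothetical helper: maps each normalization form to its code points
function listNormalizedForms(str) {
    var result = {};
    ["NFC", "NFD", "NFKC", "NFKD"].forEach(function (form) {
        result[form] = toCodePoints(str.normalize(form));
    });
    return result;
}

var forms = listNormalizedForms("\u00E3");
console.log(forms);
// NFC and NFKC: U+00E3; NFD and NFKD: U+0061 U+0303
```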