My Unicode-related vocabulary isn't very good, so sorry for the verbose question.

A character like ã can be represented by \u00e3 (Latin small letter a with tilde), or \u0061 (Latin small letter a) in combination with combining diacritical mark \u0303 (combining tilde). Now, in Java, in order to match any Unicode letter, I'd look for [\p{L}], but JavaScript doesn't understand that, so I'll have to look for the individual code points (\unnnn). How can I start with an ã and figure out all the various ways it can be represented in Unicode so I can include them in my regular expression in \unnnn format?

T.J. Crowder
Christian
  • Your original question probably fell afoul of the *"Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow"* rule. I've edited it to try to implement the second half of that close reason (*"Instead, describe the problem and what has been done so far to solve it."*) as best I can. – T.J. Crowder Oct 06 '15 at 13:59
  • You might want to change the focus of the question, though, because I think you're hitting the X/Y problem: You've asked a question about X ("How do I get the list of Unicode ways to represent ã so I can include them in a regex in `\unnnn` form?"), but your *real* question is Y ("How do I reliably detect ã in a JS regular expression whether it's written `\u00e3` or `\u0061\u0303`?"). – T.J. Crowder Oct 06 '15 at 14:01
  • Very interesting question. – T.J. Crowder Oct 06 '15 at 14:01
  • You are asking two questions. One is "How can I do stuff like `\p{L}`?"; the other is "How do I decompose a Unicode character?" The first question is discussed [here](http://stackoverflow.com/questions/280712/javascript-unicode-regexes); the second [here](http://stackoverflow.com/questions/7772553/javascript-unicode-normalization). – Raymond Chen Oct 06 '15 at 14:02
  • Thanks. @raymond-chen: I really want to know how to get it decomposed and printed to screen, I guess... Which seems to be the answer to the first question, if I'm not mistaken... – Christian Oct 06 '15 at 14:05
  • Decomposition was your second question. The first was generalized character classes. – Raymond Chen Oct 06 '15 at 14:06

1 Answer

> How can I start with an ã and figure out all the various ways it can be represented in Unicode

You're looking for Unicode equivalence.

The two forms you mentioned are the composed form and the decomposed form. To get canonically equivalent Unicode forms, you can use String.prototype.normalize().

  • Important: check the documentation for browser compatibility.

str.normalize([form]) accepts the following forms:

  • NFC — Normalization Form Canonical Composition.
  • NFD — Normalization Form Canonical Decomposition.
  • NFKC — Normalization Form Compatibility Composition.
  • NFKD — Normalization Form Compatibility Decomposition.

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.

Quote from Wikipedia
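
To make the canonical/compatibility distinction concrete, here is a small sketch that runs all four forms over two sample characters. The `toEscapes` helper and the choice of the ﬁ ligature (U+FB01) as a compatibility example are illustrative assumptions, not part of the original answer.

```javascript
// Hypothetical helper: renders each UTF-16 code unit of str as \uXXXX.
function toEscapes(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    out += "\\u" + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

var aTilde = "\u00E3";   // precomposed ã
var ligature = "\uFB01"; // ﬁ, Latin small ligature fi

// Collect the escaped result of every normalization form for both samples.
var table = {};
["NFC", "NFD", "NFKC", "NFKD"].forEach(function (form) {
  table[form] = {
    aTilde: toEscapes(aTilde.normalize(form)),
    ligature: toEscapes(ligature.normalize(form))
  };
});

console.log(table);
```

In this sketch, ã only changes between the composed forms (`\u00E3`) and the decomposed forms (`\u0061\u0303`), so canonical normalization is enough for it. The ligature stays a single code point under NFC and NFD but is folded into a plain `f` followed by `i` under NFKC and NFKD, which is the kind of change the compatibility forms introduce.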

Choose the equivalence form that fits your use case.


For example, using the Latin small letter a with tilde in Compatibility Form:

var char = "ã";
var nfkc = char.normalize('NFKC');
var nfkd = char.normalize('NFKD');

// Returns each UTF-16 code unit of str as a \uXXXX escape sequence
function escapeUnicode(str){
    var i;
    var result = "";
    for( i = 0; i < str.length; ++i){
        var c = str.charCodeAt(i);
        c = c.toString(16).toUpperCase();
        while (c.length < 4) {
            c = "0" + c;
        }
        result += "\\u" + c;
    }
    return result;
}

document.write('<br />NFKC: ' + escapeUnicode(nfkc));
document.write('<br />NFKD: ' + escapeUnicode(nfkd));
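
Going one step further toward the regular expression the question asks about, the following is a rough sketch of two options: building an alternation from the NFC and NFD spellings, or normalizing the input first and matching a single spelling. The names `toRegexSource` and `canonicalVariants` are hypothetical helpers introduced here for illustration. Note that the alternation only covers the spellings that NFC and NFD produce, which is an assumption that holds for ã but may miss other equivalent orderings for characters carrying several combining marks.

```javascript
// Hypothetical helper: escapes each UTF-16 code unit as \uXXXX for use in a regex source.
function toRegexSource(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    out += "\\u" + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

// Hypothetical helper: distinct spellings of ch as given, in NFC, and in NFD.
function canonicalVariants(ch) {
  var variants = [];
  [ch, ch.normalize("NFC"), ch.normalize("NFD")].forEach(function (v) {
    if (variants.indexOf(v) === -1) {
      variants.push(v);
    }
  });
  return variants;
}

// Option 1: match either spelling with an alternation.
var source = "(?:" + canonicalVariants("\u00E3").map(toRegexSource).join("|") + ")";
var pattern = new RegExp(source, "g");

var sample = "S\u00E3o and Sa\u0303o"; // one precomposed ã, one decomposed ã
var matches = sample.match(pattern);

// Option 2: normalize the input to NFC, then match only the precomposed form.
var normalizedMatches = sample.normalize("NFC").match(/\u00E3/g);

console.log(source, matches.length, normalizedMatches.length);
```

Normalizing the input first is usually the simpler route, because the pattern then needs only one spelling per character. The alternation may be preferable when the original text must not be modified, for example when match offsets have to line up with the unnormalized string.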
Mariano