13

While reviewing JavaScript concepts, I found String.normalize(). It does not appear in W3Schools' "JavaScript String Reference", which is probably why I missed it before.

I found more information about it on HackerRank, which states:

Returns a string containing the Unicode Normalization Form of the calling string's value.

With the example:

var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));

having as output:

HackerRank
HackerRank

Also, in GeeksForGeeks:

The string.normalize() is an inbuilt function in javascript which is used to return a Unicode normalisation form of a given input string.

with the example:

<script> 
  
  // Taking a string as input. 
  var a = "GeeksForGeeks"; 
    
  // calling normalize function. 
  b = a.normalize('NFC') 
  c = a.normalize('NFD') 
  d = a.normalize('NFKC') 
  e = a.normalize('NFKD') 
    
  // Printing normalised form. 
  document.write(b +"<br>"); 
  document.write(c +"<br>"); 
  document.write(d +"<br>"); 
  document.write(e); 
    
</script> 

having as output:

GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks

Maybe the examples given are just really bad as they don't allow me to see any change.

I wonder... what's the point of this method?

the Tin Man
Tiago Martins Peres
  • 14
    Let me start by saying that w3schools.com is not an official reference. It has no affiliation with the W3C. Here's a proper resource: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize –  Jul 21 '20 at 11:32
  • 1
    I know that @ChrisG but the content is usually very good. – Tiago Martins Peres Jul 21 '20 at 11:33
  • 8
    No it's not, it not as abysmal as it used to be, but this community especially is still suffering from it. The mere existence of this question kind of proves my point, I guess? –  Jul 21 '20 at 11:34
  • @ChrisG GeeksForGeeks references a similar link to the one you shared - https://devdocs.io/javascript/global_objects/string/normalize . The difference is that in yours says `String.prototype.normalize()` and the other `String.normalize()` – Tiago Martins Peres Jul 21 '20 at 11:34
  • From those links still isn't very clear how the output varies based on the given arguments – Tiago Martins Peres Jul 21 '20 at 11:39
  • 1
    `String.prototype.normalize()` is correct in a technical sense, because `normalize()` is a dynamic method you call on instances, not the class itself. The point of `normalize()` is to be able to compare Strings that look the same but don't consist of the same characters, as shown in the example code on MDN. –  Jul 21 '20 at 11:59
  • 8
    Baffling that anyone could write "documentation" for `normalize()` - a function that works with Unicode strings - by demonstrating pure-ASCII strings... – Niet the Dark Absol Jul 21 '20 at 12:03

5 Answers

6

As stated in the MDN documentation, String.prototype.normalize() returns the Unicode Normalization Form of the string. This is needed because in Unicode some characters can be represented by more than one sequence of code points.

This is the example (taken from MDN):

const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';

console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false

const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');

console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true

As you can see, the string Amélie has two different Unicode representations. With normalization, we can reduce the two forms to the same string.
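
To see how the output varies with the argument, here is a small sketch using a sample string that is not from MDN: a precomposed "é" followed by the "ﬁ" ligature (both chosen only for illustration). The visible text barely changes, so the sketch prints the length of each result.

```javascript
// Sample chosen for illustration: U+00E9 (é) followed by U+FB01 (ﬁ ligature).
const sample = '\u00e9\ufb01';
const forms = ['NFC', 'NFD', 'NFKC', 'NFKD'];
const lengths = {};

for (const form of forms) {
  const normalized = sample.normalize(form);
  lengths[form] = normalized.length;
  console.log(form, normalized, normalized.length);
}
// NFC  keeps both characters as they are          -> length 2
// NFD  splits é into e + combining acute          -> length 3
// NFKC keeps é but replaces the ligature by "fi"  -> length 3
// NFKD does both                                  -> length 4
```

The canonical forms (NFC, NFD) only switch between composed and decomposed spellings of the same character, while the K forms also replace compatibility characters such as the ligature.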

Greedo
6

It depends on what you will do with the strings: often you do not need it (if you are just getting input from the user and showing it back to the user). But to check, search, or use such strings as keys, you may want a unique way to identify the same string (semantically speaking).

The main problem is that you may have two strings which are semantically the same but have two different representations: e.g. one with an accented character [one code point], and one with a character combined with an accent [one code point for the character, one for the combining accent]. Users may not be in control of how the input text is sent, so you may end up with two different user names or two different passwords. And if you process the data further, you may get different results depending on the initial string. Users do not like that.

Another problem is the unique order of combining characters. You may have an accent and a lower tail (e.g. a cedilla), and you can express this with several combinations: "pure char, tail, accent", "pure char, accent, tail", "char+tail, accent", "char+accent, tail".
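
A minimal sketch of this ordering issue, assuming a Latin "c" with a cedilla (the tail) and an acute accent as the example letter; the variable names are only illustrative, and escapes are used so the exact code points survive copy and paste.

```javascript
// Four ways to write "c with cedilla and acute accent" (letter chosen for illustration).
const variants = [
  'c\u0327\u0301', // pure char, tail (cedilla), accent
  'c\u0301\u0327', // pure char, accent, tail
  '\u00e7\u0301',  // char+tail (ç), accent
  '\u0107\u0327',  // char+accent (ć), tail
];

// Without normalization these are four distinct strings.
const distinctRaw = new Set(variants).size;

// NFC puts the combining marks in canonical order and composes what it can.
const distinctNfc = new Set(variants.map(v => v.normalize('NFC'))).size;

console.log(distinctRaw); // 4
console.log(distinctNfc); // 1
```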

And you may have degenerate cases (especially if you type from a keyboard): you may get code points which should be removed (you may have an arbitrarily long string which is equivalent to a few bytes).

In any case, for sorting strings, you (or your library) require a normalized form: if you already provide the right form, the library will not need to transform it again.

So: you want the same string (semantically speaking) to have the same sequence of Unicode code points.

Note: If you are working directly on UTF-8, you should also care about special cases of UTF-8: the same code point could be written in different ways [using more bytes]. Such overlong encodings are invalid UTF-8, and accepting them can be a security problem.

The K forms (NFKC and NFKD) are often used for searches and similar tasks: CO2 and CO₂ will be treated the same way. But this can change the meaning of the text, so it should usually be used only internally, for temporary tasks, while keeping the original text.
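
As a rough sketch of that idea: the helper name searchKey is hypothetical, and the lowercasing step is an extra assumption of mine, not something normalization does. The key is used only for matching, and the stored text is left as it was.

```javascript
const original = 'CO\u2082'; // "CO₂" written with U+2082 SUBSCRIPT TWO
const query = 'co2';

// Hypothetical helper: build a key used only for matching, never for display.
function searchKey(text) {
  return text.normalize('NFKC').toLowerCase();
}

console.log(original.normalize('NFC') === 'CO2');      // false: NFC keeps the subscript
console.log(searchKey(original) === searchKey(query)); // true
console.log(original.length);                          // 3: the stored text is untouched
```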

Giacomo Catenazzi
4

Very beautifully explained here --> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

Short answer: characters are represented through an encoding scheme such as ASCII or UTF-8 (we mostly use UTF-8), and some characters have more than one representation. So two strings may render identically while their underlying code points differ, and a string comparison between them may fail. We use normalize() to convert them to a single representation.

// source from MDN

let string1 = '\u00F1';                           // ñ
let string2 = '\u006E\u0303';                     // ñ

string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');

console.log(string1 === string2);                 // true
console.log(string1.length);                      // 1
console.log(string2.length);                      // 1
Deekshith Anand
2

String normalization isn't exclusive to JavaScript; Python, for instance, has it too. The valid values for the argument are defined by the Unicode standard (more on Unicode normalization).

When it comes to JavaScript, note that the documentation refers to it both as String.normalize() and as String.prototype.normalize(). As @ChrisG mentions:

String.prototype.normalize() is correct in a technical sense, because normalize() is a dynamic method you call on instances, not the class itself. The point of normalize() is to be able to compare Strings that look the same but don't consist of the same characters, as shown in the example code on MDN.

As for its usage, I found a great example of String.normalize():

let s1 = 'sabi\u00e1';  // "sabiá" with a precomposed á (NFC)
let s2 = 'sabia\u0301'; // "sabiá" with a + combining acute accent (NFD)

// one is in NFC, the other in NFD, so they're different
console.log(s1 == s2); // false

// with normalization, they become the same
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true

// transform string into array of codepoints
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }

// printing the codepoints you can see the difference
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]

So while sabiá and sabiá in this example look the same to the human eye, even when printed with console.log(), comparing them without normalization reports them as different. Analyzing the code points shows why: the two strings are different sequences.

Tiago Martins Peres
1

There are some great answers here already, but I wanted to throw in a practical example.

I enjoy Bible translation as a hobby. I wasn't too thrilled with the flashcard options out there in the wild in my price range (free), so I made my own. The problem is, there is more than one way to write Hebrew and Greek in Unicode and get the exact same thing. For example:

בָּא
בָּא

These should look identical on your screen, and for all practical purposes they are identical. However, the first was typed with the qamats (the little t shaped thing under it) before the dagesh (the dot in the middle of the letter) and the second was typed with the dagesh before the qamats. Now, since you're just reading this, you don't care. And your web browser doesn't care. But when my flashcards compare the two, then they aren't the same. To the code behind the scenes, it's no different than saying "center" and "centre" are the same.

Similarly, in Greek:

ἀ
ἀ

These two should look nearly identical, but the top is one Unicode character and the second one is two Unicode characters. Which one is going to end up typed in my flashcards is going to depend on which keyboard I'm sitting at.

When I'm adding flashcards, believe it or not, I don't always type in vocab lists of 100 words. That's why God gave us spreadsheets. And sometimes the places I'm importing the lists from do it one way, and sometimes they do it the other way, and sometimes they mix it. But when I'm typing, I'm not trying to memorize the order in which the dagesh or qamats appear, or whether the accents are typed as a separate character or not. Regardless of whether I remember to type the dagesh first, I want to get the right answer, because really it's the same answer in every practical sense either way.

So I normalize the order before saving the flashcards and I normalize the order before checking it, and the result is that it doesn't matter which way I type it, it comes out right!
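
A minimal sketch of that check, not the actual flashcard code: the helper name sameAnswer is hypothetical, NFC is assumed as the form, and the strings are written with escapes so the two spellings stay distinct.

```javascript
// Hypothetical helper: compare a stored answer with what the user typed.
function sameAnswer(stored, typed) {
  return stored.normalize('NFC') === typed.normalize('NFC');
}

// Hebrew: bet + qamats + dagesh + alef vs. bet + dagesh + qamats + alef.
const qamatsFirst = '\u05d1\u05b8\u05bc\u05d0';
const dageshFirst = '\u05d1\u05bc\u05b8\u05d0';

// Greek: precomposed alpha with psili vs. alpha + combining comma above.
const greekOne = '\u1f00';
const greekTwo = '\u03b1\u0313';

console.log(qamatsFirst === dageshFirst);          // false
console.log(sameAnswer(qamatsFirst, dageshFirst)); // true
console.log(greekOne === greekTwo);                // false
console.log(sameAnswer(greekOne, greekTwo));       // true
```

Normalizing on both sides, when saving and when checking, means it does not matter which form an imported list or a particular keyboard produced.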

If you want to check out the results:

https://sthelenskungfu.com/flashcards/

You need a Google or Facebook account to log in, so it can track progress and such. As far as I know (or care) only my daughter and I currently use it.

It's free, but eternally in beta.

the Tin Man