Locale based sort in Javascript, sort accented letters and other variants in a predefined way

Question

In Finnish we sort W after V (as in English) but because W is not a native Finnish letter, it is considered as a variant of V, which is sorted as it were equal to V, but in cases where the only difference between two words is that V is W, then V-version is sorted first. An example enlights the proper order:

Vatanen, Watanen, Virtanen

In Finnish V and W are collated as A and Á. Á is sorted like A, but in cases where it is the only difference, the unaccented one comes first. The same rule is for all other accented letters, but the Å, Ä and Ö are collated separately after Z.

Question: What would be the best algorithm to sort this like variants in a predefined way? ( eg. [Watanen, Vatanen, Virtanen] to [Vatanen, Watanen, Virtanen] )?

Addition: The question is relevant to extend to cover also other variants in the way they are defined in http://cldr.unicode.org/index/cldr-spec/collation-guidelines, because the technique would in a great probability be the same and the answers to this question benefit the widest possible audience and sort algorithms can be made compatible with collation rules defined in Unicode CLDR. The Unicode CLDR defines three levels of differences between letters: primary level (base letters), secondary level (accented letters) and tertiary level (character case).

I have thought some kind of array preparation like in numerical sort where we could pad all numbers with zeros to make them comparaple as strings. An example: Array [file1000.jpg, file3.jpg, file22.jpg] can be prepared to make it comparable as strings by padding with zeros this way: [file1000.jpg, file0003.jpg, file0022.jpg]. Because of preparation of array, we can sort it very fast using native Array.sort().

The target language is Javascript, which lacks support for collation based sorts, so the custom sort function have to be made self. The algorithm is preferred, but if you have also code it's worth +1.

hippietrail · Answer 1 · 2017-09-04T23:52:41.490

Since the time you originally asked this question, JavaScript is finally acquiring some decent locale support, including for collation.

Read up on the new EcmaScript 6 / Harmony features Intl and, specifically, Intl.Collator.

The documentation doesn't actually make it very clear that both modern and traditional sort orders are supported for Finnish, but I've tried and they are.

To get a collator for the traditional order you need to pass a "fancy" language code string: fi-u-co-trad. For the "reformed" sort order there is fi-u-co-reformed. This breaks down as:

fi - ISO 639 language code for Finnish.
u - enables Unicode features / options. (not well documented)
co - collation options.
trad - traditional sort order. I read about this option for Spanish and only found it works for Finnish as well by testing. (not well documented)
reformed - reformed sort order. Seems to be an antonym for 'trad'. If you specify neither trad nor reformed you will get default, which may be trad on some browsers and reformed on others.

Teh codez:

var surnames = ['Watanen', 'Vatanen', 'Virtanen'];

var traColl = new Intl.Collator('fi-u-co-trad');
var refColl = new Intl.Collator('fi-u-co-reformed');
var defColl = new Intl.Collator('fi');

console.log('traditional:', traColl.resolved.requestedLocale + ' -> ' + traColl.resolved.collation, surnames.sort(function (a, b) {
  return traColl.compare(a,b);
}));

console.log('reformed:', refColl.resolved.requestedLocale + ' -> ' + refColl.resolved.collation, surnames.sort(function (a, b) {
  return refColl.compare(a,b);
}));

console.log('default:', defColl.resolved.requestedLocale + ' -> ' + defColl.resolved.collation, surnames.sort(function (a, b) {
  return defColl.compare(a,b);
}));

Outputs:

traditional: fi-u-co-trad -> trad ["Vatanen", "Watanen", "Virtanen"]
reformed: fi-u-co-reformed -> reformed ["Vatanen", "Virtanen", "Watanen"]
default: fi -> default ["Vatanen", "Virtanen", "Watanen"]

Tested in Google Chrome, which, from what I read online, is lagging behind Firefox in this stuff.

Thanks a lot! Quickly made a test-bed: http://jsbin.com/kijiji/3/. It seems that (as you said) this works in Chrome, but FF doesn't support 'fi-u-co-trad' (it is the same as 'fi' of Chrome) and IE11 shows 'fi' as 'fi-u-co-trad' of Chrome. Conclusion: using 'fi' produces different sort orders in different browsers, which means that browser/feature sniffing or external library or server side library is needed to ensure consistent sorting in browsers. Of course the execution speed may be an issue when using other than native solutions. — Timo Kähkönen, Apr 26 '15 at 02:13
I would highly recommend filing bug reports against the behaviour in the other browsers. Or bring it to the attention of some forum or mailing list used by the people organizing ES6 Harmony, especially if they have a localization subgroup. — hippietrail, Apr 26 '15 at 02:41
@Timo I've found out there is also an "antonym" to `trad` called `reformed`. I wouldn't be surprised if some browsers interpret this three-way choice in different ways: http://unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data — hippietrail, Apr 27 '15 at 11:33

score 10 · Answer 2 · answered Mar 21 '17 at 12:57

I had this problem today and came across String.prototype.localeCompare. You can use it with arr.sort() and specify a locale:

var names = ['Andrea', 'Ándrea', 'Àndrea', 'Äiti', 'Özmir', 'åke', 'Zorro', 'Åke'];

// Undesired order:
names.sort()
console.log('default sort', names);

// Desired order:
names.sort((nameA, nameB) => nameA.localeCompare(nameB, 'fi') > 0);
console.log('locale sort', names);

// Or since positive values are truthy, you can omit the `> 0`:
names.sort((nameA, nameB) => nameA.localeCompare(nameB, 'fi'));

// You can also control whether upper or lower case should sort first:
names.sort((nameA, nameB) => nameA.localeCompare(nameB, 'fi', {
  caseFirst: 'upper'
}));
console.log('locale sort with caseFirst option', names);

Looks like the caseFirst option works in Chrome but not in Firefox at the moment.

The MDN page contains more info and available options. localeCompare seems to work fine with sorting Finnish strings, and I suppose it works with many other locales as well.

rici · Accepted Answer · 2014-06-20T15:29:24.250

The usual approach to this problem is to use a list of mappings (normally the list doesn't need to be longer than three, and in your case two would do.) Each mapping maps a character onto a sequence point. [Note 3] So in your example,

 primary:      secondary:
  A -> 0         A -> 0
  Á -> 0         Á -> 1
  B -> 1         (irrelevant)
  C -> 2
  D -> 3
  E -> 4
...
  T -> 20
  U -> 21
  V -> 22        V -> 0
  W -> 22        W -> 1
  X -> 23
...

The comparison algorithm essentially first translates each character in the words to using mapping1, and if they are not the same, uses that as the comparison. If they are the same, then it repeats using mapping2 (and so on).

Not all languages are so simple, so there are a bunch of variations (for example, you might reverse the strings in pass 2).

Note that you can achieve the same effect by making comparison keys consisting of the concatenation of the translations. If you do a lot of comparisons, caching this key can be a win. In that case, you would use a special value in the mappings other than the first mapping for "irrelevant". All irrelevant codes can be omitted, which often shortens the comparison key quite a bit.

For example, in your example (but just upper case because it would be tedious to type the whole mapping sequence), we would translate VATANEN using the first mapping to [22, 1, 20, 1, 15, 5, 15] and with the second mapping to [0, 0, --, 0, --, --, --]. WATANEN would be [22, 1, 20, 1, 15, 5, 15] (exactly the same) with the first mapping, and [1, 0, --, 0, --, --, --] with the second. So dropping the --'s [Note 1], the comparison keys would be:

VATANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 0, 0]
VÁTANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0] (if there were such a place)
WATANEN:  [22, 1, 20, 1, 15, 5, 15, 1, 0, 0]
VIRTANEN: [22, 9, ...]

This can be extended to more than two translation tables.

For example, many applications want to do something like case-insensitive sorting, but where character case makes a difference if there are no other differences (in English, this usually means putting words with upper-case letter before words which are all lower-case, but both options are plausible.)

So in the Finnish case, we could add a third translation table, where all upper case letters are translated to 0, all lower case letters are translated to 1, and all other characters are not translated. Some concatenated translations:

           -------primary---------  --2ary-  ------tertiary-----
VÁTANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Vátenen:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
WATANEN:  [22, 1, 20, 1, 15, 5, 15, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

It is not at all obvious that this order is "correct". Nor, indeed, is it obvious what "correct" means for most languages, aside from those which have official linguistic authorities. [Note 2] So the above should just be considered as an example of multi-level encoding, not a definitive guide to alphabetic order. In this case, the tertiary code consists of just a single bit, although there may still be languages (like Dutch) in which there are three cases for a few letters.

The above scheme does not contemplate digraphs and trigraphs, although they are reasonably easy to add, with care. (In the primary ordering, and possibly in secondary and tertiary as well, the digraph needs to have a single code for both characters.) Spanish, contrary to popular belief amongst non-Spanish programmers, has not been an example of this since 1994, almost twenty years ago, when the RAE decreed that 'ch' is alphabetized between 'cg' and 'ci', and not between 'c' and 'd' as it formerly was. I believe that some Dutch speakers still expect to find 'ij' and 'y' together, and Hungarians may still respect the complicated collection of digraphs and trigraphs that comprise their alphabet, but on the whole complicated mechanical schemes for alphabetical ordering are dying out, to be replaced with simple latin ordering, possibly complemented with secondary ordering for diacritics (French, Spanish, apparently Finnish, German dictionaries but not telephone books) or primary ordering of diacritics (Spanish ñ, Danish/Norwegian/Swedish vowels, Turkish).

[Note 1]: It's not necessary to insert "irrelevant" secondary codes because the secondary part of the encoding is only consulted for pairs of words where the primary parts are identical. Since any letter considered irrelevant for the secondary coding would be so considered in all words in a primary equivalence class, it can just be omitted from the secondary coding. Similarly, it is legitimate to reuse codes in different primary equivalence classes, as we do above: [v, w] are [0, 1] and so are [a, á]. Obviously, there is no possibility of ambiguity. Consequently, secondary encodings can be quite short, both in sequence length and bit length.

[Note 2]: English has no such body; the Spanish one is the Real Academia Española, but I couldn't find precise collation rules in any of the RAE's publications on my bookshelf, aside from the laconic observation that accents aren't considered in alphabetical order. However, the RAE's dictionary seems to consistently put unaccented words before any accented word with the same letters, at least in the two cases I could think of -- papa/papá and sabana/sábana.

[Note 3] Of course, we need to keep track of the originals as well, so we have to somehow attach the comparison keys to the strings. As long as no two characters have the same translation in all of the mappings, this can be done with a simple hash table using the comparison key as a key.

Thanks for this! I interested in that idea of "comparison keys consisting of the concatenation of the translations"? Could you explain it more? — Timo Kähkönen, Sep 27 '12 at 16:14
This sounds good, but I don't fully understand the idea of comparison keys. What would be the comparison key for WATWANEN or WATVANEN? — Timo Kähkönen, Sep 27 '12 at 18:03
Is this "comparison key" functionality same as "sort key" in http://userguide.icu-project.org/collation/architecture (Please see section Sort keys)? — Timo Kähkönen, Sep 27 '12 at 18:25
@Timo, Yes, exactly. That looks like a good explanation of the concept, too. — rici, Sep 27 '12 at 19:03
But I don't understand why you have those "0,0,0]", "0,1,0]" and "1,0,0]" at the end of the sort keys? Should they be after the letter in question? This way: VATANEN: [22, 0, 0, 0, 1, 20, 1, 15, 5, 15]? — Timo Kähkönen, Sep 27 '12 at 19:15
Look at the paragraph before that. The sort key is the translation by the primary mapping concatenated with the translation by the secondary mapping. I just dropped all the -- entries in the secondary mapping, which is legitimate because it will only get examined for a pair of strings if those strings match perfectly on the primary mapping. The same logic allows me to recycle secondary mappings between different equivalence classes so that I can use {0, 1} for secondary {V, W} as well as for {A, Á}. Try it on paper :) — rici, Sep 27 '12 at 19:32
What would be the sort keys of WATWANEN and WATVANEN? The algorithm should perform also cases where there are more than one variant characters. — Timo Kähkönen, Sep 27 '12 at 20:07
[22, 1, 20, 22, 1, 15, 5, 15, 1, 0, 1, 0] and [22, 1, 20, 22, 1, 15, 5, 15, 1, 0, 0, 0] (15 is N) — rici, Sep 27 '12 at 20:09
And to make these comparable as strings strict to use in Array.sort(), the comparable strings would be "22,01,20,22,01,15,05,15,01,00,01,00" and "22,01,20,22,01,15,05,15,01,00,00,00"? — Timo Kähkönen, Sep 27 '12 at 20:16
That, or any suitable more compact encoding. You could do a lot better. — rici, Sep 27 '12 at 20:30
I tested these and the sort order was right. So the primary (base letter) difference and secondary (accent) difference has a solution. But how about tertiary (case) difference, between "a" and "A"? — Timo Kähkönen, Sep 27 '12 at 20:40
Could you extend your answer to cover also tertiary (case) level difference? — Timo Kähkönen, Sep 28 '12 at 16:39

Bergi · Answer 4 · 2012-09-27T16:57:32.220

I think this should do it:

var variants = ["AÁÀ", "VW", … ];

// Build a map that links variants with their base letter (for quick access)
var map = {}, chars = "";
for (var i=0; i<variants.length; i++) {
    var variant = variants[i], char = variant.charAt(0);
    for (var j=1; j<variants[i].length; j++)
        map[variant.charAt(j)] = char;
    chars += variant.substr(1);
}
// and a simple regular expression, containing a character class of all variant chars
var regex = new RegExp("["+chars+"]","g");

function sortFinnish(arr) {
    // each word is replaced by an array [literal],
    // containing 0) the word 1) the normalized word
    for (var i=0; i<arr.length; i++)
        arr[i] = [ arr[i], arr[i].replace(regex, function(m) {
            // every variant character is replaced by its base letter
            return map[m];
        }) ];
    // then sort that array with a custom compare function:
    arr.sort(function(a, b) {
        // at first by the normalized words,
        // i.e. variants count the same as their bases
        if (b[1] > a[1]) return -1;
        if (b[1] < a[1]) return 1;
        // else the normalized words are the same
        // - return a comparsion of the actual words
        if (b[0] > a[0]) return -1;
        if (b[0] < a[0]) return 1;
        return 0;
    });
    // after that, replace each of the arrays with the actual word again
    for (var i=0; i<arr.length; i++)
        arr[i] = arr[i][0];
    return arr;
}

@performance: Okay, I found a way to use .sort() without a custom compare function, which could be even a bit faster [in some environments] according to http://jsperf.com/sort-mapped-strings. The trick is to use objects with a .toString() method that returns the string-to-be-sorted-by:

    function SortString(actualvalue) {
        this.val = actualvalue;
        // the value-to-sort-by is a normalized version, concatenated by a space
        // with the actual value so that the actual value is compared when the
        // normalized ones are the same.
        // ! does not work with values that contain spaces !
        // we'd need to use something like \u0001 instead
        var sortval = actualvalue.replace(regex, function(m) {
            // every variant character is replaced by its base letter
            return map[m];
        }) + " " + actualvalue;
        this.toString = function(){ return sortval; };
    }
    for (var i=0; i<arr.length; i++)
        arr[i] = new SortString(arr[i]);
    // when comparing, the sortstring is used as the object's representation:
    arr.sort();
    // after that, replace the objects with the actual words again:
    for (var i=0; i<arr.length; i++)
        arr[i] = arr[i].val;

Thanks for quick response and code! Not tested it yet, but could you explain the algorithm in plain English eg. adding the algorithm before or after your code? — Timo Kähkönen, Sep 27 '12 at 15:44
This is not a algorithm (yet), and I appreciate more algorithm proposals. Seems that there is very much code in the actual sort()-function. Is it possible to prepare array beforehand so that we can rely entirely on native sort()-function and it's speed? — Timo Kähkönen, Sep 27 '12 at 16:10
Isn't the mapping algorithm enough? And yes, we do prepare the array beforehand, but we can't use `sort()` without a custom function or otherwise the actual words were lost. And there is really not much code in the compare function, actually it does nearly the same as the native approach. — Bergi, Sep 27 '12 at 16:19
Memory is cheap, processing power and time are expensive. So is it possible to convert array to map which have both the original and the sort key and the sort is done by sort key? The original values are then preserved. In numerical comparison example in my question, the array is modified for sorting purposes, and in that case we have to use some sort of map or object. — Timo Kähkönen, Sep 27 '12 at 16:34
Yes, only that not all mappings can be mapped back to the original value, so you need to pass that somehow along with the sort-string. — Bergi, Sep 27 '12 at 16:46
I tried the first of code examples: http://jsfiddle.net/Zz3GC/ . Otherwise the order is as expected, but Á (Aacute) should be before À (Abreve). At least they are in this order in variants-array. — Timo Kähkönen, Sep 28 '12 at 00:00
The `variants` array defines no order. It just lists groups of chars that are considered to be the same, and are sorted like the first character in the string. I might should have used an object instead of array and string, which was misleading. To sort `À` after `Á` (which has a higher unicode code point) you would need an additional mapping, or a [comparison by your own alphabet](http://stackoverflow.com/q/12606722/1048572). — Bergi, Sep 28 '12 at 00:54

score 0 · Answer 5 · answered Sep 29 '12 at 12:10

The standard way to multilevel sorting is expressed here: http://unicode.org/reports/tr10/

The principle is to use locale based tailorings to override the order expressed in the Default Unicode Collation Element Table (DUCET, http://www.unicode.org/Public/UCA/latest/allkeys.txt ). DUCET is the base sort order of characters. Tailorings are needed if locale has some special rules that cannot or are not performance-wise to implement in DUCET.

The directory core/common/collation/ in http://unicode.org/Public/cldr/22/core.zip has 87 xml-files. An example of Finnish tailorings in fi.xml:

<collation type="standard" >
    <rules>
    <!-- SNIP -->
        <reset>V</reset>
            <s>w</s>
            <t>W</t>
    <!-- SNIP -->
    </rules>
</collation>

Locale-based sort is fairly tedious to implement, and in order to be fast enough, it is required to use of resources at the lowest possible (machine) level, so I'm coming round to the view that it is best to wait until Javascript supports it natively.

But may be waiting never ends: Javascript lacks still support for numerical sort, which should be very easy to implement at machine level.

If someone coder has enough motivation to implement locale based sort in Javascript, I'll be glad to see results and support it at my side.

Locale based sort in Javascript, sort accented letters and other variants in a predefined way

5 Answers5

Linked