Lexicographical sorting for non-ascii characters

Question

I sort strings of ASCII characters lexicographically with the following code:

std::ifstream infile; // assumed to be opened elsewhere
std::string line, new_line;
std::vector<std::string> v;
while (std::getline(infile, line))
{
    // If line is empty, ignore it
    if (line.empty())
        continue;
    // Save the non-empty line (with its newline restored) in the vector
    new_line = line + "\n";
    v.push_back(new_line);
}
std::sort(v.begin(), v.end());

The result should be: a aahr abyutrw bb bhehjr cgh cuttrew ....

But I don't know how to sort lexicographically when the strings contain both ASCII and non-ASCII characters, in an order like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!

First you have to decide how you want to sort them. Does ‍♀️ come lexicographically before or after ☃? Write a function that carries out your decision. — Raymond Chen, Nov 16 '19 at 01:41
@RaymondChen I want ASCII characters to come before non-ASCII characters. Non-ASCII characters should follow a rule such as: `A, 'A, ^A, ~A... — Hector Ta, Nov 16 '19 at 01:46
In general, a `std::map` could be your solution (mapping the characters to the order index you intend). Then `std::sort` can be used with a custom predicate considering the order index for comparison. However, "`A" is not a character, these are two. Or did you mean "À"? Please, [edit] your question to clarify. (Give an example of the order, you prefer, involving non-ASCIIs.) — Scheff's Cat, Nov 16 '19 at 06:44
If you want to sort according to some specific locale, then you should use that locale for comparison. You need to know the encoding, the locale (language) and the operating system. The default locale "C" is usually only appropriate when you sort data for internal purposes (e.g. to find data), not when you display data to users in the alphabetical order of their own language, unless the data is limited to ASCII characters (emails, URLs, postal codes, identifiers in some programming languages, etc. typically use only ASCII characters). — Phil1970, Nov 16 '19 at 14:17

Scheff's Cat · Answer 1 · 2019-11-16T15:03:20.067

The OP didn't mention it, but I find it worth pointing out: when non-ASCII characters are involved, the encoding has to be considered as well.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Characters like À, Á, and Â are not part of 7-bit ASCII but are covered by a variety of 8-bit encodings, e.g. Windows-1252. It is therefore not guaranteed that a certain character (which is not part of ASCII) has the same code point (i.e. number) in every encoding. (Most characters have no number at all in most encodings.)

However, Unicode provides a single code table which contains all characters of any other encoding (I believe). It comes in several encoding forms:

UTF-8 where code points are represented by 1 or more 8 bit values (storage with char)
UTF-16 where code points are represented by 1 or 2 16 bit values (storage with std::char16_t or, maybe, wchar_t)
UTF-32 where code points are represented by 1 32 bit value (storage with std::char32_t or, maybe, wchar_t if it has sufficient size).
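The differing code unit counts can be checked at compile time. The following is a minimal sketch, not part of the original sample: the helper codeUnits() and the constant names are made up for illustration, and the two code points (U+00E4 for ä, U+1F600 for a character outside the Basic Multilingual Plane) are chosen arbitrarily. Universal character names are used so that the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating zero.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnits(const CharT (&)[N]) { return N - 1; }

// U+00E4 (a umlaut) lies inside the Basic Multilingual Plane.
constexpr std::size_t umlautUtf8  = codeUnits(u8"\u00E4");     // 2 x 8 bit
constexpr std::size_t umlautUtf16 = codeUnits(u"\u00E4");      // 1 x 16 bit
constexpr std::size_t umlautUtf32 = codeUnits(U"\u00E4");      // 1 x 32 bit

// U+1F600 lies outside the Basic Multilingual Plane.
constexpr std::size_t emojiUtf8   = codeUnits(u8"\U0001F600"); // 4 x 8 bit
constexpr std::size_t emojiUtf16  = codeUnits(u"\U0001F600");  // 2 x 16 bit (surrogate pair)
constexpr std::size_t emojiUtf32  = codeUnits(U"\U0001F600");  // 1 x 32 bit
```

A consequence for sorting: only in UTF-32 does one array element always correspond to one code point, so a per-element rank look-up is simplest there.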

Concerning the size of wchar_t, see: Character types.

That said, I used wchar_t and std::wstring in my sample so that the handling of umlauts does not depend on the locale or the platform.

By default, std::sort() orders a range of T elements with
bool operator<(const T&, const T&), the < operator for T.
However, there are overloads of std::sort() which take a custom predicate instead.

The custom predicate must match the signature and must provide a strict weak ordering relation.
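To illustrate what such a predicate looks like, here is a small sketch that is unrelated to character ranks: it orders strings by length first and by the default < second. The names shorterFirst() and sortedShorterFirst() are made up for this example. The important property is that the predicate returns false when both arguments are equal; a comparison with <= would violate the strict weak ordering requirement.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical predicate: shorter strings first, ties broken by operator<.
// Returns false for equal arguments, as a strict weak ordering requires.
bool shorterFirst(const std::string &a, const std::string &b)
{
  if (a.size() != b.size()) return a.size() < b.size();
  return a < b;
}

// Returns a copy of words sorted with the custom predicate.
std::vector<std::string> sortedShorterFirst(std::vector<std::string> words)
{
  std::sort(words.begin(), words.end(), shorterFirst);
  return words;
}
```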

Hence my recommendation to use a std::map which maps the characters to an index that results in the intended order.

This is the predicate I used in my sample:

  // sort words
  auto charIndex = [&mapChars](wchar_t chr)
  {
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
      ? iter->second
      : (CharMap::mapped_type)mapChars.size();
  };

  auto pred
    = [&charIndex](const std::wstring &word1, const std::wstring &word2)
  {
    const size_t len = std::min(word1.size(), word2.size());
    // compare the ranks over the common prefix of both words
    for (size_t i = 0; i < len; ++i) {
      const wchar_t chr1 = word1[i], chr2 = word2[i];
      const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
      if (i1 != i2) return i1 < i2;
    }
    return word1.size() < word2.size();
  };

  std::sort(words.begin(), words.end(), pred);

From bottom to top:

std::sort(words.begin(), words.end(), pred); is called with a third argument, the predicate pred, which implements my customized order.
The lambda pred() compares two std::wstrings character by character. The comparison uses the std::map mapChars, which maps wchar_t to unsigned, i.e. a character to its rank in my order.
mapChars stores only a selection of all character values, so the character in question might not be found in it. To handle this, the helper lambda charIndex() returns mapChars.size() in that case, which is guaranteed to be higher than all stored ranks.
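As a side note, the hand-written loop in pred() does what std::lexicographical_compare does with an element comparator: compare element by element and, if one range is a prefix of the other, treat the shorter one as smaller. A possible alternative formulation is sketched below. The free functions rankOf() and lessByRank() are names invented for this sketch and do not appear in the sample; CharMap is repeated so that the snippet stands alone.

```cpp
#include <algorithm>
#include <map>
#include <string>

typedef std::map<wchar_t, unsigned> CharMap;

// Rank of a character; characters missing in the map rank after all others.
unsigned rankOf(const CharMap &mapChars, wchar_t chr)
{
  const CharMap::const_iterator iter = mapChars.find(chr);
  return iter != mapChars.end()
    ? iter->second
    : (CharMap::mapped_type)mapChars.size();
}

// Same ordering as a character-by-character rank comparison,
// expressed with std::lexicographical_compare.
bool lessByRank(const CharMap &mapChars,
  const std::wstring &word1, const std::wstring &word2)
{
  return std::lexicographical_compare(
    word1.begin(), word1.end(), word2.begin(), word2.end(),
    [&mapChars](wchar_t chr1, wchar_t chr2) {
      return rankOf(mapChars, chr1) < rankOf(mapChars, chr2);
    });
}
```

It could be passed to std::sort() wrapped in a lambda that captures the map, e.g. [&mapChars](const std::wstring &a, const std::wstring &b) { return lessByRank(mapChars, a, b); }.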

The type CharMap is simply a typedef:

typedef std::map<wchar_t, unsigned> CharMap;

To initialize a CharMap, a function is used:

CharMap makeCharMap(const wchar_t *table[], size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

It has to be called with an array of strings which contains all groups of characters in the intended order:

const wchar_t *table[] = {
  L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

The complete sample:

#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static const wchar_t *table[] = {
  L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

static const wchar_t *tableGerman[] = {
  L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

typedef std::map<wchar_t, unsigned> CharMap;

// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;

// collect words and sort according to table
void printWordsSorted(
  const std::wstring &text, const wchar_t *table[], const size_t size)
{
  // make look-up table
  const CharMap mapChars = makeCharMap(table, size);
  // strip punctuation and other noise
  std::wstring textClean;
  for (const wchar_t chr : text) {
    if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
      textClean += chr;
    }
  }
  // fill word list with sample text
  std::vector<std::wstring> words;
  for (std::wistringstream in(textClean);;) {
    std::wstring word;
    if (!(in >> word)) break; // bail out
    // store word
    words.push_back(word);
  }
  // sort words
  auto charIndex = [&mapChars](wchar_t chr)
  {
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
      ? iter->second
      : (CharMap::mapped_type)mapChars.size();
  };
  auto pred
    = [&charIndex](const std::wstring &word1, const std::wstring &word2)
  {
    const size_t len = std::min(word1.size(), word2.size());
    // compare the ranks over the common prefix of both words
    for (size_t i = 0; i < len; ++i) {
      const wchar_t chr1 = word1[i], chr2 = word2[i];
      const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
      if (i1 != i2) return i1 < i2;
    }
    return word1.size() < word2.size();
  };
  std::sort(words.begin(), words.end(), pred);
  // remove duplicates
  std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
  words.erase(last, words.end());
  // print result
  for (const std::wstring &word : words) {
    std::cout << utf8_conv.to_bytes(word) << '\n';
  }
}

template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }

int main()
{
  // a sample string
  std::wstring sampleText
    = L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
      L" have the same lexicographical rank as their counterparts a, o, and u.\n";
  std::cout << "Sample text:\n"
    << utf8_conv.to_bytes(sampleText) << '\n';
  // sort like requested by OP
  std::cout << "Words of text sorted as requested by OP:\n";
  printWordsSorted(sampleText, table, size(table));
  // sort like correct in German
  std::cout << "Words of text sorted as usual in German language:\n";
  printWordsSorted(sampleText, tableGerman, size(tableGerman));
}

Output:

Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut

Live Demo on coliru

Note:

My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported in coliru.

@Phil1970 reminded me that I forgot to mention something else:

Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale dependent lexicographical ordering of strings.

The locale plays a role because the order of characters may differ between locales. The std::collate documentation has a nice example of this:

Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
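A std::locale object can be passed directly to std::sort() as the predicate because its operator() compares two strings with the locale's std::collate facet. The sketch below assumes this usage; the function names localeOrClassic() and sortedByLocale() are invented for illustration. Which locale names exist (e.g. "sv_SE.UTF-8") depends on the operating system and its installed locales, so the sketch falls back to the classic "C" locale when the constructor throws.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the named locale, or the classic "C" locale
// if the name is not known on this system.
std::locale localeOrClassic(const char *name)
{
  try {
    return std::locale(name);
  } catch (const std::runtime_error&) {
    return std::locale::classic();
  }
}

// Returns a copy of words sorted by the collation rules of loc.
// std::locale::operator() serves as the comparison predicate.
std::vector<std::string> sortedByLocale(
  std::vector<std::string> words, const std::locale &loc)
{
  std::sort(words.begin(), words.end(), loc);
  return words;
}
```

A call might look like sortedByLocale(words, localeOrClassic("sv_SE.UTF-8")). With narrow strings, the bytes are interpreted in the encoding of the chosen locale, so the data has to match it.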

Conversion UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit arithmetic. For conversion to/from any other encoding (except ASCII, which is a subset of Unicode), I would recommend a library, e.g. libiconv.
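As an illustration of the bit arithmetic, here is a sketch of the direction UTF-32 to UTF-8. It is not taken from the sample above (which uses std::wstring_convert); the function name toUtf8() is an assumption, and the sketch deliberately omits validation: surrogate values (U+D800 to U+DFFF) and values above U+10FFFF are not rejected.

```cpp
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Sketch only: invalid code points are not rejected.
std::string toUtf8(char32_t cp)
{
  std::string out;
  if (cp < 0x80) {            // 7 bits  -> 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {    // 11 bits -> 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                    // 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Convert a UTF-32 string by encoding its code points one by one.
std::string toUtf8(const std::u32string &text)
{
  std::string out;
  for (const char32_t cp : text) out += toUtf8(cp);
  return out;
}
```

On platforms where wchar_t holds UTF-32, a function like this could stand in for the deprecated std::codecvt_utf8 when printing; where wchar_t holds UTF-16, surrogate pairs would have to be combined into code points first.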

Seriously, **you don't want to define your own sort function**! It does not handle all languages, and one can easily forget some lesser-known rules or characters. One should almost always prefer a sort from a library. For example, how should one handle oe, œ, oé, ôé, ôe, oè etc.? In French, for example, the comparison without diacritics is done forward and then the diacritics are compared in backward order. Also, rules might get improved over time (for example, sorting in Windows File Explorer now knows how to sort numbers, like **picture 9** before **picture 10**). Otherwise, there are a lot of useful things in this answer. — Phil1970, Nov 16 '19 at 14:40
@Phil1970 You're correct. I forgot that I wanted to mention `std::locale::collate` as well. Sorting of numbers (e.g. in Explorer) has always amused me. Actually, I don't need this. If I want to have files sorted, I give them appropriate names (with leading 0s if necessary). ;-) — Scheff's Cat, Nov 16 '19 at 14:47
