
I wrote a library to create a crossword grid, and it works fine (at least as defined) for English words.

However, when I use Portuguese words like, for example, s1 = 'milhão' and s2 = 'sã', the function that tries to find an intersection between s1 and s2 fails if I use 'std::string'. I understand why: in UTF-8, 'ã' is encoded in 2 bytes, so the comparison between 's1[4]' and 's2[1]' fails.

If I use 'std::u16string' or 'std::wstring' the function works.

How can I safely compare strings letter by letter, without knowing whether each letter is encoded in a single byte or in multiple bytes? Should I always use 'std::u32string' if I want my programs to be ready to be used worldwide?

The truth is that I never had to worry about localization in my programs, so I am kind of confused.

Here is a program to illustrate my problem:

#include <cstdint>
#include <iostream>
#include <string>

void using_u16() {
  std::u16string _str1(u"milhão");
  std::u16string _str2(u"sã");

  auto _size1{_str1.size()};
  auto _size2{_str2.size()};

  for (decltype(_size2) _i2 = 0; (_i2 < _size2); ++_i2) {
    for (decltype(_size1) _i1 = 0; (_i1 < _size1); ++_i1) {
      if (_str1[_i1] == _str2[_i2]) {
        std::wcout << L"1 - 'milhão' met 'sã' in " << _i1 << ',' << _i2
                   << std::endl;
      }
    }
  }
}

void using_wstring() {
  std::wstring _str1(L"milhão");
  std::wstring _str2(L"sã");

  auto _size1{_str1.size()};
  auto _size2{_str2.size()};

  for (decltype(_size2) _i2 = 0; (_i2 < _size2); ++_i2) {
    for (decltype(_size1) _i1 = 0; (_i1 < _size1); ++_i1) {
      if (_str1[_i1] == _str2[_i2]) {
        std::wcout << L"2 - 'milhão' met 'sã' in " << _i1 << ',' << _i2
                   << std::endl;
      }
    }
  }
}

void using_string() {
  std::string _str1("milhão");
  std::string _str2("sã");

  auto _size1{_str1.size()};
  auto _size2{_str2.size()};

  for (decltype(_size2) _i2 = 0; (_i2 < _size2); ++_i2) {
    for (decltype(_size1) _i1 = 0; (_i1 < _size1); ++_i1) {
      if (_str1[_i1] == _str2[_i2]) {
        std::cout << "3 - 'milhão' met 'sã' in " << _i1 << ',' << _i2
                  << std::endl;
      }
    }
  }
}
int main() {
  using_u16();
  using_wstring();
  using_string();

  return 0;
}

As I explained, when calling 'using_string()' nothing is printed.

canellas
  • `std::u32string` will be your path of least resistance. – Sam Varshavchik Aug 20 '23 at 23:35
  • You first need to define what you consider "a letter", which is non-trivial in Unicode. And then the result will likely be that it doesn't match with a single unicode code point (i.e. a `char32_t`) either, regardless of how it is encoded. You'll then need a proper Unicode support library like ICU to handle this. – user17732522 Aug 21 '23 at 00:07
  • In particular `ã` may be either one or two unicode code points (U+00E3 or U+0061 followed by U+0303) and each then may be one or more code units depending on the encoding. – user17732522 Aug 21 '23 at 00:10

3 Answers


Depending on how you define a character, the requirements for string comparison change.

You could define a character as a single code point. Many special characters can be represented as one code point, and in that case std::u32string and char32_t are a good fit for your problem. The Rust language does the same with its chars() iterator, where every char is a 4-byte code point (Rust Docs). With the UTF-32 literals added in C++11 and a conversion between UTF-8 and UTF-32, you have all the necessary tools!
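To illustrate, here is a minimal sketch of the OP's intersection search over char32_t code points. It assumes the source file is UTF-8 and that 'ã' ends up in the literal as the precomposed code point U+00E3, so every letter is exactly one char32_t:

#include <cstddef>
#include <iostream>
#include <string>

int main() {
  // each letter is exactly one char32_t here, assuming the literals
  // store 'ã' as the precomposed code point U+00E3
  std::u32string s1{U"milhão"};
  std::u32string s2{U"sã"};

  for (std::size_t i2 = 0; i2 < s2.size(); ++i2) {
    for (std::size_t i1 = 0; i1 < s1.size(); ++i1) {
      if (s1[i1] == s2[i2]) {
        std::cout << "'milhão' met 'sã' in " << i1 << ',' << i2 << '\n';
      }
    }
  }
  return 0;
}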

But sometimes a character's representation needs multiple code points. Some characters even have ambiguous definitions, with multiple code point sequences for the same character. In that case you need more logic behind the comparison, and grapheme clusters group code points with a logical connection. For example, an e followed by a combining acute accent is grouped optically into a single é. For characters that have either only a single-code-point or only a multi-code-point form, comparing the graphemes would solve your problem. For the ambiguous characters with both single and multi code point representations, you need a simplification that converts the multi-code-point form to a single code point where a suitable representation exists. This procedure is called Unicode Normalization and provides a way to stabilize your characters.

Here is a demonstration of the concept in Rust with the unicode_normalization crate:

use unicode_normalization::UnicodeNormalization; // provides .nfc()

fn main() {
    let single_cp = "\u{E9}"; //é
    let multi_cp = "\u{65}\u{301}"; //é
    println!("== RAW ==");
    println!("Printed     : {} {}", single_cp, multi_cp);
    println!("Bytes       : {} {}", single_cp.bytes().len(), multi_cp.bytes().len());
    println!("Code Points : {} {}", single_cp.chars().count(), multi_cp.chars().count());

    let single_cp_norm = single_cp.nfc().to_string();
    let multi_cp_norm = multi_cp.nfc().to_string();
    println!("== NORMALIZED ==");
    println!("Printed     : {} {}", single_cp_norm, multi_cp_norm);
    println!("Bytes       : {} {}", single_cp_norm.bytes().len(), multi_cp_norm.bytes().len());
    println!("Code Points : {} {}", single_cp_norm.chars().count(), multi_cp_norm.chars().count());
}

Output:

== RAW ==
Printed     : é é
Bytes       : 2 3
Code Points : 1 2
== NORMALIZED ==
Printed     : é é
Bytes       : 2 2
Code Points : 1 1

The code analyses the single code point (left) and multi code point (right) representations of an optically identical character. In the RAW part you can clearly see that the byte and code point counts differ even though both are printed the same way. So a byte-by-byte comparison with std::string and a code point comparison with std::u32string are both ineffective. In the NORMALIZED part the multi code point representation was converted to a single code point, so both are equivalent, as indicated by the same byte and code point counts. After the normalization, the std::u32string approach would work correctly in all cases where simplification to a single code point is possible.

To also accommodate characters with strictly more than one code point, you can do a normalization first, followed by an equality check based on grapheme clusters. This way the ambiguous representations collapse into a unified form, and the remaining multi code point sequences can be compared. That is probably overengineered for your specific use case, though! Unicode normalization plus an equality check on the simplified code points should be sufficient.

I don't have first-hand experience with what an implementation would look like in C++, but in this Stack Overflow thread on Unicode normalization the lightweight libraries utfcpp for C++ and utf8proc for C were recommended. There is also a massive library called ICU that provides various Unicode operations, including logical character iteration with the BreakIterator and normalization with the Normalizer.
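As a hedged sketch of the normalization idea based on ICU's documented Normalizer2 API (assuming ICU is installed, linked with -licuuc, and a UTF-8 execution character set):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
  UErrorCode status = U_ZERO_ERROR;
  const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);

  // the same 'é' in its single and multi code point representations
  icu::UnicodeString single = icu::UnicodeString::fromUTF8("\u00e9");
  icu::UnicodeString multi = icu::UnicodeString::fromUTF8("e\u0301");

  bool raw_eq = (single == multi);
  bool nfc_eq = (nfc->normalize(single, status) ==
                 nfc->normalize(multi, status));
  std::cout << "raw equal:        " << raw_eq << '\n';  // 0
  std::cout << "normalized equal: " << nfc_eq << '\n';  // 1
  return 0;
}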

After this post you may realize that Unicode and localization are quite complex topics. They are far from being solved with only std::u32string in the mix. In the end you make your assumptions about the character source and its stability, and decide how capable your crossword library should be in handling these cases.

Thanks to @user17732522's feedback I improved the answer!

Stuntman11
  • A unicode code point is not going to correspond to what OP considers "a letter" in general, for example `ã` could be two code points, i.e. two `char32_t`. – user17732522 Aug 21 '23 at 00:06
  • @user17732522 yes thanks for the feedback. I didn't even think about this scenario ... I have updated my answer to clarify the issue! – Stuntman11 Aug 21 '23 at 00:40
  • "grapheme cluster" is too much for a crossword ("*ffi*" may be a grapheme cluster). OTOH (as Unicode and SIL tell you), there is no good terminology ("good" as in unique and understandable; standards also mix terms). – Giacomo Catenazzi Aug 21 '23 at 09:03
  • @GiacomoCatenazzi I totally agree. The grapheme cluster section was meant as a perspective on an extreme approach. I updated my answer to point out the overengineering for the OP's specific use case. – Stuntman11 Aug 21 '23 at 10:53

Although it's not guaranteed to fix every possible problem, you typically want to do a couple of things. First of all, your idea of using a u32string is a pretty good start.

Second, you typically want to do some form of normalization. As you've seen, Unicode allows many characters that include a grave, umlaut, or similar diacritic to be encoded in either of two separate ways: one is a single code point for something like "a with umlaut"; the other is two separate code points, one for "a" and the other for "combining umlaut". Normalization converts all of those to one form or the other, so regardless of how they started, you end up with them represented the same way.

Unicode Normalization Forms

For the task at hand, you probably want the "NFC" normalization, which is canonical decomposition followed by canonical composition. This will result in a character being represented by a single code point when possible, which tends to help make the comparison relatively easy.
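To make the two encodings concrete, here is a small illustration (not a normalizer itself; the NFC step needs a library such as ICU, mentioned below):

#include <cassert>
#include <string>

int main() {
  std::u32string precomposed{U"\u00E3"};  // 'ã' as a single code point
  std::u32string decomposed{U"a\u0303"};  // 'a' + U+0303 COMBINING TILDE
  assert(precomposed != decomposed);      // a naive comparison sees two different strings
  // after NFC, both would be the single code point U+00E3 and compare equal
  return 0;
}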

But it's still only relatively easy. Depending on what else (if anything) you need to do with your Unicode, you may want to consider using a library for the manipulation.

Unfortunately, the primary library for this is ICU, which is sort of equal parts wonderful and terrible. On one hand, it provides lots of capabilities, and can do what you're asking for (among many other things, some of them much more complex). On the other hand, it's written in a C++ style that most of us haven't used since the 1990s or so. And not even the late 1990s either. So although I suppose I'd use it if I needed the sort of stuff it provides, I'd probably grit my teeth every time I needed to touch it.

ICU

Another library that should probably suffice for your purposes, and makes much better use of reasonably modern C++, is named Ogonek. Unfortunately, its author seems to have lost interest; it hasn't been updated in around a decade, so the chances of bug fixes, improved documentation, etc., are minimal at best. Even so, it should support what you're asking about, and it would be my preferred choice for the (more limited) set of capabilities it provides.

Ogonek

Jerry Coffin

Warning: do not overgeneralize: languages are different, and so are crosswords. If you apply the rules of one language, the result will look weird in other languages (or in some other countries).

Contrary to some answers and comments, I would not use UTF-32, or in general the concept of code points (aka characters in this context); instead I would keep letters as strings: it gives you more flexibility.

As others noted, you must always normalize; you just have to decide on which normalization. Personally I would prefer a "decomposed" normalization (NFD) in such a case, so that the first code point is also a plain letter (useful for hints, errors, etc.). Some tools just let you write N or A, and they put the correctly accented letter into the crossword.
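A small sketch of why the decomposed form helps, assuming the string is already in NFD (the decomposition itself would come from a normalization library):

#include <iostream>
#include <string>

int main() {
  // 'ã' in decomposed (NFD) form: the base letter comes first
  std::u32string a_tilde{U"a\u0303"};  // U+0061 'a' + U+0303 COMBINING TILDE
  std::cout << (a_tilde[0] == U'a') << '\n';  // prints 1: a plain-letter hint matches
  return 0;
}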

Then, per language, you will decide your decomposition: from the word to its component parts, plus aliases. In some languages ll is considered a single entity (but l can be used separately, and worse, ll may be seen as two different characters in some cases). Some languages ignore accents in crosswords (e.g. Italian, where there is no standard way to write accents). Should the German Eszett (ß) be handled as one character, or as SS occupying two cells? So I think we should not handle characters but the possible contents of cells; a sketch of that idea follows.
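As a minimal sketch of that idea, each cell could own a small UTF-8 string, so multi-letter entries fit in one cell. The splitting shown here is hand-rolled for illustration; the real per-language rules are up to you:

#include <iostream>
#include <string>
#include <vector>

using Cell = std::string;  // a cell may hold "a", "ã", "ll", "ss", ...

bool cells_match(const Cell &a, const Cell &b) {
  return a == b;  // compare whole cells, not bytes or code points
}

int main() {
  // hand-split for illustration; real splitting rules are per language
  std::vector<Cell> llave{"ll", "a", "v", "e"};  // Spanish: "ll" fills one cell
  std::vector<Cell> casa{"c", "a", "s", "a"};
  std::cout << cells_match(llave[1], casa[1]) << '\n';  // prints 1: both cells hold "a"
  return 0;
}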

Giacomo Catenazzi