5

My std::strings are encoded in UTF-8, so the std::string < operator doesn't cut it. How can I compare two UTF-8-encoded std::strings?

Where it falls short is accents: é sorts after z, which it should not.

Thanks

jmasterx
  • 1
    Why does the standard `op<` not "cut it"? What ordering do you want? – CB Bailey Jan 06 '11 at 02:45
  • 1
    UTF-8-encoded strings sort in the same order as the equivalent UTF-32-encoded strings. – dan04 Jan 06 '11 at 02:46
  • 2
    @Charles: I believe it doesn't "cut it" because that just performs a byte-by-byte comparison, and doesn't take into account accents, etc. – user541686 Jan 06 '11 at 02:49
  • What I mean is, if I have 2 bytes representing a character, the < operator will think this is 2 separate characters. – jmasterx Jan 06 '11 at 02:50
  • 1
    @Milo assuming you want lexicographic comparison by Unicode code point, I believe that UTF-8 is structured in such a way that lexicographic comparison of the UTF-8 bytes will give you the same result. – Laurence Gonsalves Jan 06 '11 at 02:56
  • Right. If you want a lexicographical ordering, `operator<` does "cut it". If you want a different ordering (e.g., case-insensitive), then please tell us so. – dan04 Jan 06 '11 at 03:08
  • @dan04 where it does not cut it is for accents, é comes after z which it should not. – jmasterx Jan 06 '11 at 03:39
  • 1
    @Lambert: What do you mean by doesn't take into account accents? Do you mean that "small letter e" followed by "combining acute accent" should be sorted the same as "small letter e with acute accent", or that "small letter e" should sort the same as "small letter e with acute accent"? If the former, then you are talking about Unicode normalization; if the latter, then you need locale-aware collation. I was asking the original question asker because it wasn't clear what he wanted to use the sort for. `operator<` is suitable for a lot of use cases. – CB Bailey Jan 06 '11 at 09:51
  • @Charles: Yes, I was referring to Unicode normalization, since I imagined that that's what he was asking about. But I guess I was wrong. :) – user541686 Jan 06 '11 at 11:33
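
To make the normalization point in the comments above concrete, here is a minimal sketch showing that the two spellings of "é" are different byte sequences, so a byte-wise `operator==` or `operator<` treats them as different strings. The function names are hypothetical, chosen only for this illustration.

```cpp
#include <string>

// "é" as the single precomposed code point U+00E9 (UTF-8 bytes C3 A9).
std::string e_acute_precomposed() { return "\xC3\xA9"; }

// "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT (bytes 65 CC 81).
std::string e_acute_decomposed() { return "e\xCC\x81"; }

// Byte-wise equality, which is all that std::string's operator== checks.
bool same_bytes(const std::string &a, const std::string &b) {
    return a == b;
}
```

Making the two forms compare equal requires normalizing both strings to the same form first, which the standard library does not do; that is a separate step from locale-aware collation.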

4 Answers

6

If you don't want a lexicographic ordering by code point (which is what comparing the UTF-8 encoded strings byte by byte will give you), then you will need to decode your UTF-8 encoded strings into UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.

To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
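
As a rough illustration of that property, the sketch below decodes UTF-8 into code points and checks that byte-wise and code-point-wise comparison agree. The names `decode_utf8` and `orders_agree` are made up for this example, and the decoder assumes well-formed input (it does not validate continuation bytes, overlong forms, or surrogates).

```cpp
#include <cstddef>
#include <string>

// Decode well-formed UTF-8 into code points. Assumes valid input.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes to read
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// True when byte-wise and code-point-wise "less than" agree for a and b.
bool orders_agree(const std::string &a, const std::string &b) {
    return (a < b) == (decode_utf8(a) < decode_utf8(b));
}
```

For example, `orders_agree("z", "\xC3\xA9")` holds: "z" sorts before "é" both as bytes and as code points, which is exactly the ordering the question objects to.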

Update: Your updated question indicates that you want a more complex comparison function than purely a lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.

Greg Hewgill
  • Note that the UTF-16 encoding does *not* have that feature. – dan04 Jan 06 '11 at 03:04
  • @dan04: Does not have what feature? – Martin York Jan 06 '11 at 03:28
  • 3
    Collation (sorting) and encoding are two completely separate issues unless you're treating them as byte arrays ANSI style. http://www.joelonsoftware.com/articles/Unicode.html – Eugene Yokota Jan 06 '11 at 03:39
  • 1
    Yea but how do I compare them, is there a logical way to know that é comes before f and after e ? – jmasterx Jan 06 '11 at 03:51
  • 8
    Depends on your locale. In German, ö sorts before p. In Swedish, the same letter sorts at the end of the alphabet. – dan04 Jan 06 '11 at 04:10
  • 1
    @dan04 somehow, Windows succeeds at this for any locale – jmasterx Jan 06 '11 at 04:26
  • 6
    @Milo: In many languages 'é' does not come after 'e'; it sorts the same, so two words starting with these two letters sort based on what follows their initial letters. In some languages some accented letters sort differently from their unaccented counterparts, and some languages have digraphs that sort differently than the two characters that make them up would indicate. E.g. in Czech 'e' and 'ě' sort the same but 'č' sorts after 'c' and 'ch' sorts after 'h' (IIRC). See here: http://userguide.icu-project.org/collation and here http://www.unicode.org/reports/tr10/ for more details. – CB Bailey Jan 06 '11 at 10:03
  • 1
    @Milo: you have to tell Windows what locale to use to do the sort. By default you get the one set up in your ‘Regional and Language Options’. There is not one single general-purpose accent-aware sort that everyone agrees on; every locale has its own traditions (some have more than one; eg German has a special phone-book sort). Collation is an enormously complex and tricky area. – bobince Jan 08 '11 at 14:20
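
The remark in the comments above that UTF-16 lacks the order-preserving property can be checked with one pair of code points, U+FFFF and U+10000. This is only a sketch; the variable and function names are invented for the example.

```cpp
#include <string>

// U+FFFF and U+10000 in UTF-8: three bytes and four bytes respectively.
const std::string utf8_low  = "\xEF\xBF\xBF";
const std::string utf8_high = "\xF0\x90\x80\x80";

// The same code points in UTF-16: U+FFFF is one code unit, while
// U+10000 is the surrogate pair D800 DC00.
const std::u16string utf16_low  = {char16_t(0xFFFF)};
const std::u16string utf16_high = {char16_t(0xD800), char16_t(0xDC00)};

// Does comparing encoded units keep U+FFFF before U+10000?
bool utf8_keeps_order()  { return utf8_low < utf8_high; }
bool utf16_keeps_order() { return utf16_low < utf16_high; }
```

The UTF-8 comparison keeps the code point order, while the UTF-16 comparison reverses it, because the surrogate code unit 0xD800 is numerically smaller than 0xFFFF.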
6

The standard library has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort its whitespace-separated input words as desired.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
// Comparator that orders strings by a locale's collation rules.
class collate_in {
  protected:
    std::locale loc;  // holds the locale so the facet below stays valid
    const std::collate<char> &coll;
  public:
    explicit collate_in(const std::locale &l)
        : loc(l), coll(std::use_facet<std::collate<char> >(loc)) {}
    bool operator()(const std::string &a, const std::string &b) const {
        // std::collate::compare() takes two [begin, end) character ranges
        // and returns values like strcmp or strcoll.  Compare to 0 for
        // results expected for a less<>-style comparator.
        return coll.compare(a.c_str(), a.c_str() + a.size(),
                            b.c_str(), b.c_str() + b.size()) < 0;
    }
};
int main() {
    // Read whitespace-separated words from standard input.
    std::vector<std::string> v;
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // std::locale("") is the locale from the environment.  One could also
    // call std::locale::global(std::locale("")) to set up this program's
    // global locale first, and then use std::locale() to get it, or choose
    // a specific locale instead of using the environment's.
    std::sort(v.begin(), v.end(), collate_in(std::locale("")));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f

It's been brought to my attention that std::locale::operator()(a, b) exists, obviating the std::collate<>::compare(a, b) < 0 wrapper I wrote above.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
int main() {
    std::vector<std::string> v;
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // A std::locale object is itself a less<>-style string comparator.
    std::sort(v.begin(), v.end(), std::locale(""));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
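
One caveat with both programs: constructing a std::locale from a name the system does not provide throws std::runtime_error. A possible way to guard against that is sketched below; `locale_or_classic` and `sort_with_locale` are hypothetical helper names, and falling back to the classic "C" locale is just one assumption about what a program might want to do.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Return the named locale, or the classic "C" locale if the system does
// not provide it (the std::locale constructor throws in that case).
std::locale locale_or_classic(const char *name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error &) {
        return std::locale::classic();
    }
}

// Sort using the locale's collation rules. std::locale::operator()
// compares two strings through the locale's std::collate facet.
void sort_with_locale(std::vector<std::string> &v, const std::locale &loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

A caller might write `sort_with_locale(v, locale_or_classic("en_US.utf8"))`. Whether accented letters then sort as hoped still depends on that locale being installed, and the locale name itself is platform-specific.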
ephemient
1

One option would be to use ICU collators (http://userguide.icu-project.org/collation/api) which provide a properly internationalized "compare" method that you can then use to sort.

Chromium has a small wrapper that should be easy to copy and paste or reuse:

https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs

Miguel Garcia
1

The encoding (UTF-8, UTF-16, etc.) isn't the problem; what matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.

I found the question "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which could help you.

Eugene Yokota