21

Given a string foo, I've written answers on how to use cctype's tolower to convert its characters to lowercase:

transform(cbegin(foo), cend(foo), begin(foo), static_cast<int (*)(int)>(tolower));

But I've begun to consider locale's tolower, which could be used like this:

use_facet<ctype<char>>(cout.getloc()).tolower(data(foo), next(data(foo), foo.size()));
  • Is there a reason to prefer one of these over the other?
  • Does their functionality differ at all?
  • I mean other than the fact that tolower accepts and returns an int which I assume is just some antiquated C stuff?
Jonathan Mee
  • 12
    man, only c++ can make such easy things so difficult... – Willi Mentzel May 27 '16 at 11:32
  • 2
    why the `static_cast` ? Just do `std::transform(foo.cbegin(), foo.cend(), foo.begin(), ::tolower)`. Alternatively, consider [boost's `to_lower`](http://www.boost.org/doc/libs/release/doc/html/boost/algorithm/to_lower.html). – Sander De Dycker May 27 '16 at 11:50
  • 1
    @SanderDeDycker: yes, but he is asking the why! only one reason comes to my mind right now, which i posted as answer... but i guess there are more, maybe also considering performance. – Willi Mentzel May 27 '16 at 11:53
  • 1
    @progressive_overload With Great Power Comes Great Responsibility – exilit May 27 '16 at 11:53
  • @exilit converting a string to lowercase using a 86-liner is the most powerful thing i've ever seen in my life – Willi Mentzel May 27 '16 at 11:55
  • @progressive_overload : I made a comment, not an answer. I didn't claim to answer the OP's question. I pointed out an oddity, and suggested an alternative that I consider to be better than either of the suggested ones. – Sander De Dycker May 27 '16 at 11:59
  • @SanderDeDycker I see, but I would like to know the why as well. :) it is shorter... that is an advantage for sure, but is there more to it? – Willi Mentzel May 27 '16 at 12:00
  • @SanderDeDycker You can then check my answer here: http://stackoverflow.com/a/37438120/2642059 `::tolower` is implementation dependent, and I *always* try to avoid Boost. – Jonathan Mee May 27 '16 at 12:01
  • @progressive_overload : boost's `to_lower` is shorter, more readable, and has the option to pass in a locale as well. – Sander De Dycker May 27 '16 at 12:02
  • 3
    @SanderDeDycker Boost always has the massive drawback that you must include the Boost libraries. I recognize there is a place for Boost's convenience, but using it when C++ already provides you not 1 but 2 ways to accomplish this... well it doesn't make any sense to me. – Jonathan Mee May 27 '16 at 12:05
  • 1
    @progressive_overload I don't want to start a flame-war, just saying that: Sure, this specific task could be solved more easily, but on the other hand the STL provides you great flexibility (power). And sometimes what's an advantage in one case is a drawback in another. – exilit May 27 '16 at 12:13
  • 1
    @Alex Good catch I've looked at this question like 10 times today and missed it every time. You must program without Intelisense to have the eagle-eye to catch that ;) – Jonathan Mee May 27 '16 at 12:20
  • @JonathanMee : `::tolower` works fine with `#include <ctype.h>` - it's all about choices (I'd personally rather put these few functions in the global namespace than to have to deal with overload disambiguation). And about boost : many people want to avoid it as much as possible - I learned to embrace it, but to each their own. I prefer the readability advantage it provides, as well as the seamless support for non-ASCII encodings and locales. – Sander De Dycker May 27 '16 at 12:21
  • @JonathanMee : oh, and boost does not always require you to include boost libraries. Much of the boost functionality is headers only. Including the functionality I suggested. – Sander De Dycker May 27 '16 at 12:25
  • @SanderDeDycker The standard has deprecated `ctype.h`, hence the use of `cctype` which necessitates the `static_cast`. Anyway even though I don't want to include Boost, I recognize and share your readability concerns. The standard could do a lot better here. – Jonathan Mee May 27 '16 at 12:27
  • @JonathanMee : everyone makes their own choices. Unfortunately, my set of choices is incompatible with yours for this specific subject, so my suggestions weren't useful to you. I apologize. Hopefully they can be useful to someone else in the future :) – Sander De Dycker May 27 '16 at 12:33
  • 2
    Regardless of everything you have to cast to `uint8_t` or `unsigned char` before converting to `int` because otherwise you may get unwanted sign extension depending on your platform! – sehe May 27 '16 at 12:33
  • @sehe Can you elaborate, `string` works with `signed char`s; why would I want to cast to `unsigned char` when using `tolower`? – Jonathan Mee May 27 '16 at 12:35
  • @SanderDeDycker Please don't apologize. I sometimes work in solutions where Boost is already included if I need to do this in such a solution I'll go look up Boost's `tolower`. So you have provided me with some helpful guidance. It's just not the answer that I want for this question. – Jonathan Mee May 27 '16 at 12:37
  • 1
    @JonathanMee std::string uses `char` which may or may not be signed. – sehe May 27 '16 at 12:43
  • @JonathanMee thank post-review for not doing syntax highlighting. – Alexander Oh May 27 '16 at 12:47
  • @sehe Isn't `string` defined as `basic_string<char>`? So it will be signed? – Jonathan Mee May 27 '16 at 12:49
  • I already dissected everything you need to see what's wrong. – sehe May 27 '16 at 12:53
  • 1
    @JonathanMee `char` may or may not be signed, that is implementation defined. – Baum mit Augen May 27 '16 at 13:02
  • @BaummitAugen Hmmm, I'm not sure about that, "signed is default if omitted": http://en.cppreference.com/w/cpp/language/types#Modifiers – Jonathan Mee May 27 '16 at 13:09
  • @JonathanMee On the same page, see this passage: "`char` - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type)." – milleniumbug May 27 '16 at 13:15
  • @JonathanMee I am sure I'm right. You want me to find the standard quote or do you believe me? :) – Baum mit Augen May 27 '16 at 13:16
  • @BaummitAugen If I have to change everything I've believed about the `signed` modifier could you at least grace me with a citation from the standard? – Jonathan Mee May 27 '16 at 13:23
  • 3
    @JonathanMee Sure thing. ;) *"It is implementation-defined whether objects of `char` type are represented as signed or unsigned quantities. The `signed` specifier forces `char` objects to be signed; it is redundant in other contexts."* 7.1.6.2 [decl.type.simple] in N4140. – Baum mit Augen May 27 '16 at 13:26
  • The C classification functions require the input value to be representable by `unsigned char` or be equal to `EOF`. Thus calling them directly with plain `char` is invalid if it is signed and the value is negative. – T.C. May 27 '16 at 17:04
  • @T.C. So if I understand what you're saying correctly, if I am working with a `signed char[]` using `cctype`'s `tolower` is invalid o.O – Jonathan Mee May 27 '16 at 17:27
  • @JonathanMee Due to my quote above, that might even be true for plain `char[]`. See [this](https://stackoverflow.com/questions/21805674/do-i-need-to-cast-to-unsigned-char-before-calling-toupper). – Baum mit Augen May 27 '16 at 19:11
  • @BaummitAugen So because `locale`'s `tolower` works with `char`s not `int`s, it should be preferred then? That may be as good an argument as any as far as why I should choose one over the other. Are you interested in writing it up, if not I can. – Jonathan Mee May 31 '16 at 10:54
  • @JonathanMee I always just used the C one with the cast, non-trivial string handling was never in the scope of my work. Feel free to write it up and use the potential UB (which is an atrocity, I agree) as argument. – Baum mit Augen May 31 '16 at 21:44
  • @BaummitAugen Welp, I've done it. I've written up an answer citing basically the determining factor being whether you are willing to work with the cast. I expect to accept this tomorrow unless you have any showstopping comments or an answer of your own you'd like to add. – Jonathan Mee Jun 02 '16 at 13:37
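
The sign-extension point raised in the comments above can be sketched as follows. This is a minimal illustration, not code from any of the commenters: the helper names `tolower_arg` and `to_lower_char` are hypothetical, and it assumes 8-bit bytes. The idea is that a `char` is converted to `unsigned char` before it is widened to the `int` that cctype's `tolower` expects, so a negative value is never passed in.

```cpp
#include <cctype>

// Hypothetical helper: produces the int value that is safe to hand to
// std::tolower. Going through unsigned char means a byte such as 0xE9 becomes
// 233 rather than a negative number on platforms where plain char is signed.
int tolower_arg(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical helper: lowercases one char with the <cctype> function,
// converting on the way in and on the way out.
char to_lower_char(char c) {
    return static_cast<char>(std::tolower(tolower_arg(c)));
}
```

By contrast, `static_cast<int>(c)` on a platform with a signed `char` would yield a negative value for any byte above 0x7F, which is outside the range the C classification functions are specified for.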

3 Answers

6

Unfortunately, both are equally bad. Although std::string is often treated as if it were a UTF-8 encoded string, none of its methods or the related functions (including tolower) are really UTF-8 aware. So while tolower, with or without a locale, may work for single-byte characters (= ASCII), it will fail for every other set of languages.

On Linux, I'd use the ICU library. On Windows, I'd use the CharLower function (CharUpper for the opposite direction).
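
A rough sketch of why a byte-wise tolower cannot handle UTF-8, assuming the classic "C" locale; the helper name `lower_bytes` is made up for this illustration. The facet is applied to each `char` independently, so a character encoded as several bytes is never seen as one unit.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: runs the classic ("C") locale's ctype<char> facet over
// every byte of the string, which is all a char-based tolower can do.
std::string lower_bytes(std::string s) {
    const auto& facet = std::use_facet<std::ctype<char>>(std::locale::classic());
    if (!s.empty()) {
        // Range overload: lowercases each char in [first, last) in place.
        facet.tolower(&s[0], &s[0] + s.size());
    }
    return s;
}
```

For example, `lower_bytes("ABC")` lowercases as expected, but `lower_bytes("\xCE\xA9")` (the UTF-8 encoding of Ω) cannot produce `"\xCF\x89"` (ω), because the facet only ever looks at the bytes 0xCE and 0xA9 one at a time.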

user3104201
  • You're saying that `locale`'s `tolower` can't handle UTF-8 either? Hmmm, that would have been a good argument for it. – Jonathan Mee May 27 '16 at 12:52
  • @JonathanMee Unfortunately, C++ has no meaningful Unicode support in any sense. – Baum mit Augen May 27 '16 at 13:02
  • C++ sucks at this indeed, but are you meaning that in 2016 we can't even have a portable library to handle this ? – kebs May 28 '16 at 10:42
  • @BaummitAugen Are you guys sure about the lack of UTF-8 support? That's actually one of the things demonstrated in the http://en.cppreference.com/w/cpp/locale/tolower example. I haven't been able to come up with a way to make it fail with UTF-8, even when using "multi-byte characters". – Jonathan Mee Jun 01 '16 at 16:03
  • @JonathanMee [Rekt](http://melpon.org/wandbox/permlink/sqfCxno1uuqTMINX), output should be `ω`. – Baum mit Augen Jun 01 '16 at 16:19
  • @JonathanMee And then there is stuff like [this](http://melpon.org/wandbox/permlink/BGpUj2aBJi0APCRh) which should yield `SS`, but have fun building that with the normal `char` types. – Baum mit Augen Jun 01 '16 at 16:25
  • @BaummitAugen You are right :( The input I tested with was just `wchar_t` on Windows not UTF-8. UTF-8 is still broken. Return to your lives citizens. In other news, I happen to have done some personal research on the 'ß' character though. According to the standard it should *not* convert to "SS" nor to 'ẞ': http://stackoverflow.com/a/37571371/2642059 – Jonathan Mee Jun 01 '16 at 18:41
  • @JonathanMee Unicode says it should: *"German sharp s. The German sharp s character has several complications in case mapping. Not only does its uppercase mapping expand in length, but its default case-pairings are asymmetrical. The default case mapping operations follow standard German orthography, which uses the string “SS” as the regular uppercase mapping for U+00DF ß latin small letter sharp s."* Unicode 8 5.18 And Unicode is the standard that defines the behavior of UTF8, not some C++ standard. – Baum mit Augen Jun 01 '16 at 22:12
4

In the first case (cctype) the locale is set implicitly:

Converts the given character to lowercase according to the character conversion rules defined by the currently installed C locale.

http://en.cppreference.com/w/cpp/string/byte/tolower

In the second (locale's) case you have to set the locale explicitly:

Converts parameter c to its lowercase equivalent if c is an uppercase letter and has a lowercase equivalent, as determined by the ctype facet of locale loc. If no such conversion is possible, the value returned is c unchanged.

http://www.cplusplus.com/reference/locale/tolower/
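
The difference between the two forms can be sketched like this. The function names `lower_global` and `lower_explicit` are invented for the example; the calls themselves are the one-argument `std::tolower` from `<cctype>` and the two-argument `std::tolower` template from `<locale>`.

```cpp
#include <cctype>
#include <locale>

// <cctype> form: consults whichever C locale is currently installed
// (the one selected through std::setlocale). Nothing at the call site says which.
char lower_global(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// <locale> form: the locale is an explicit argument, so the caller decides
// which ctype facet performs the conversion.
char lower_explicit(char c, const std::locale& loc) {
    return std::tolower(c, loc);
}
```

A call such as `lower_explicit('Q', std::locale::classic())` always uses the "C" rules, whereas the result of `lower_global` can change if some other part of the program calls `setlocale`.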

Willi Mentzel
1

It should be noted that the language designers were aware of cctype's tolower when locale's tolower was created. The locale version improved on it in 2 primary ways:

  1. As is mentioned in progressive_overload's answer, the locale version allows the use of any ctype facet, even a user-modified one, without having to install a new LC_CTYPE via setlocale and then restore the previous LC_CTYPE
  2. From section 7.1.6.2[dcl.type.simple]3:

It is implementation-defined whether objects of char type are represented as signed or unsigned quantities. The signed specifier forces char objects to be signed

This creates the potential for undefined behavior with the cctype version of tolower if its argument:

Is not representable as unsigned char and does not equal EOF

So the cctype version of tolower requires an additional conversion of its input to unsigned char (and of its result back to char), yielding:

transform(cbegin(foo), cend(foo), begin(foo), [](const unsigned char i){ return tolower(i); });

Since the locale version operates directly on chars there is no need for a type conversion.

So if you don't need to perform the conversion with a different ctype facet, it simply becomes a style question: whether you prefer the transform with a lambda required by the cctype version, or the locale version's:

use_facet<ctype<char>>(cout.getloc()).tolower(data(foo), next(data(foo), size(foo)));
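
For reference, here are the two variants side by side as small self-contained functions. This is only a sketch: the names `lower_cctype` and `lower_facet` are hypothetical, and the locale is taken as a parameter defaulting to the global locale instead of `cout.getloc()`, which is an assumption made to keep the example free of stream state.

```cpp
#include <algorithm>
#include <cctype>
#include <locale>
#include <string>

// cctype version: the lambda parameter converts each char to unsigned char
// before it reaches std::tolower, and the result is narrowed back to char.
std::string lower_cctype(std::string foo) {
    std::transform(foo.cbegin(), foo.cend(), foo.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return foo;
}

// locale version: the ctype<char> facet works on char directly, over a
// [first, last) range, so no per-character conversion is written out.
std::string lower_facet(std::string foo, const std::locale& loc = std::locale()) {
    if (!foo.empty()) {
        char* first = &foo[0];
        std::use_facet<std::ctype<char>>(loc).tolower(first, first + foo.size());
    }
    return foo;
}
```

For plain ASCII input such as `"Hello World"` both functions are expected to return the same lowercase string; they can only diverge when the C locale and the `std::locale` passed in disagree.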
Jonathan Mee