35

A while ago, someone with high reputation here on Stack Overflow wrote in a comment that it is necessary to cast a char argument to unsigned char before calling std::toupper and std::tolower (and similar functions).

On the other hand, Bjarne Stroustrup does not mention the need to do so in The C++ Programming Language. He just uses toupper like this:

string name = "Niels Stroustrup";

void m3() {
  string s = name.substr(6,10);  // s = "Stroustrup"
  name.replace(0,5,"nicholas");  // name becomes "nicholas Stroustrup"
  name[0] = toupper(name[0]);   // name becomes "Nicholas Stroustrup"
}

(Quoted from said book, 4th edition.)

The reference says that the input needs to be representable as unsigned char. To me, this sounds like it holds for every char, since char and unsigned char have the same size.

So is this cast unnecessary or was Stroustrup careless?

Edit: The libstdc++ manual mentions that the input character must be from the basic source character set, but does not cast. I guess this is covered by @Keith Thompson's reply: do they all have a non-negative representation as both signed char and unsigned char?

Baum mit Augen

5 Answers

38

Yes, the argument to toupper needs to be converted to unsigned char to avoid the risk of undefined behavior.

The types char, signed char, and unsigned char are three distinct types. char has the same range and representation as either signed char or unsigned char. (Plain char is very commonly signed and able to represent values in the range -128..+127.)

The toupper function takes an int argument and returns an int result. Quoting the C standard, section 7.4 paragraph 1:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF . If the argument has any other value, the behavior is undefined.

(C++ incorporates most of the C standard library, and defers its definition to the C standard.)

The [] indexing operator on std::string returns a reference to char. If plain char is a signed type, and if the value of name[0] happens to be negative, then the expression

toupper(name[0])

has undefined behavior.

The language guarantees that, even if plain char is signed, all members of the basic character set have non-negative values, so given the initialization

string name = "Niels Stroustrup";

the program doesn't risk undefined behavior. But yes, in general a char value passed to toupper (or to any of the functions declared in <cctype> / <ctype.h>) needs to be converted to unsigned char, so that the implicit conversion to int won't yield a negative value and cause undefined behavior.
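A minimal sketch of the conversion described above. The helper names `to_upper_safe` and `capitalize` are hypothetical and not part of any library; the point is only that the value reaches std::toupper as an unsigned char, so the implicit conversion to int cannot produce a negative number.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not a standard function): converts the char to
// unsigned char first, so the int passed to std::toupper is never negative.
inline char to_upper_safe(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: upper-cases the first character of a string in place.
// Does nothing for an empty string.
inline void capitalize(std::string& s) {
    if (!s.empty()) {
        s[0] = to_upper_safe(s[0]);
    }
}
```

With these helpers, the line from the question could be written as `name[0] = to_upper_safe(name[0]);` and would stay well defined even if name held characters outside the basic character set.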

The <ctype.h> functions are commonly implemented using a lookup table. Something like:

// assume plain char is signed
char c = -2;
c = toupper(c); // undefined behavior

may index outside the bounds of that table.
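To make the out-of-bounds risk concrete, here is a rough sketch of the index arithmetic a table-based implementation might use. The layout (one slot for EOF followed by one slot per unsigned char value) and the names `table_size`, `table_index`, and `index_in_bounds` are assumptions for illustration only; real library implementations differ.

```cpp
#include <climits>  // UCHAR_MAX

// Hypothetical layout: one slot for EOF plus one slot per unsigned char value.
constexpr int table_size = UCHAR_MAX + 2;

// Index such an implementation might compute for argument c,
// assuming EOF == -1 (typical, but not guaranteed by the standard).
constexpr int table_index(int c) { return c + 1; }

// True if the computed index falls inside the hypothetical table.
constexpr bool index_in_bounds(int c) {
    return table_index(c) >= 0 && table_index(c) < table_size;
}
```

Under these assumptions, an argument of -2 maps to index -1, which lies before the start of the table, while every value from 0 to UCHAR_MAX maps to a valid slot.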

Note that converting to unsigned:

char c = -2;
c = toupper((unsigned)c); // undefined behavior

doesn't avoid the problem. If int is 32 bits, converting the char value -2 to unsigned yields 4294967294. This is then implicitly converted to int (the parameter type), which probably yields -2.
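The difference between the two conversions can be checked directly. This sketch uses signed char explicitly so that the result does not depend on whether plain char is signed on a given platform; the variable names are made up for the example.

```cpp
#include <climits>  // UINT_MAX, UCHAR_MAX

// A negative character value, as could occur when plain char is signed.
constexpr signed char sample = -2;

// Conversion to unsigned int wraps modulo UINT_MAX + 1,
// giving a value far outside the range 0..UCHAR_MAX.
constexpr unsigned int as_unsigned = static_cast<unsigned int>(sample);

// Conversion to unsigned char wraps modulo UCHAR_MAX + 1,
// so the result is always within 0..UCHAR_MAX.
constexpr unsigned char as_uchar = static_cast<unsigned char>(sample);
```

With 8-bit char and 32-bit int, as_unsigned is 4294967294 and as_uchar is 254; only the second is a valid argument for toupper.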

toupper can be implemented so it behaves sensibly for negative values (accepting all values from CHAR_MIN to UCHAR_MAX), but it's not required to do so. Furthermore, the functions in <ctype.h> are required to accept an argument with the value EOF, which is typically -1.

The C++ standard makes adjustments to some C standard library functions. For example, strchr and several other functions are replaced by overloaded versions that enforce const correctness. There are no such adjustments for the functions declared in <cctype>.

Keith Thompson
  • 1
    I've given you a +1 cause the answer is good. But why are you quoting from the C standard in a C++ question? – Jonathan Mee Jun 03 '16 at 11:46
  • 5
    @JonathanMee: Good question. It's because C++ inherits most of C's standard library and defers its definition to the C standard. – Keith Thompson Jun 03 '16 at 15:27
  • I may have gotten impatient waiting for a response and opened a new question. I thought I should link it here in full disclosure: http://stackoverflow.com/questions/37614714/what-is-the-relationship-between-the-c-and-c-standards Feel free to chime in. – Jonathan Mee Jun 03 '16 at 15:35
  • 1
    The conversion back from `int` to `char` is implementation defined, isn't it? – L. F. Aug 28 '19 at 04:27
  • @L.F.: Only if the value is outside the range of `char`, which it almost certainly won't be given that the argument was a `char` value. – Keith Thompson Aug 28 '19 at 07:41
  • 1
    @KeithThompson I mean, suppose that `char` is signed and the value is -42. Then it is converted to `unsigned char` (213), and to `int` (213). Now isn't the result of `(char) 213` implementation defined? – L. F. Aug 28 '19 at 08:30
  • 1
    @L.F.: Yes, good point! (BTW, it's 214, not 213.) Or it can raise an implementation-defined signal, though I don't think any implementation does that. In practice, it's unlikely to cause any problems. – Keith Thompson Aug 28 '19 at 09:02
  • 1
    A minor point - you said "The [] indexing operator on std::string returns a char value." which is inaccurate. The `[]` operator returns a reference to char, not a char value. – aafulei Dec 22 '21 at 06:22
  • 1
    @aafulei Fixed, thanks. – Keith Thompson Dec 22 '21 at 22:38
  • I love this answer, and will personally be citing it when I see this bug (as we do), but it'd be even better if there were more emphasis on the erroneousness of an `unsigned` cast, since your answer kinda reads as the first paragraph and then some will look for something to copy/paste... They may see `c = toupper((unsigned)c);` and think this is the ultimate solution. I'd suggest a nice, clear title for each of **the wrong ways**, and then another clear title for **the right way** using `c = tolower((unsigned char) c);` as the example. – autistic Sep 25 '22 at 05:18
5

The reference is referring to the value being representable as an unsigned char, not to it being an unsigned char. That is, the behavior is undefined if the actual value is not between 0 and UCHAR_MAX (typically 255). (Or EOF, which is basically the reason it takes an int instead of a char.)
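The precondition can be written out as a predicate. The function name `is_valid_ctype_arg` is hypothetical; it is just a sketch of the rule quoted from the standard, not something the library provides.

```cpp
#include <climits>  // UCHAR_MAX
#include <cstdio>   // EOF

// Hypothetical helper: true if v is a valid argument for the <cctype>
// functions, i.e. representable as unsigned char or equal to EOF.
constexpr bool is_valid_ctype_arg(int v) {
    return (v >= 0 && v <= UCHAR_MAX) || v == EOF;
}
```

A plain char holding a negative value other than EOF fails this check once it has been promoted to int, which is exactly the case the cast to unsigned char avoids.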

chux - Reinstate Monica
Sneftel
  • 2
    As the parameter of `toupper` is an `int`, I think negative `char` values could cause UB. Any conversion from `int` to `unsigned char` happens internally in the function. – dyp Feb 16 '14 at 00:35
  • 1
    Nobody said that an `unsigned char` cannot represent values greater than 255. – Kerrek SB Apr 09 '15 at 21:40
  • 2
    @dyp "Any conversion from `int` to `unsigned char` happens internally in the function." --> Not quite as that may convert `EOF` to 255. _After_ coping with `EOF`, conversion to `unsigned char` would be reasonable, yet that behavior is not specified. – chux - Reinstate Monica Feb 22 '18 at 20:40
3

In C, toupper (and many other functions) take ints even though you'd expect them to take chars. Additionally, char is signed on some platforms and unsigned on others.

The advice to cast to unsigned char before calling toupper is correct for C. I don't think it's needed in C++, provided you pass it an int that's in range. I can't find anything specific about whether it's needed in C++.

If you want to sidestep the issue, use the toupper defined in <locale>. It's a template, and takes any acceptable character type. You also have to pass it a std::locale. If you don't have any idea which locale to choose, use std::locale(""), which is supposed to be the user's preferred locale:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>

int main()
{
    std::string name("Bjarne Stroustrup");
    std::string uppercase;

    std::locale loc("");

    std::transform(name.begin(), name.end(), std::back_inserter(uppercase),
                   [&loc](char c) { return std::toupper(c, loc); });

    std::cout << name << '\n' << uppercase << '\n';
    return 0;
}
Max Lybbert
  • 1
    Yes, it's correct for C. Why do you think the same thing doesn't apply to C++? – Keith Thompson Feb 16 '14 at 01:11
  • 4
    Its not needed in C either if you pass it an `int` in the first place. It *is* needed if you are passing a `char` in *either*. – WhozCraig Feb 16 '14 at 01:11
  • @KeithThompson I haven't checked the Standard, but frankly the reason I don't think the cast is needed for C++ is simply that I've only ever seen the advice to cast in C projects. It's possible that I simply haven't read the right articles, but I find it interesting that I've never seen a C++ expert mention the need for a cast while I have seen C experts mention it. – Max Lybbert Feb 16 '14 at 01:55
  • 4
    C++ includes most of the C standard library by reference (C++11 refers to the C99 library, but `<ctype.h>` hasn't changed much, if at all, from C90 to C99 to C11). There are a few cases where C++ makes changes to the C standard library, but I see no mention of any such changes to `<ctype.h>`. I think the C++ experts are just missing something. (`toupper(c)` is "safe" if its argument is known to be in the basic character set.) – Keith Thompson Feb 16 '14 at 03:37
1

Sadly, Stroustrup was careless :-(
And yes, the codes of Latin letters should be non-negative (so no cast is required for them)...
Some implementations work correctly without casting to unsigned char...
In my experience, it can cost several hours to track down the cause of a segfault in such a toupper call (even when you already know a segfault is happening)...
The same applies to isupper, islower, etc.

user3277268
  • Arguably, care*ful* - the example in the question uses only characters in the source character set. They are guaranteed to work whether `char` is signed or unsigned. – Toby Speight Jul 24 '18 at 13:57
0

Instead of casting the argument to unsigned char, you can cast the function. You will need to include the <functional> header. Here's some sample code:

#include <ctype.h>   // declares ::toupper in the global namespace
#include <string>
#include <algorithm>
#include <functional>
#include <locale>
#include <iostream>

int main()
{
    typedef unsigned char BYTE; // just in case

    std::string name("Daniel Brühl"); // used this name for its non-ascii character!

    std::transform(name.begin(), name.end(), name.begin(),
            (std::function<int(BYTE)>)::toupper);

    std::cout << "uppercase name: " << name << '\n';
    return 0;
}

The output is:

uppercase name: DANIEL BRüHL

As expected, toupper has no effect on non-ASCII characters in the default "C" locale. But this cast is beneficial for avoiding unexpected behavior, because every char is converted to unsigned char before it reaches toupper.
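A lambda that takes its parameter as unsigned char is a commonly seen alternative to wrapping the function in std::function. The sketch below is one way to write it; the helper name `to_upper_copy` is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns an upper-cased copy of s. Each char is
// converted to unsigned char by the lambda's parameter type before it
// is passed to std::toupper.
inline std::string to_upper_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}
```

The conversion happens where std::transform passes each element to the lambda, so the call site needs no explicit cast.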

polfosol ఠ_ఠ