21

Why do most string functions in the C/C++ stdlibs take char* pointers?

The signed-ness of char is not even specified in the standard, though most modern compilers (GCC, MSVC) treat char as signed by default.

When would it make sense to treat strings as (possibly) signed bytes? AFAIK there are no meaningful character values below zero in any character set. For certain string operations, the values must be cast to unsigned char anyway.

So why do the stdlibs use char*, even in C++-specific functions such as string::string(const char *)?

Unsigned
  • 2
    Note: Whether `char` is signed or not is implementation defined. – sepp2k Jun 24 '12 at 03:26
  • 53
    Your name suggests you're biased ;) – huon Jun 24 '12 at 03:27
  • 4
    Why null-terminated strings instead of a Pascal-style length-array pair? I'm sure someone will come up with fancy explanations, but it's clear that a lot of it will just boil down to historical and backward-compatibility issues. – hugomg Jun 24 '12 at 03:27
  • 1
    @dbaupp - Haha, nice one, I didn't even think of that! – Unsigned Jun 24 '12 at 03:28
  • 6
    The instructions on the PDP-11 dealing with bytes treated them as signed quantities, so that's how the early C compilers treated them, and unsigned didn't even exist. – Jim Balter Jun 24 '12 at 07:11
  • 2
    @missingno, part of the rationale was that having a length would force you either to limit strings to fewer than 256 characters or to accept a two-byte overhead, which would have been too much for most purposes at the time and on the machine where C was designed (which had a 64KB address space). – AProgrammer Jun 24 '12 at 08:16
  • I'd also like to know why `toupper`, `tolower` and so on take an integer as their argument. – Maxime Chéramy Jul 03 '13 at 15:24
  • @Maxime - I'd guess that `toupper`/`tolower` predate the `unsigned char` type, and therefore used `int` to be able to hold all possible character values `0-255` – Unsigned Jul 03 '13 at 15:36

7 Answers

10
  1. I'm pretty sure most of the string functions predate the existence of unsigned char.
  2. Plain char may be either a signed or an unsigned type. The C and C++ standards explicitly allow either one (it's always a separate type from either unsigned char or signed char, but has the same range as one or the other).
  3. While the C string functions use char *, std::string is what's used in most C++.
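
As a rough illustration of point 2 (a sketch added for clarity, not part of the original answer; `plain_char_is_signed` is a made-up helper name), the distinctness of the three character types and the implementation's choice of signedness can both be checked at compile time:

```cpp
#include <climits>
#include <limits>
#include <type_traits>

// Plain char is a distinct type from both signed char and unsigned char,
// even though it shares its range with exactly one of them.
static_assert(!std::is_same<char, signed char>::value,
              "char and signed char are different types");
static_assert(!std::is_same<char, unsigned char>::value,
              "char and unsigned char are different types");

// Hypothetical helper: reports which signedness this implementation chose.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Whichever choice was made, the range matches one of the explicit types.
static_assert(plain_char_is_signed()
                  ? (CHAR_MIN == SCHAR_MIN && CHAR_MAX == SCHAR_MAX)
                  : (CHAR_MIN == 0 && CHAR_MAX == UCHAR_MAX),
              "char has the range of signed char or of unsigned char");
```

Because the types are distinct, a function overloaded on `char`, `signed char` and `unsigned char` has three separate overloads on every implementation.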
Jerry Coffin
  • And of course there is no difference in memory between a pointer to signed/unsigned char – Martin Beckett Jun 24 '12 at 04:34
  • The string functions do predate the addition of 'unsigned' to the language, and the PDP-11 hardware made it more efficient to treat chars as signed, and those were the days of 7-bit ASCII. – Jim Balter Jun 24 '12 at 07:14
10

The C standard is agnostic on the issue of whether plain char is signed or unsigned, and uniquely treats char as distinct from signed char. Furthermore, the base ASCII character set, which includes most major control and English-language printable characters, consists of 128 characters and can therefore be adequately represented by a signed char (at least on any system that provides 8 bits per byte). As Jim Balter points out (see comments below), ASCII does not constitute the complete base character set of the C language, but I'd suspect that it did include the majority of characters in common usage. There is also a massive corpus of C code that relies on properties of (though not necessarily unique to) ASCII (e.g., the NUL special character having a value of zero, alphanumeric characters being arranged sequentially and in ascending order, etc.).
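
To make the ASCII-related properties above concrete, here is a small sketch. The helper names `is_ascii` and `digit_value` are invented for illustration, and the `is_ascii('A')` check assumes an ASCII-compatible execution character set. Note that only the decimal digits are guaranteed by the C and C++ standards to be contiguous; code that assumes contiguous letters is relying on ASCII specifically.

```cpp
// Hypothetical helper: true when c holds a 7-bit ASCII value (0..127).
// Converting to unsigned char first makes the test independent of
// whether plain char is signed on this implementation.
constexpr bool is_ascii(char c) {
    return static_cast<unsigned char>(c) < 0x80;
}

// Hypothetical helper: numeric value of a decimal digit, or -1 otherwise.
// '0'..'9' are contiguous and ascending in every conforming implementation.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

static_assert(is_ascii('\0'), "NUL has the value zero");
static_assert(is_ascii('A'), "assumes an ASCII-compatible character set");
static_assert(!is_ascii('\xE9'), "a byte with the top bit set is outside ASCII");
static_assert(digit_value('7') == 7, "digits map to their numeric value");
static_assert(digit_value('x') == -1, "non-digits are rejected");
```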

Jesse Good
Greg E.
  • 2
    I don't see where you have identified a false premise in the question. The question is actually quite valid, and the answer has to do with history. If the PDP-11 had had instructions that dealt with bytes as unsigned values, then chars would have been unsigned and there would be a lot less buggy code dealing with chars (e.g., every call of the ctype.h is... or to... functions passed a char). – Jim Balter Jun 24 '12 at 07:19
  • @JimBalter, the question has since been edited, but its title and its initial contents seemed to presuppose that `char` is defined as being a signed type by default, which it isn't. As others have explained, `char` is unlike, e.g., `int`, in that the ISO C standard doesn't specify its default sign, so `char` and `signed char` are distinct entities as far as the standard is concerned. That's something which has been addressed on SO in previous questions, and can also be easily answered by a simple Google search. There was also a reference to ASCII in the original question, which I addressed. – Greg E. Jun 24 '12 at 07:33
  • @JimBalter, further, while the question is somewhat more valid as currently stated, it's also partially subjective ("why would it ever make sense to treat strings as signed bytes?" reads like a question asking for a normative response, not a lesson in computing history, but maybe that's just my impression). – Greg E. Jun 24 '12 at 07:35
  • The pre-edited question does not seem *to me* to make the presumption you state. And having served on X3J11, I'm well aware what the standard says about char. As for normative vs. historical, the question makes sense if the asker is not aware that the choice depended on historical contingencies. It was a good question. – Jim Balter Jun 24 '12 at 07:36
  • P.S. Since the question now recognizes that the standard is agnostic, you should edit your answer to fit. SO questions and answers are for everyone, forever, not just for the asker at the moment. – Jim Balter Jun 24 '12 at 07:41
  • P.P.S. Normatively, signed chars are a bad choice, and most languages that treat them as numeric values choose otherwise (and sometimes provide a signed byte type). – Jim Balter Jun 24 '12 at 07:44
  • @JimBalter, why not condense these comments into a response to the OP? – Greg E. Jun 24 '12 at 07:47
  • The C standard is agnostic about the character set. The only mention of ASCII is in regard to trigraphs, which pertain to the source character set, not the execution character set. char sets can use all 8 bits. – Jim Balter Jun 24 '12 at 07:52
  • @JimBalter, the OP specifically made a point about the ASCII character set in relation to the signedness of `char` that I was trying to address, and I haven't made any statements about ASCII w/r/t any C standards. Have I offended you in some way, or is that my imagination? I repeat, as you clearly have a lot to say on the subject, please address a complete response to the OP rather than flooding me with comments. That'll be more useful to everyone. – Greg E. Jun 24 '12 at 07:59
  • There is no mention of ASCII in the current question, and the question implies the use of chars with more than 7 bits. Your answer gives the impression that 7-bit ASCII is the base character set of the C language, which is incorrect. – Jim Balter Jun 24 '12 at 08:02
  • @JimBalter, there was mention of ASCII in the original question, and you'd have to read an awful lot into my response to interpret it as making a global statement about C's base character set. Further, I think my statement w/r/t to ASCII adds value, is still relevant to the question as currently phrased, and is, AFAIK, completely factual. – Greg E. Jun 24 '12 at 08:07
  • "there was mention of ASCII in the original question" -- Of course I KNOW that, Greg, since as you know I read it, and I said "current" question. As for the rest, I disagree, but that's life. Have a good one and stop taking everything so personally. – Jim Balter Jun 24 '12 at 08:17
  • 1
    @JimBalter, you're right, I'll try to add some context to my statement re: ASCII, in light of the change to the OP's question. Thanks, and apologies for my argumentative tone. – Greg E. Jun 24 '12 at 08:20
  • @JimBalter, does the current iteration of my response address your concern re: ASCII? – Greg E. Jun 24 '12 at 08:29
  • @JimBalter, thanks. BTW, I still believe you should formulate a separate response to the OP. You have expert knowledge of this domain, and the info you've provided shouldn't remain buried in the obscure depths of a comment section. – Greg E. Jun 24 '12 at 08:36
5

Jim Balter notes in a comment that

The instructions on the PDP-11 dealing with bytes treated them as signed quantities, so that's how the early C compilers treated them, and unsigned didn't even exist.

I strongly suspect that this is the answer to why the default character type char isn’t required to be unsigned, but one would need a quote from some written historical account in order to be sure.
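
To illustrate what "treating bytes as signed quantities" means in practice, here is a sketch of the two ways a single byte can be widened to an `int`. The helper names are invented, and the concrete values assume 8-bit bytes; this models the behaviour in C++ and makes no claim about actual PDP-11 instructions.

```cpp
// Hypothetical helpers: widen one byte to int the way a sign-extending
// or a zero-extending load would.
constexpr int widen_signed(signed char b) { return b; }      // sign-extends
constexpr int widen_unsigned(unsigned char b) { return b; }  // zero-extends

// The same byte read both ways. Conversion to an unsigned type is
// defined as modular arithmetic, so -23 becomes 233 with 8-bit bytes.
constexpr signed char as_signed = -23;
constexpr unsigned char as_unsigned = static_cast<unsigned char>(as_signed);

static_assert(as_unsigned == 0xE9, "assumes 8-bit bytes");
static_assert(widen_signed(as_signed) == -23,
              "sign extension preserves the negative value");
static_assert(widen_unsigned(as_unsigned) == 233,
              "zero extension yields a value in 0..255");
```

A compiler targeting hardware that offers only one of these two widening operations cheaply has an obvious incentive to give plain `char` the matching signedness.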

As to why it isn’t required to be signed either (!): on a non-two's-complement machine such as a Clearpath Dorado (the only one I know of that's possibly still in use), a signed char cannot hold all values of an unsigned char, since it wastes one bit pattern on a negative zero, or on whatever else that bit pattern is used for. If char were required to be signed, this would be a problem for reinterpreting general data as a sequence of char values. Consequently, on such a machine char has to be unsigned, or else the software has to engage in extreme contortions to deal with it.

Cheers and hth. - Alf
  • 1
    Just a disclaimer: it's been ~30 years since I last touched a PDP-11 assembler, and I can't really recall how it dealt with bytes, or if it did provide e.g. single byte multiplication and division, i.e. whether it *makes sense* to say that it treated bytes as signed quantities. So my suspicion is wholly based on the notion that @Jim Balter knows what he's talking about, and that it doesn't sound far fetched. I don't any longer know the PDP-11 stuff of my own recollection (just about all I remember is that PDP-11 assembly involved @ signs, and the registers were numbered and memory mapped). – Cheers and hth. - Alf Jun 24 '12 at 10:54
  • @Alf, promotion rules ensure that all computations are made on int, not on short or char. So the only question is whether it is easier to sign-extend or zero-extend a char to an int. (As for the need to allow `char` to be unsigned: even if only C++ makes explicit the requirement that characters in the basic set are non-negative, I'm pretty sure it corresponds to the practice for C, and implementations allowing for EBCDIC have `char` unsigned.) – AProgrammer Jun 24 '12 at 11:07
  • @AProgrammer: there are no promotion rules in assembly language. when we're talking about assembly language, or rather, machine code instructions, we're talking about what's convenient for a compiler's code generation, and in particular for a compiler at the time the C language was formed, which was in the early and middle 1970's, while the first C standard came in 1989. the non-negative assumption for `char` values is present in a number of standard library functions such as `isupper` (where it trips up novices). – Cheers and hth. - Alf Jun 24 '12 at 11:15
  • 1
    @Cheersandhth.-Alf: I love the historic view of your post. Computer science classes would sometimes make so much more sense if people pointed out that certain things work the way they do because of history and not because of logic. – Alexander Oh Jul 15 '12 at 19:32
  • @Cheersandhth.-Alf: On processors with registers that are larger than bytes, some have instructions for "load byte into a word register with zero padding", some have "load byte into word register with sign extension", some have "load byte into part of word register, leaving remainder unaffected", and some have two or more of the above. For processors which have only one of the first two forms, I would regard that form as the "promotion rule" used for assembly language targeting those processors. – supercat Jan 06 '17 at 15:30
  • @Alex: A lot of people seem to regard various aspects of the Standard with a bizarre level of esteem. A lot of the flexibility the Standard allows implementations was intended to avoid forcing existing unusual implementations to change in ways that might make them less useful for the purposes they were already serving (and were thus obviously suitable), and not to invite compilers to get creative when generating code for commonplace platforms. – supercat Jan 06 '17 at 15:37
2

As Bjarne said in The C++ Programming Language, whether a char is taken as signed or unsigned is implementation-dependent, and the C++ language provides two explicit types, signed char and unsigned char, on every implementation.

xvatar
2

Others have gone into the historical reasons for it to have been this way when C was first devised and (later) standardised, but there's another reason why this seeming anomaly persists to this day.

It's simply that when you're using char for characters, you don't need to know whether it's signed or unsigned. The standard library provides portable functions for operating on characters regardless of their representation. If you ignore those functions and insist on doing comparisons and arithmetic on characters, you deserve every bug you get.

To take a simple example, it's quite commonplace to check whether a character is printable using the expression c >= ' ' or equivalently c >= 0x20, but you should just use isprint(c) instead. That way, you're not exposing yourself to signed/unsigned confusion and potentially introducing platform-dependent errors into your program.
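
A minimal sketch of such a check follows; the helper names are invented for illustration. One caveat: the `<ctype.h>` / `<cctype>` functions require an argument that is either `EOF` or representable as `unsigned char`, so a plain `char` should be converted to `unsigned char` before the call. Passing a negative `char` directly is undefined behaviour.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true when c is printable in the current C locale.
// The conversion to unsigned char keeps the argument in the range that
// std::isprint accepts, whatever the signedness of plain char.
inline bool is_printable(char c) {
    return std::isprint(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: counts the printable characters in a string.
inline std::size_t count_printable(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (is_printable(c)) ++n;
    }
    return n;
}

// Usage sketch; the expectations assume the default "C" locale.
inline void printable_demo() {
    assert(is_printable('A'));
    assert(!is_printable('\n'));
    assert(count_printable("a\tb") == 2);  // the tab is not printable
}
```

Wrapping the conversion in one helper keeps it from being forgotten at individual call sites.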

Once you get into the habit of using signed char and unsigned char only as small (usually 8-bit) integers for arithmetic, and you use only char when you're operating on character data, it'll seem completely natural that char is a separate type with implementation-defined signedness, and even more natural that string processing functions always use char and char * rather than the signed or unsigned variants. The signedness of char seems about as relevant as the signedness of bool.

Dan Hulme
  • 1
    -1 the above is incorrect. the C standard requires that the argument to a classification function must be non-negative or else EOF. hence, to use these functions correctly the actual argument must be cast to `unsigned char`. otherwise you have formal Undefined Behavior for non-ASCII characters. and e.g. the visual c++ debug runtime library catches this for some functions, and (even though the program would work if not for this!) crashes your program... – Cheers and hth. - Alf Jul 16 '12 at 00:19
0

Char is neither signed nor unsigned by the standard. See https://stackoverflow.com/a/2054941/396583

Community
vines
  • 9
    Correction: `char` *is* either signed or unsigned (but it's a distinct type from both `signed char` and `unsigned char`). – Keith Thompson Jun 24 '12 at 03:34
  • @keith it is a bit complicated in C++ (not sure if it is the same way in C). In C++ char is either signed or unsigned and it is an integer type. But it is *not* a signed integer type nor an unsigned integer type. So you need to be very careful how you word specific statements. – Johannes Schaub - litb Jun 24 '12 at 09:18
  • @JohannesSchaub-litb How come `char is either signed or unsigned and it is an integer type. but it is not a signed integer type nor an unsigned integer type`? The last part contradicts the first part? – Alexey Frunze Jun 24 '12 at 09:24
  • @alex no, it does not contradict the first part. There is no formal definition of *signed type*, so people and also the standard itself take it to mean "type that can represent negative values". But there is a definition of *signed integral type* which explicitly lists all types. char as well as bool are not included in that list. You will find that `numeric_limits::is_signed` yields true and it is specified by "T is signed". – Johannes Schaub - litb Jun 24 '12 at 09:29
  • @JohannesSchaub-litb http://ideone.com/FZ2Ms acknowledges the difference between the char and integral types. So one should be able to overload `` functions separately for (plain) `char` and `(u)int8_t` aka `(un)signed char`, no? Completely off-topic, but hey, mildly relevant `:)` – rubenvb Jun 24 '12 at 10:13
0

Why do most string functions in the C/C++ stdlibs take char* pointers?

In C++ one uses std::string. In C, the usage patterns were already too established when unsigned types were introduced, and I wouldn't rule out an efficiency concern.

no meaningful character values below zero

Well, there is a constraint somewhere in the C++ standard that characters in the basic character set are non-negative. But it's naïve to think that the constraint holds for all characters.

That constraint forces implementations that allow EBCDIC as an encoding to make their char unsigned.
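
The constraint can be spot-checked at compile time. This is only a sketch: `all_non_negative` is an invented helper, and the string below covers just a sample of the basic character set. Characters outside that set, such as `'\xE9'`, carry no such guarantee and are negative wherever plain `char` is signed.

```cpp
// Hypothetical helper: true when every char in a NUL-terminated string
// has a non-negative value as plain char. Written recursively so that it
// remains a valid constexpr function in C++11.
constexpr bool all_non_negative(const char* s) {
    return *s == '\0' || (*s >= 0 && all_non_negative(s + 1));
}

// A sample of the basic character set: letters, digits, some punctuation.
static_assert(all_non_negative("abcxyzABCXYZ0123456789 _{}[]#()<>%:;.?*+-/"),
              "basic character set members are non-negative as plain char");
```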

Most modern compilers (GCC, MSVC) treat char as signed by default.

GCC's behaviour depends on the target, and it has options (-fsigned-char and -funsigned-char) to override the target's default.

AProgrammer
  • "there is a constraint somewhere in the standard that characters in the basic characters set are positive" -- No, there is no such constraint. The only constraint (in addition to including a minimal set of characters and the integers being contiguous) is that they fit in a byte. – Jim Balter Jun 24 '12 at 08:32
  • @JimBalter, see C++1998, 2.2/3, C++2011 2.3/3 (it uses non-negative, obviously \0 has a zero value) but I've noted in my archives that I haven't found the corresponding constraint in the C standards (this note dates from before C11 so I haven't searched there, but I probably looked in C90 and C99; it isn't in 5.2.1/3 in C11 which is the direct equivalent of 2.3/3 in C++11). I've added a qualification. – AProgrammer Jun 24 '12 at 09:01
  • Sorry, I forgot that this was tagged both C and C++ and that you were addressing C++ as well as C. Thanks for the qualification. The C standard does say that the arguments to the ctype functions must be representable as unsigned char (or EOF), which in practice could be taken as implying that the char set is positive. – Jim Balter Jun 24 '12 at 09:10
  • @JimBalter, I don't think so. ctype functions take for argument an int which is either EOF or a char cast to unsigned char (which is exactly what getc returns BTW). I've used them with Latin1 locale on implementations with a signed char, thus with negative characters. – AProgrammer Jun 24 '12 at 09:21
  • I think it's a semantic quibble. Certainly chars with the 8th bit set have negative values when stored in signed char, or char implemented as signed, but that doesn't mean that the character set contains negative values; the API (getc and ctype) implies otherwise. I think the C++ constraint clarifies this. – Jim Balter Jun 24 '12 at 09:27