#include <stdio.h>
#include <string.h>

int main(void)
{
    char ch='a';

    printf("sizeof(ch)          = %d\n", sizeof(ch));
    printf("sizeof('a')         = %d\n", sizeof('a'));
    printf("sizeof('a'+'b'+'C') = %d\n", sizeof('a'+'b'+'C'));
    printf("sizeof(\"a\")       = %d\n", sizeof("a"));
}

This program uses sizeof to calculate sizes. Why is the size of 'a' different from the size of ch (where ch='a')?

sizeof(ch)          = 1
sizeof('a')         = 4
sizeof('a'+'b'+'C') = 4
sizeof("a")         = 2
AmanSharma
  • You should be using `%zu` as `sizeof` returns `size_t` not `int` – Spikatrix Jul 04 '18 at 12:43
  • You need to tag this either C or C++, because this code will give very different answers depending on language. Basically C++ recognized that C was being stupid and fixed various obvious language flaws, while C refuses to admit that it is stupid. – Lundin Jul 04 '18 at 12:45
  • @Sam Varshavchik Not necessarily a dupe because the first two rows will give 1 vs 4 in C, but 1 vs 1 in C++. The 3rd row will indeed mess around with implicit promotion in C++, but not in C. – Lundin Jul 04 '18 at 12:47
  • Odd interpretation of the word "duplicate" here. I've reopened. Disk is cheap. Search engines are powerful. Let's only close as duplicates if it's a duplicate. – Bathsheba Jul 04 '18 at 12:52
  • in that case, maybe the _question_ needs upvoting. – Jean-François Fabre Jul 04 '18 at 12:56
  • @Bathsheba "Disk is cheap" is a non-reason... – user202729 Jul 04 '18 at 15:35
  • @user202729: In your opinion, with respect. When researching, it's always good to have a selection of sources. This quixotic closing of questions as broad so-called duplicates is the thing that makes no sense. – Bathsheba Jul 04 '18 at 15:36
  • I can find 3 partial duplicate targets, each answer a part of the question. I am flagging to close as **too broad**. – user202729 Jul 04 '18 at 15:44
  • @user202729 It’s not too broad. It asks a very specific, real-world question about software engineering. And it does not appear to have an exact duplicate. – Davislor Jul 04 '18 at 23:10
  • @Davislor If it can be split into 3 different questions (each of which is on-topic for [so]), it's too broad. – user202729 Jul 05 '18 at 03:30
  • Also, it doesn't have an exact duplicate precisely because it's too broad, in this case. – user202729 Jul 05 '18 at 07:13
  • I don't get why this is a duplicate. The answer to this question is "Because C character literals are ints". "Why are C character literals ints" is a different question, which I cannot ask before I know that C character literals are ints. Right? The second question implies that you already know the answer to the first question. But you don't. – kotlomoy Jul 08 '18 at 09:44
  • @kotlomoy I searched a lot before asking this question. I am not sure what the two users felt before marking this as a duplicate. – AmanSharma Jul 10 '18 at 02:47

5 Answers


TL;DR - sizeof works on the type of the operand.

  1. sizeof(ch) == sizeof(char)
  2. sizeof('a') == sizeof(int)
  3. sizeof('a' + 'b' + 'C') == sizeof(int)
  4. sizeof("a") == sizeof(char[2])

Let's see each case now.

  1. ch is defined to be of char type, so this case is straightforward: sizeof(ch) is sizeof(char), which is 1 by definition.

  2. In C, sizeof('a') is the same as sizeof(int), as a character constant has type int.

    Quoting C11,

    An integer character constant has type int. [...]

    In C++, a character literal has type char.

  3. sizeof is a compile-time operator (except when the operand is a VLA), so the type of the expression is used. As noted earlier, all integer character constants have type int, so int + int + int produces int; the type of the operand is therefore int.

  4. "a" is an array of two chars, 'a' and 0 (null-terminator) (no, it does not decay to pointer to the first element of the array type), hence the size is the same as of an array with two char elements.


Finally, sizeof produces a result of type size_t, so you must use the %zu format specifier to print the result.
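
For reference, here is the question's program with the format specifier corrected (an editor's sketch, not part of the original answer; the annotated values assume a typical implementation where int is 4 bytes):

#include <stdio.h>

int main(void)
{
    char ch = 'a';

    /* sizeof yields a size_t, so %zu (C99 and later) is the right specifier */
    printf("sizeof(ch)          = %zu\n", sizeof ch);            /* 1: ch has type char */
    printf("sizeof('a')         = %zu\n", sizeof 'a');           /* sizeof(int) in C, e.g. 4 */
    printf("sizeof('a'+'b'+'C') = %zu\n", sizeof ('a'+'b'+'C')); /* int + int + int is int */
    printf("sizeof(\"a\")        = %zu\n", sizeof "a");          /* 2: char[2], 'a' plus '\0' */
}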

Sourav Ghosh
  • I wonder what problems, if any, would have resulted from making `sizeof` operate only on lvalues? Especially on systems that use FLT_EVAL_METHOD==2 and have a `long double` type which is bigger than `double`, it would seem a bit weird to suggest that `sizeof (1.0/10.0)` should report 8 even if `long double d = (1.0/10.0);` would store a value that cannot be represented in an 8-byte "double". – supercat Jul 04 '18 at 18:04
  • Suggest "character constant has type integer" --> "character constant has type `int`". – chux - Reinstate Monica Jul 04 '18 at 20:03
  • @supercat Among other things it would probably have made the following construct illegal: `sizeof(type)`. – dgnuff Jul 05 '18 at 04:01
  • @dgnuff: Mea culpa. What I meant was excluding the operator on values that aren't lvalues. – supercat Jul 05 '18 at 06:59
  • @supercat, relatively few problems, I suspect. A bit of cognitive dissonance, for one: it is nicely consistent that `sizeof` works on *all* expressions. As for practical programming issues, the one I see is the case where you want the size of a string literal, and especially where that literal is conveyed via a macro, so that its size may be changed at some location distant from its use -- maybe even on the compiler command line. – John Bollinger Jul 05 '18 at 13:53
  • @JohnBollinger: From what I can tell, what makes sizeof is useful with string literals is that they are `char[] const` lvalues (which would be allowed if sizeof were restricted to lvalues) rather than `char const*` values. As for having it work on "all" expressions, the `&` operator doesn't, so why should `sizeof` be special in that regard? – supercat Jul 05 '18 at 14:38
  • You're right, @supercat, a string literal is an lvalue, so that's not a problem. As for working with all expressions, however, I submit that many C operators work with any operand of suitable type. Among the unary operators, for example, there are `-`, `!`, and `~`, and even the function-call operator, `()`. C requires operands to be lvalues only where it needs to refer to *storage*, which, of course, is what distinguishes lvalues from non-lvalue expressions. All expressions have sizes as determined by their types, whether or not they have any associated storage. – John Bollinger Jul 05 '18 at 14:55
  • @JohnBollinger: Within a function like `void foo(someArrayType arr);` the expression `arr` is clearly a value, but attempting to use `sizeof` on that type is unlikely to yield an intended result. IMHO, a clean way of preventing such nonsense would be to say that within such a function, the expression `arr` would not be an lvalue, but would instead yield a value of pointer-to-member type, but only if `sizeof` required an lvalue. – supercat Jul 05 '18 at 16:43
  • @supercat, I take your point, but I observe that C says that there are no functions with parameters of array type, because declarations that have a form that would declare other identifiers as arrays declare function parameters as pointers, instead. I think I understand why that decision was made, but I'd say that the nonsense in this area is in allowing such a deceptive form of declaration in the first place, not in handling the resulting parameters according to the type with which they are (actually) declared. – John Bollinger Jul 05 '18 at 17:55
  • @JohnBollinger: If a function's parameter is *declared* as being of an array type, it may not be necessary to have the parameter "pre-decomposed" into a pointer, rather than having it be an array lvalue which gets converted into a pointer value in many contexts, but treating it as an lvalue of pointer type was an "unforced error". – supercat Jul 05 '18 at 18:24

In C, 'a' is a constant of type int. It is not a char. So sizeof('a') will be the same as sizeof(int).

sizeof(ch) is the same as sizeof(char). (The C standard guarantees that all alphanumeric constants - and some others - of the form 'a' can fit into a char, so char ch='a'; is always well-defined.)

Note that in C++, 'a' is a literal of type char; yet another difference between C and C++.

In C, sizeof("a") is sizeof(char[2]) which is 2. sizeof does not instigate the decay of an array type to a pointer.

In C++, sizeof("a") is sizeof(const char[2]) which is 2. sizeof does not instigate the decay of an array type to a pointer.

In both languages, 'a'+'b'+'C' has type int; in C++ this is due to the implicit promotion of integral types.
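
A quick way to observe the C-side rules at compile time is C11's _Generic selection (an editor's sketch, not part of the original answer; C only):

#include <stdio.h>

int main(void)
{
    /* _Generic selects a branch by the type of its controlling expression */
    puts(_Generic('a',       char: "char", int: "int", default: "other")); /* prints "int"  */
    puts(_Generic((char)'a', char: "char", int: "int", default: "other")); /* prints "char" */
    puts(_Generic('a' + 'b', char: "char", int: "int", default: "other")); /* prints "int"  */
}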

Bathsheba
  • Great answer but for the very minor issue of not being explicit about `'a'+'b'+'C'` being an example of *integral promotion*, not *integral conversion*, in standard terms. (Both are *conversions* though, because this is also used as an umbrella term. The naming is… interesting.) – Arne Vogel Jul 04 '18 at 14:06
  • @ArneVogel: Thank you, if I had a dollar every time I say or write that incorrectly... – Bathsheba Jul 04 '18 at 14:07
  • @chux Thanks, I’ve fixed but I think I’ll leave all the C++ stuff up - the joys of a moving question! – Bathsheba Jul 04 '18 at 20:10

First of all, the result of sizeof is of type size_t, which should be printed with the %zu format specifier. Ignoring that part, and assuming int is 4 bytes:

  • printf("sizeof(ch) %d\n",sizeof(ch)); will print 1 in C and 1 in C++.

    This is because a char is per definition guaranteed to be 1 byte in both languages.

  • printf("sizeof('a') %d\n",sizeof('a')); will print 4 in C and 1 in C++.

    This is because character literals are of type int in C, for historical reasons 1), but they are of type char in C++, because that's what common sense (and ISO 14882) dictates.

  • printf("sizeof('a'+'b'+'C) %d\n",sizeof('a'+'b'+'C')); will print 4 in both languages.

    In C, the resulting type of int + int + int is naturally int. In C++, we have char + char + char. But the + invokes the implicit type promotion rules, so we end up with int in the end no matter what.

  • printf("sizeof(\"a\") %d\n",sizeof("a")); will print 2 in both languages.

    The string literal "a" is of type char[] in C and const char[] in C++. In either case we have an array consisting of an a and a null terminator: two characters.

    As a side note, this happens because the array "a" does not decay into a pointer to the first element when it is the operand of sizeof. Should we provoke array decay, for example by writing sizeof("a"+0), we would get the size of a pointer instead (likely 4 or 8); a short demo follows below.
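
A minimal demo of that contrast (an editor's sketch; the exact pointer size is implementation-defined):

#include <stdio.h>

int main(void)
{
    printf("%zu\n", sizeof "a");       /* 2: char[2] does not decay under sizeof */
    printf("%zu\n", sizeof ("a" + 0)); /* size of a char *, e.g. 4 or 8 */
}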


1) Somewhere in the dark ages there were no types and everything you wrote would boil down to int no matter what. Then, when Dennis Ritchie started to cook together some manner of de facto standard for C, he apparently decided that character literals should always be promoted to int. And then later, when C was standardized, they said that character literals are simply int.

Upon creating C++, Bjarne Stroustrup recognized that all of this didn't make much sense and made character literals type char, as they ought to be. But the C committee stubbornly refuses to fix this language flaw.

Lundin
  • My copies of the C89 and C99 standard define `sizeof` to return counts of "storage units", not "bytes", whatever those are. – Eric Towers Jul 04 '18 at 19:08
  • @EricTowers "byte" is today typically only used for 8 bits, but `sizeof` returns the number of `char`s - and a C `char` can be larger than 8 bits (it's 16 bits on a CPU I'm working with, for example). – pipe Jul 04 '18 at 19:54
  • @pipe : And a byte has been 9-bits on architectures I've worked on. My point is that, since the standard does not define or use "byte"s it is incorrect to have "is per definition guaranteed to be 1 byte". – Eric Towers Jul 04 '18 at 19:56
  • Detail: C standard does not have _character literals_. It does have _character constants_ which are type `int`. C's 2 literals: _string_ and _compound_ can have their address taken, unlike constants. – chux - Reinstate Monica Jul 04 '18 at 20:12
  • @EricTowers C11/C99 6.5.3.4/2 or C90 6.3.3.4 "The sizeof operator yields the size (in bytes) of its operand". Maybe cite the standard next time before making up such statements. – Lundin Jul 05 '18 at 06:15
  • ISO/IEC 9899:1990 3.6. Summarizing: "bytes" != bytes. For more on this discrepancy, see https://www.misra.org.uk/forum/viewtopic.php?t=973 – Eric Towers Jul 05 '18 at 07:01
  • @EricTowers Yes I am well-aware that the standard allows a byte to be something else than 8 bits. Nothing in this answer contradicts that. As proven by quoting normative text in the 3 latest C standards, the sizeof operator returns the size in bytes. I have only ever spoken about bytes. – Lundin Jul 05 '18 at 07:10
  • @Lundin : You have not spoken about bytes. You have spoken about "bytes". And as cited, using Standard meanings of plain language words in semantically conflatable settings is misleading. – Eric Towers Jul 05 '18 at 07:25
  • @EricTowers What it boils down to is that anyone designing C programs for compatibility with wildly exotic DSPs is wasting their time almost as much as people writing pedantic comments on internet sites along the lines of: "but a byte might be 57 bits!", "but an int might have padding bits and there will be trap representations!", "but this system might be a 33 bit CPU signed magnitude computer!" etc. Sure the standard allows it, but wasting energy caring about it is a huge waste of everyone's time. Focus on portability to mainstream computers. – Lundin Jul 05 '18 at 07:46
  • What it boils down to is anyone reading the term "bytes" in your Answer without comment that the Standard doesn't mean bytes will think you mean bytes. While you say *you* are aware of this defect of the Standard, you are apparently unaware of the widespread confusion on the issue. It's related to your undocumented claim "Then when Dennis Ritchie started to cook together some manner of de facto standard for C, he apparently decided that character literals should always be promoted to int.", which is in direct contrast with the documented reason : the PDP-11 had no 8-bit GP register. – Eric Towers Jul 05 '18 at 15:31
  • @Lundin: Unfortunately, the authors of the C Standard seem opposed to the idea of recognizing mainstream compilers and platforms, and would rather saddle the 99% of programs that nobody would ever have any interest in running on anything other than octet-based linear-address architectures with two's-complement silent-wraparound integer semantics, with the limitations of quirky architectures which in some cases might not even exist [e.g. those where left-shifting a negative number would do anything other than multiply by a power of two in cases where such a multiply would not overflow]. – supercat Jul 05 '18 at 16:50
  • When C was first designed, there were only three kinds of *values*: pointers, integers, and double-precision floating-point. Evaluating an object would promote its value to the largest type of the appropriate kind. The type of a character literal value had to be `int` because there was no such thing as a *value* of `char` type. – supercat Jul 05 '18 at 18:56

As others have mentioned, the C language standard defines the type of a character constant to be int. The historical reason for this is that C, and its predecessor B, were originally developed on DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons, and the reason that character escapes default to octal, and that octal constants start with just 0 while hex needs \x or 0x, is that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.
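
A small illustration of those octal and hexadecimal spellings (an editor's sketch, assuming an ASCII execution character set):

#include <stdio.h>

int main(void)
{
    /* '\101' is an octal escape and '\x41' a hex escape; both name 'A' in ASCII */
    printf("%c %c\n", '\101', '\x41');  /* prints: A A */
    printf("%d\n", 0101 == 0x41);       /* 1: octal 0101 and hex 0x41 are both 65 */
}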

Automatic promotion to int causes nothing but trouble today. (How many programmers are aware that multiplying two uint32_t expressions together is undefined behavior, because some implementations define int as 64 bits wide, the language requires that any type of lower rank than int must be promoted to a signed int, the result of multiplying two int multiplicands has type int, the multiplication can overflow a signed 64-bit product, and this is undefined behavior?) But that’s the reason C and C++ are stuck with it.
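
A sketch of that trap, with one common defensive idiom (hypothetical scenario: an implementation where int is 64 bits wide; not part of the original answer):

#include <stdint.h>

uint32_t mul32(uint32_t a, uint32_t b)
{
    /* If int is wider than 32 bits, a and b promote to signed int here and
       the product can overflow: undefined behavior.
       return a * b;   -- potentially UB on such an implementation */

    /* Doing the arithmetic in uint64_t is always well-defined; converting
       the result back to uint32_t is defined as reduction modulo 2^32. */
    return (uint32_t)((uint64_t)a * b);
}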

Davislor
  • Thanks. Good research though. – AmanSharma Jul 05 '18 at 07:04
  • Note that the authors of the Standard have expressly recognized the possibility that an implementation may be conforming and yet be of such poor quality as to be useless, but assume that quality implementations won't go out of their way to behave in the least-useful fashion the Standard would permit. The Rationales for all versions of the Standard describe expressions where they would expect quality commonplace implementations to treat signed and unsigned math identically. The UB resulting from unwanted promotions to signed types will only be a problem when using low-quality compilers... – supercat Jul 05 '18 at 16:58
  • ...(which, for whatever reason, programmers have become all too willing to tolerate). The fact that a piece of code won't work on a compiler that is designed to be of needlessly poor quality doesn't mean the code is broken. It would be impossible to write any program that couldn't be sunk by a "conforming implementation" of sufficiently poor quality. – supercat Jul 05 '18 at 16:58
  • @supercat [It was your answer about how C is not a “safe” programming language](https://cs.stackexchange.com/a/93817/40057) that brought that example to mind. :) – Davislor Jul 05 '18 at 17:24
  • @supercat I agree that a lot of language-lawyering isn’t especially relevant to coding today. Sometimes, for fun, I point out loopholes based on the fact that one’s-complement or sign-and-magnitude arithmetic are still technically allowed. But they’re only used in a few mainframe architectures from the ’60s (although UNIVAC does still support one of those). Or that there is an implementation that supports EBCDIC as the source and execution character set – Davislor Jul 05 '18 at 17:49
  • @Davislor: If the Standard recognized a concept of a "limited implementation" which can't support all features, but will reject programs that require or may require features it can't support, then a C99 implementation could be practical on the Univac. I don't think there's any practical way a ones'-complement or sign-magnitude machine can efficiently handle a uint_least64_t or unsigned long long type without also being able to efficiently process two's-complement arithmetic unless its basic word size was 65 bits or longer. – supercat Jul 05 '18 at 18:36
  • @supercat It does recognize a distinction between a hosted and freestanding implementation, but yes. Emulating higher-precision math would be difficult, and you couldn’t have the exact-width types we were talking about anyway, because padding bits are not allowed. – Davislor Jul 05 '18 at 18:39
  • @Davislor: The latest Univac C implementation I've read about supported a 72-bit "long long" type, but not an unsigned equivalent. The documentation didn't say how the "long long" was stored, but I would guess it probably used a non-binary representation with the upper word being (2**36-1) times the lower. Such an approach would be allowable for an extended signed integer type, but would not be allowable for an unsigned type. – supercat Jul 05 '18 at 18:50
  • @supercat Interesting! I did not know that. But we’re getting off-topic. – Davislor Jul 05 '18 at 19:00
  • @Davislor: My intended point was that the requirement that implementations must support a 64-bit unsigned data type made it impractical to produce a meaningfully-conforming C99 implementation on any existing sign-magnitude or ones'-complement hardware. As a consequence, concessions elsewhere in the Standard which were made to accommodate such machines serve no useful purpose unless weakening the language for no apparent reason is considered an "useful purpose". – supercat Jul 05 '18 at 19:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174446/discussion-between-davislor-and-supercat). – Davislor Jul 05 '18 at 19:09

I'm assuming the code was compiled as C.
In C, 'a' is treated as an int type, and int has a size of 4. In C++, 'a' is treated as a char type; if you try compiling your code in cpp.sh, it should return 1.

Wolf
  • "_int has a size of 4_" Usually, yes. But not always. – Spikatrix Jul 04 '18 at 12:56
  • I have a platform where `sizeof(int)==1`. – pipe Jul 04 '18 at 19:55
  • @pipe What platform/compiler is that with `sizeof(int)==1`? – chux - Reinstate Monica Jul 04 '18 at 20:14
  • @chux: I believe Cray XMP used `CHAR_BIT == 32` so `sizeof(int) == 1`. Most people don't have one of those kicking around any more — or the power or water supply necessary to keep it happy. – Jonathan Leffler Jul 04 '18 at 22:39
  • @chux A custom core in a small embedded chip where `CHAR_BIT==16` and an int is 16 bits. – pipe Jul 05 '18 at 06:47
  • @chux: I've written code for a DSP where a `char` is a 16-bit signed integer type, and `int` is likewise. The only way the hardware could support changing an octet in memory would be to do a 16-bit load, change 8 bits of the loaded value, and then do a 16-bit store. It may have been possible for a compiler to generate code to process character-type writes with a read-modify-write sequence, but that would have made code that operates on a sequence of character-type values *really* slow. – supercat Jul 05 '18 at 18:53