
I have a two part question:

  1. Understand output from sizeof
  2. Understand how strings are stored in variables (e.g. bits and ram)

Question 1

I'm trying to understand the output from the following piece of C code.

printf("a: %ld\n", sizeof("a")); // 2
printf("abc: %ld\n", sizeof("abc")); // 4

It always seems to be one larger than the actual number of characters specified.

The docs suggest that the returned value represents the size of the object (in this case a string) in bytes. So if the size of a gives us back 2 bytes, then I'm curious how a represents 16 bits of information.

If I look at the binary representation of the ASCII character a, I can see it is 01100001. But that has only 3 bits set, and it fits within 1 byte.

Question 2

Also, how do large strings get stored into a variable in C? Am I right in thinking that they have to be stored within an array, like so:

char my_string[5] = "hello";

Interestingly when I have some code like:

char my_string = "hello";
printf("my_string: %s\n", my_string);

I get two compiler errors:

- incompatible pointer to integer conversion initializing 'char' with an expression of type 'char [6]'
- format specifies type 'char *' but the argument has type 'char'

...which I don't understand. Firstly it states the type is presumed to be a size of [6] when there's only 5 characters. Secondly the mention of a pointer here seems odd to me? Why does printf expect a pointer and why does not specifying the length of the variable/array result in a pointer to integer error?

By the way I seemingly can set the length of the variable/array to 5 rather than 6 and it'll work as I'd expect it to char my_string[5] = "hello";.

I'm probably just missing something very basic/fundamental about how bits and strings work in C.

Any help understanding this would be appreciated.

Integralist
  • sizeof returns a size_t, aka uintmax_t. Read [this](http://stackoverflow.com/questions/2524611/how-can-one-print-a-size-t-variable-portably-using-the-printf-family) to find the right format specifier. Strings in C are terminated by a '\0', so "a" is {'a', '\0'}, i.e. two chars. `char my_string[5] = "hello";` does not hold a valid string: there is no space for the '\0'. – Stargateur Nov 20 '16 at 12:58
  • Post two separate questions. – John Zwinck Nov 20 '16 at 12:58
  • @Stargateur: `size_t` is not synonymous with `uintmax_t`. For example, on many 32-bit platforms, `size_t` is 32-bit, but `uintmax_t` may represent 64-bit integers (`unsigned long long`). – dreamlax Nov 20 '16 at 13:13
  • @dreamlax nope, uintmax_t is the largest integer type provided by the implementation. So on 32-bit, uintmax_t => uint32_t. [doc](https://en.wikibooks.org/wiki/C_Programming/C_Reference/stdint.h). The purpose of size_t is to hold a size, so an unsigned integer. uintmax_t is perfect to represent that. – Stargateur Nov 20 '16 at 13:19
  • `char my_string[5] = "hello";` will likely lead to buffer overruns. As @JohnBode points out in his [answer](http://stackoverflow.com/a/40705260/2226988), let the compiler figure out what the size of the array should be. – Tom Blodget Nov 20 '16 at 17:56
  • @Stargateur: uintmax_t is usually set to `unsigned long long`, which even on 32-bit platforms represents a 64 bit integer, but `size_t` on 32-bit platforms is usually set to `unsigned long`, therefore they are **not synonymous**. – dreamlax Nov 20 '16 at 21:03
  • @Stargateur: It even mentions in the documentation that you linked to that the *minimum* required range for `size_t` is only [0-65535], whereas the minimum required range for `uintmax_t` is [0-2^64) – dreamlax Nov 20 '16 at 21:06

2 Answers


The first part of the question comes down to how strings are stored in C. A string in C is nothing more than a series of characters (char) with a \0 added at the end, which is why you're seeing the +1 from sizeof. The same rule explains your second part: if you wrote char my_string[4] = "hello"; the compiler would complain that the initializer is too long for the array, and char my_string[5] = "hello"; is only accepted because C lets the array drop the terminator, which leaves it without a valid string.

Now onto the second part. A string is a series of characters, but you don't store each character in its own variable. The characters sit next to each other in memory, and you usually work with a pointer to the first one, which lets you reach the rest. That is why printf's %s expects a char * rather than a char. Additional information regarding pointers and strings in C can be found here: Pointer to a String in C

SenselessCoder
  • So just to be clear, does even a single character within a string `char x[2] = "a"` still create a variable that is assigned a pointer to the underlying array (x is assigned a pointer to `["a", "\0"]`)? – Integralist Nov 20 '16 at 13:42
  • Actually answered my own question there (doh), when I realised I'd need an array because of the null terminator – Integralist Nov 20 '16 at 13:43

In C, a string is a sequence of character values followed by a zero-valued terminator. For example, the string "hello" is the sequence of character values {'h', 'e', 'l', 'l', 'o', 0} [1]. Strings (including string literals) are stored as arrays of char (or wchar_t for wide-character strings). To account for the terminator, the size of the array must be at least one greater than the number of characters in the string:

char greeting[6] = "hello";

The storage for greeting will look like

          +---+
greeting: |'h'| greeting[0]
          +---+
          |'e'| greeting[1]
          +---+
          |'l'| greeting[2]
          +---+
          |'l'| greeting[3]
          +---+
          |'o'| greeting[4]
          +---+
          | 0 | greeting[5]
          +---+

Storage for a string literal is largely the same [2]:

          +---+
 "hello": |'h'| "hello"[0]
          +---+
          |'e'| "hello"[1]
          +---+
          |'l'| "hello"[2]
          +---+
          |'l'| "hello"[3]
          +---+
          |'o'| "hello"[4]
          +---+
          | 0 | "hello"[5]
          +---+

Yes, you can apply the subscript operator [] to a string literal just like any other array expression.

Except when it is the operand of the sizeof or unary & operators, or is a string literal used to initialize a character array in a declaration, an expression of type "N-element array of T" will be converted (will "decay") to an expression of type "pointer to T", and the value of the expression will be the address of the first element of the array. So, the string literal "hello" is an expression of type "6-element array of char". If I pass that literal as an argument to a function like

printf( "%s\n", "hello" );

then both of the string literal expressions "%s\n" and "hello" are converted from "4-element array of char" [3] and "6-element array of char" to "pointer to char", so what printf receives are pointer values, not array values.

You've already seen two exceptions to the conversion rule. You saw the first in your code when you used the sizeof operator and got a value one more than you expected. sizeof evaluates to the number of bytes required to store the operand. Because of the zero terminator, it takes N+1 bytes to store an N-character string.

The second exception is the declaration of the greeting array above; since I'm using the string literal to initialize the array, the literal is not converted to a pointer value first. Note that you can write that declaration as

char greeting[] = "hello"; 

In that case, the size of the array is taken from the size of the initializer.

The third exception occurs when the array expression is the operand of the unary & operator. Instead of evaluating to a pointer to a pointer to char (char **), the expression &greeting evaluates to type "pointer to 6-element array of char", or char (*)[6].

The length of a string is the number of characters before the zero terminator. All the standard library functions that deal with strings expect to see that terminator. The size of the array to store that string must be at least one greater than the maximum length of the string you intend to store.


  1. Sometimes you'll see people write '\0' instead of a naked 0 to represent a string terminator; they mean the same thing.
  2. Storage for string literals is allocated at program startup and held until the program terminates. String literals may be stored in a read-only memory segment; attempting to modify the contents of a string literal results in undefined behavior.
  3. '\n' counts as a single character.

John Bode