28

Understandably, going over a buffer errors out (or creates an overflow), but what happens if there are less than 12 bytes used in a 12 byte buffer? Is it possible or does the empty trailing always fill with 0s? Orthogonal question that may help: what is contained in a buffer when it is instantiated but not used by the application yet?

I have looked at a few pet programs in Visual Studio and it seems that they are appended with 0s (or null characters) but I am not sure if this is a MS implementation that may vary across language/ compiler.

hexadec0079
  • 1
    `memset` can be used to ensure the buffer is initialized with zeros. – TruBlu Sep 17 '18 at 02:14
  • 4
    @TruBlu: Or in C++, `std::fill`. – MSalters Sep 17 '18 at 08:58
  • 4
    @TruBlu don't do that, i've seen lots of people do malloc followed by memset, or char foo[X] followed by memset, no good reason to. if you want them zero-initialized, use calloc() instead of malloc(), or use `char foo[x]={0};` and it will be zero-initialized. – hanshenrik Sep 17 '18 at 13:23
  • 1
    Define "buffer". In general, a 12-byte array is not a data structure that I would call a 12-byte buffer. – Tom Blodget Sep 17 '18 at 17:17
  • 2
    @hanshenrik Good to know. Thank you for providing optimized alternatives. `malloc()` is more efficient than `calloc()`, therefore it is the preferred method for allocating memory **unless** zero initialization is required. – TruBlu Sep 17 '18 at 19:20
  • @TruBlu you're spot on, that's exactly why malloc and calloc co-exists side-by-side (as opposed to deprecating 1 of them for the other) – hanshenrik Sep 18 '18 at 15:22

10 Answers

18

Take the following example (within a block of code, not global):

char data[12];
memcpy(data, "Selbie", 6);

Or even this example:

char* data = new char[12];
memcpy(data, "Selbie", 6);

In both of the above cases, the first 6 bytes of data are S,e,l,b,i, and e. The remaining 6 bytes of data are considered "unspecified" (could be anything).

Is it possible or does the empty trailing always fill with 0s?

Not guaranteed at all. The only allocator that I know of that guarantees zero byte fill is calloc. Example:

char* data = (char*)calloc(12, 1);  // will allocate an array of 12 bytes and zero-init each byte
memcpy(data, "Selbie", 6);

what is contained in a buffer when it is instantiated but not used by the application yet?

As per the most recent C++ standards, the bytes delivered by the allocator are considered "unspecified". You should assume that it's garbage data (anything). Make no assumptions about the content.

Debug builds with Visual Studio will often initialize buffers with 0xcc or 0xcd values, but that is not the case in release builds. There are, however, compiler flags and memory allocation techniques for Windows and Visual Studio with which you can guarantee zero-initialized memory allocations, but they are not portable.

selbie
  • 4
    "The remaining 6 bytes of `data` are undefined but will be something." But if they're undefined, isn't it undefined behaviour to try to ascertain that "something"? So it doesn't really matter, and the only solution is never to read uninitialised memory. It's not a case of "randomness" (especially not as an RNG); rather, I would say to assume uninitialised data are *poisonous*. There's probably an exception to this for reading `char` types, which can't have trap representations or padding, but it still wouldn't be meaningful or good code to get into a situation of reading the uninitialised part. – underscore_d Sep 17 '18 at 07:08
  • 11
    "You should assume that it could be filled with random bytes." - This answer is good, but I want to object to the use of the word "random". Allocating memory and then reading it isn't a good source of randomness. – ymbirtt Sep 17 '18 at 08:09
  • The other "allocator" that zero-fills is for objects with static storage duration. In Unix parlance, they are allocated in the BSS segment (historically "Block Started by Symbol"). The object file need only specify the position and length of these variables, because the run-time loader will fill them with zeros. Oh, and it might be worth mentioning that memory checkers exist that will detect attempts to read uninitialised memory - I normally recommend Valgrind, but there might be other choices on the Windows platform. – Toby Speight Sep 17 '18 at 08:09
  • 6
    "The remaining 6 bytes of data are undefined but will be something." - No, this is wrong! Accessing uninitialized values is undefined behavior. You cannot write code under the assumption that "well, there's something there, I don't care what, it doesn't matter". The optimizer may completely rearrange your code under the assumption that access to uninitialized values does not happen. – Sebastian Redl Sep 17 '18 at 08:32
  • `strncpy(data, "Selbie", 12)` would also zero-fill the rest of the buffer. – ilkkachu Sep 17 '18 at 08:57
  • Why initialise it with `0xcc`? – Lorraine Sep 17 '18 at 10:06
  • 2
    @Wilson It's an arbitrary value (it's easy to identify visually, though). [Different values have different meanings.](https://stackoverflow.com/a/370362/4083309) The reason is to give hints to the developer during debugging as to what went wrong (or simply, which variables have not yet been initialized). – Arne Vogel Sep 17 '18 at 10:57
  • Thanks for the constructive feedback everyone. I actually looked up the expected behavior in the C++ standard: "There are no constraints on the contents of the allocated storage on return from the allocation function. *The order, contiguity, and initial value of storage allocated by successive calls to an allocation function are **unspecified**.* " The keyword is **unspecified** and not **undefined**. And if I read this same doc correctly, **unspecified** == "left up to the implementation". So I would be hesitant to say that it's undefined behavior to read from an uninitialized buffer.... – selbie Sep 18 '18 at 04:08
  • … although it would be meaningless to do so unless the code was written with the expected implementation in mind. (e.g. a debug assert macro to validate that a buffer is initialized.) If you're coming to CppCon next week, we could all go up and ask Bjarne together what he thinks. – selbie Sep 18 '18 at 04:12
11

C++ has storage classes including global, automatic and static. The initialization depends on how the variable is declared.

char global[12];  // all 0
static char s_global[12]; // all 0

void foo()
{
   static char s_local[12]; // all 0
   char local[12]; // automatic storage variables are uninitialized, accessing before initialization is undefined behavior 
}

Some interesting details here.

Matthew Fisher
  • It's tiring to discuss this because misinformation abounds, but as far as the standard is concerned, `local` is not filled with random rubbish, it's filled with nasal demons. (Reading an uninitialized variable is complete UB.) – Arne Vogel Oct 02 '18 at 09:11
  • Updated to be more clear that automatic variables are undefined before initialization – Matthew Fisher Oct 02 '18 at 12:23
11

Consider your buffer, filled with zeroes:

[00][00][00][00][00][00][00][00][00][00][00][00]

Now, let's write 10 bytes to it. Values incrementing from 1:

[01][02][03][04][05][06][07][08][09][0A][00][00]

And now again, this time, 4 times 0xFF:

[FF][FF][FF][FF][05][06][07][08][09][0A][00][00]

what happens if there are less than 12 bytes used in a 12 byte buffer? Is it possible or does the empty trailing always fill with 0s?

You write as much as you want, the remaining bytes are left unchanged.
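
The three steps above can be replayed in C. This is only a sketch: the helper name `replay_writes` is made up for illustration, and the zero fill in step 1 is done explicitly here because nothing else guarantees it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replays the three steps on a caller-supplied
   12-byte buffer. */
void replay_writes(unsigned char buf[12])
{
    assert(buf != NULL);

    memset(buf, 0x00, 12);               /* step 1: explicit zero fill */

    for (size_t i = 0; i < 10; i++)
        buf[i] = (unsigned char)(i + 1); /* step 2: values 1..10 in bytes 0..9 */

    memset(buf, 0xFF, 4);                /* step 3: overwrite only bytes 0..3 */
}
```

After the call, bytes 4 to 9 still hold 5 to 10 from step 2 and bytes 10 and 11 still hold the zeros from step 1; no write ever touched them again.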

Orthogonal question that may help: what is contained in a buffer when it is instantiated but not used by the application yet?

Unspecified. Expect junk left by programs (or other parts of your program) that used this memory before.

I have looked at a few pet programs in Visual Studio and it seems that they are appended with 0s (or null characters) but I am not sure if this is a MS implementation that may vary across language/ compiler.

It is exactly what you think it is. Somebody did that for you this time, but there are no guarantees it will happen again. It could be a compiler flag that attaches cleaning code. Some versions of MSVC used to fill fresh memory with 0xCD when run in debug but not in release. It can also be a system security feature that wipes memory before giving it to your process (so you can't spy on other apps). Always remember to use memset to initialize your buffer where it matters. Alternatively, mandate a specific compiler flag in the README if you depend on a fresh buffer containing a certain value.

But cleaning is not really necessary. You take a 12 byte-long buffer. You fill it with 7 bytes. You then pass it somewhere - and you say "here is 7 bytes for you". The size of the buffer is not relevant when reading from it. You expect other functions to read as much as you've written, not as much as possible. In fact, in C it is usually not possible to tell how long the buffer is.
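
A rough illustration of that hand-off, with made-up function names (`sum_bytes` and `produce_and_sum` are not from any library): the consumer trusts the count it is given and never looks at the buffer's capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer: reads exactly `len` bytes, so anything stored
   past `len` in the caller's array is never touched. */
unsigned sum_bytes(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    unsigned total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return total;
}

/* Hypothetical producer: writes 7 bytes into a 12-byte buffer and
   passes along the count 7, not the capacity 12. */
unsigned produce_and_sum(void)
{
    unsigned char buffer[12];        /* bytes 7..11 stay uninitialized */
    size_t used = 7;
    memset(buffer, 1, used);         /* the 7 bytes actually written */
    return sum_bytes(buffer, used);  /* "here is 7 bytes for you" */
}
```

Because only the first 7 bytes are ever read, the uninitialized tail of `buffer` cannot affect the result.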

And a side note:

Understandably, going over a buffer errors out (or creates an overflow)

It doesn't, and that's the problem. That's why it's a huge security issue: there is no error and the program tries to continue, so it sometimes executes malicious content it was never meant to. So we had to add a bunch of mechanisms to the OS, like ASLR, that increase the probability of the program crashing and decrease the probability of it continuing with corrupted memory. So never depend on those afterthought guards; watch your buffer boundaries yourself.
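
One way to watch the boundaries yourself is to route every copy through a function that knows the destination's capacity. The helper below is a sketch with an invented name (`copy_checked`), not a standard function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies `n` bytes into `dst` only if they fit.
   Returns false and writes nothing rather than running past the end. */
bool copy_checked(unsigned char *dst, size_t dst_cap,
                  const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n > dst_cap)
        return false;    /* would overflow: refuse instead of corrupting memory */
    memcpy(dst, src, n);
    return true;
}
```

The caller has to check the return value; a plain `memcpy` with an oversized `n` would give no such signal.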

Agent_L
  • 1
    You might want to add these clarifications: Arrays with static duration are initialized to `0` before `main` is entered. Other arrays, either local variables with automatic storage or allocated from the heap with `malloc()`, have unspecified contents; reading these contents as bytes is OK, but has undefined behavior with most other types. Arrays allocated by `calloc()` are initialized to all bits zero which is subtly different from *initialized to `0`*. – chqrlie Sep 17 '18 at 15:50
4

The program knows the length of a string because it ends it with a null-terminator, a character of value zero.

This is why in order to fit a string in a buffer, the buffer has to be at least 1 character longer than the number of characters in the string, so that it can fit the string plus the null-terminator too.
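
As a small sketch of that "plus one" rule (the helper name `bytes_needed` is invented here): `strlen` counts the characters but not the terminator, so the required buffer size is one more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes a buffer must have to hold `s`,
   including the terminating '\0' that strlen() does not count. */
size_t bytes_needed(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

For "Selbie" this gives 7, so the string fits in a 12-byte buffer with 5 bytes left over.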

Any space after that in the buffer is left untouched. If there was data there previously, it is still there. This is what we call garbage.

It is wrong to assume this space is zero-filled just because you haven't used it yet; you don't know what that particular memory space was used for before your program got to that point. Uninitialized memory should be handled as if what is in it is random and unreliable.

Havenard
  • Same applies here as above: Reading uninitialized memory results in undefined behavior, it is not filled with "random" values. – Arne Vogel Sep 17 '18 at 11:04
  • @ArneVogel Oh it's not random at all, but it *should be handled as if what is in it is random and unreliable*. – Havenard Sep 17 '18 at 21:15
3

All of the previous answers are very good and very detailed, but the OP appears to be new to C programming. So, I thought a Real World example might be helpful.

Imagine you have a cardboard beverage holder that can hold six bottles. It's been sitting around in your garage so instead of six bottles, it contains various unsavory things that accumulate in the corners of garages: spiders, mouse houses, et al.

A computer buffer is a bit like this just after you allocate it. You can't really be sure what's in it, you just know how big it is.

Now, let's say you put four bottles in your holder. Your holder hasn't changed size, but you now know what's in four of the spaces. The other two spaces, complete with their questionable contents, are still there.

Computer buffers are the same way. That's why you frequently see a bufferSize variable to track how much of the buffer is in use. A better name might be numberOfBytesUsedInMyBuffer but programmers tend to be maddeningly terse.
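
Here is one possible shape for that bookkeeping, as a sketch only. The type `byte_buffer`, the constant `BUFFER_CAPACITY`, and the function `buffer_append` are all made-up names, and the code assumes `used` never exceeds the capacity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { BUFFER_CAPACITY = 12 };

/* Hypothetical type: `data` is the holder, `used` says how many slots
   at the front contain known bottles. */
struct byte_buffer {
    unsigned char data[BUFFER_CAPACITY];
    size_t used;
};

/* Appends n bytes if they fit; returns false and changes nothing otherwise. */
bool buffer_append(struct byte_buffer *b, const void *src, size_t n)
{
    assert(b != NULL && src != NULL);
    if (n > BUFFER_CAPACITY - b->used)
        return false;
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return true;
}
```

Only `used` has to be set to 0 before the first append; the contents of `data` beyond `used` are never read, so they can stay as they were.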

Doug Clutter
2

Writing part of a buffer will not affect the unwritten part of the buffer; it will contain whatever was there beforehand (which naturally depends entirely on how you got the buffer in the first place).

As the other answer notes, static and global variables will be initialized to 0, but local variables will not be initialized (and instead contain whatever was on the stack beforehand). This is in keeping with the zero-overhead principle: initializing local variables would, in some cases, be an unnecessary and unwanted run-time cost, while static and global variables are allocated at load-time as part of a data segment.

Initialization of heap storage is at the option of the memory manager, but in general it will not be initialized, either.

comingstorm
1

In general, it's not at all unusual for buffers to be underfull. It's often good practice to allocate buffers bigger than they need to be. (Trying to always compute an exact buffer size is a frequent source of error, and often a waste of time.)

When a buffer is bigger than it needs to be, when the buffer contains less data than its allocated size, it's obviously important to keep track of how much data is there. In general there are two ways of doing this: (1) with an explicit count, kept in a separate variable, or (2) with a "sentinel" value, such as the \0 character which marks the end of a string in C.
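
The two approaches can sit side by side. In the sketch below the explicit count is just a `size_t` variable kept by the caller, and the sentinel approach is a scan that is capped at the buffer's capacity; `length_by_sentinel` is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sentinel approach: look for '\0', but never past `cap` bytes.
   Returns `cap` if no sentinel is present. */
size_t length_by_sentinel(const char *buf, size_t cap)
{
    assert(buf != NULL);
    const char *end = memchr(buf, '\0', cap);
    return end ? (size_t)(end - buf) : cap;
}
```

With a buffer holding "Hey" followed by `\0`, an explicit count of 3 and `length_by_sentinel` agree; the count costs a variable, the sentinel costs a scan and one extra byte.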

But then there's the question, if not all of a buffer is in use, what do the unused entries contain?

One answer is, of course, that it doesn't matter. That's what "unused" means. You care about the values of the entries that are used, that are accounted for by your count or your sentinel value. You don't care about the unused values.

There are basically four situations in which you can predict the initial values of the unused entries in a buffer:

  1. When you allocate an array (including a character array) with static duration, all unused entries are initialized to 0.

  2. When you allocate an array and give it an explicit initializer, all unused entries are initialized to 0.

  3. When you call calloc, the allocated memory is initialized to all-bits-0.

  4. When you call strncpy, the destination string is padded out to size n with \0 characters.

In all other cases, the unused parts of a buffer are unpredictable, and generally contain whatever they did last time (whatever that means). In particular, you cannot predict the contents of an uninitialized array with automatic duration (that is, one that's local to a function and isn't declared with static), and you cannot predict the contents of memory obtained with malloc. (Some of the time, in those two cases the memory tends to start out as all-bits-zero the first time, but you definitely don't want to ever depend on this.)
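
The four predictable cases listed above can be checked with a short C sketch. The function names `all_zero` and `predictable_cases_hold` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static char static_buf[12];                    /* case 1: static duration */

/* Returns true if all n bytes at p are zero. */
static bool all_zero(const char *p, size_t n)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return false;
    return true;
}

/* Hypothetical check: true if the unused parts are zero in all four cases. */
bool predictable_cases_hold(void)
{
    char init_buf[12] = { 'H', 'i' };          /* case 2: explicit initializer */

    char copy_buf[12];
    strncpy(copy_buf, "Hi", sizeof copy_buf);  /* case 4: padded with '\0' up to n */

    char *heap_buf = calloc(12, 1);            /* case 3: all-bits-zero */
    if (heap_buf == NULL)
        return false;                          /* allocation failed, nothing to check */
    bool heap_ok = all_zero(heap_buf, 12);
    free(heap_buf);

    return all_zero(static_buf, sizeof static_buf)
        && all_zero(init_buf + 2, sizeof init_buf - 2)
        && all_zero(copy_buf + 2, sizeof copy_buf - 2)
        && heap_ok;
}
```

Swapping `calloc` for `malloc`, or dropping the initializer from `init_buf`, would make the corresponding check meaningless, since those bytes would then be indeterminate.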

Steve Summit
  • Good point about `strncpy`: I am tempted to upvote for teaching users about a lesser known side effect, but also to downvote for implicitly advocating the use of this error-prone function, too bad I cannot do both, so I shall do neither one. – chqrlie Sep 17 '18 at 15:53
1

It depends on the storage class specifier, your implementation, and its settings. Some interesting examples:

  • Uninitialized stack variables may be set to 0xCCCCCCCC
  • Uninitialized heap variables may be set to 0xCDCDCDCD
  • Uninitialized static or global variables may be set to 0x00000000
  • or it could be garbage.

It's risky to make any assumptions about any of this.

Tim Randall
1

Declared objects of static duration (those declared outside a function, or with a static qualifier) which have no specified initializer are initialized to whatever value would be represented by a literal zero [i.e. an integer zero, floating-point zero, or null pointer, as appropriate, or a structure or union containing such values]. If the declaration of any object (including those of automatic duration) includes an initializer, portions whose values are specified by that initializer will be set as specified, and the remainder will be zeroed as with static objects.
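
A brief sketch of both rules, using an invented `struct record` so that the "literal zero" shows up as an integer zero, a floating-point zero, and a null pointer. The function names are hypothetical as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical aggregate with members of three different kinds. */
struct record {
    int id;
    double weight;
    const char *label;
};

static struct record static_rec;       /* static duration, no initializer */

/* Automatic object with a partial initializer: only `id` is specified. */
struct record make_partial(void)
{
    struct record r = { .id = 42 };    /* weight and label are zeroed */
    return r;
}

/* Demonstration only: aborts if either rule does not hold. */
void check_zero_defaults(void)
{
    struct record r = make_partial();
    assert(r.id == 42);
    assert(r.weight == 0.0 && r.label == NULL);
    assert(static_rec.id == 0);
    assert(static_rec.weight == 0.0 && static_rec.label == NULL);
}
```

Dropping the `= { .id = 42 }` initializer from `r` would put it in the ambiguous category discussed next.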

For automatic objects without initializers, the situation is somewhat more ambiguous. Given something like:

#include <string.h>

unsigned char static1[5], static2[5];

void test(void)
{
  unsigned char temp[5];
  strcpy((char *)temp, "Hey");
  memcpy(static1, temp, 5);
  memcpy(static2, temp, 5);
}

the Standard is clear that test would not invoke Undefined Behavior, even though it copies portions of temp that were not initialized. The text of the Standard, at least as of C11, is unclear as to whether anything is guaranteed about the values of static1[4] and static2[4], most notably whether they might be left holding different values. A defect report states that the Standard was not intended to forbid a compiler from behaving as though the code had been:

unsigned char static1[5]={1,1,1,1,1}, static2[5]={2,2,2,2,2};

void test(void)
{
  unsigned char temp[4];
  strcpy((char *)temp, "Hey");
  memcpy(static1, temp, 4);
  memcpy(static2, temp, 4);
}

which could leave static1[4] and static2[4] holding different values. The Standard is silent on whether quality compilers intended for various purposes should behave in that fashion. The Standard also offers no guidance as to how the function should be written if the programmer requires that static1[4] and static2[4] hold the same value, but doesn't care what that value is.

chqrlie
supercat
1

I think the correct answer is that you should always keep track of how many chars have been written. Low-level functions like read and write likewise take or return the number of characters read or written. In the same way, std::string keeps track of the number of characters in its implementation.

izulh