Bjarne Stroustrup wrote in The C++ Programming Language:

The unsigned integer types are ideal for uses that treat storage as a bit array. Using an unsigned instead of an int to gain one more bit to represent positive integers is almost never a good idea. Attempts to ensure that some values are positive by declaring variables unsigned will typically be defeated by the implicit conversion rules.

size_t seems to be unsigned "to gain one more bit to represent positive integers". So was this a mistake (or trade-off), and if so, should we minimize use of it in our own code?

Another relevant article is Signed and Unsigned Types in Interfaces by Scott Meyers. To summarize, he recommends not using unsigned integers in interfaces, regardless of whether the value is always positive or not. In other words, even if negative values make no sense, you shouldn't necessarily use unsigned.
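
To make Stroustrup's point concrete, here is a minimal sketch (the function name reserve_bytes is invented for illustration) of how an unsigned parameter defeats an attempt to ensure positivity:

#include <cstddef>
#include <iostream>

// Hypothetical interface: an unsigned parameter cannot reject negative input,
// because the argument is implicitly converted before the call.
void reserve_bytes(std::size_t n) {
    std::cout << "asked to reserve " << n << " bytes\n";
}

int main() {
    int computed = -1;        // e.g. the result of a buggy calculation
    reserve_bytes(computed);  // compiles anyway; on a typical 64-bit platform
                              // it prints 18446744073709551615 rather than
                              // being rejected as negative
}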

Jon
  • Why would it be a "mistake" to make it unsigned? – Nicol Bolas Apr 16 '12 at 02:33
  • Similar: http://stackoverflow.com/q/237370/1026459 – Travis J Apr 16 '12 at 02:36
  • @Nicol: Because it's an unsigned that's used in interfaces, which Meyers recommends against, and Stroustrup seems to be saying it's not a good idea in the quote above. – Jon Apr 16 '12 at 02:55
  • I'm not sure this question is fit for Stack Overflow, as I don't think there is a definitive answer. All I can say is that I agree with you. But apparently, 12 people agree with geekosaur. – Benjamin Lindley Apr 16 '12 at 03:55
  • Alf's answer looks like it might be correct. People tend to reason that since size_t is both standard and unsigned, they should use size_t or unsigned types in their own code. If the answer is something like "size_t is unsigned for historical reasons", that justification weakens a bit. – Jon Apr 16 '12 at 04:34
  • Do note that Stroustrup didn't create C. And in the early days, space/performance optimizations were very important, or most people would never have stopped coding in assembly. – dbrank0 Apr 16 '12 at 07:42
  • Scott Meyers' article is from the 1990s. Does it still apply to C++11/14? – Zhen Jan 16 '15 at 09:44
  • A relevant quote from Herb Sutter https://youtu.be/Puio5dly9N8?t=2660 : "Use int unless you need something different, then still use something signed until you really need something different, then resort to unsigned. And yes, it's unfortunately a mistake in the STL and the standard library that we use unsigned indices." – Jon Jul 23 '15 at 09:58
  • I'd argue Meyers contradicts himself in that article. He writes: "Well-designed classes are easy to use correctly and hard to use incorrectly". Well, if a function takes a signed int but can only accept a positive value, then the function is *easy* to use incorrectly, because its parameter type is telling the user that signed (and thus negative) values are welcome. But they are not--they will unconditionally lead to undesired behavior. Thus the function is easy to use incorrectly. – codesniffer Dec 16 '18 at 12:46

4 Answers

size_t is unsigned for historical reasons.

On an architecture with 16-bit pointers, such as the "small" model in DOS programming, it would have been impractical to limit strings to 32 KB: an unsigned 16-bit size_t can describe objects up to 64 KB, while a signed one would halve that.

For this reason, the C standard requires (via required ranges) ptrdiff_t, the signed counterpart to size_t and the result type of pointer difference, to be effectively 17 bits: a signed type needs at least 17 bits to represent every value of a 16-bit unsigned size_t.

Those reasons can still apply in parts of the embedded programming world.

However, they do not apply to modern 32-bit or 64-bit programming, where a much more important consideration is that the unfortunate implicit conversion rules of C and C++ make unsigned types into bug attractors when they're used for numbers (and hence for arithmetic operations and magnitude comparisons). With 20-20 hindsight we can now see that the decision to adopt those particular conversion rules, where e.g. string( "Hi" ).length() < -3 is practically guaranteed to be true, was rather silly and impractical. But given that decision, in modern programming adopting unsigned types for numbers has severe disadvantages and no advantages – except for satisfying the feelings of those who find unsigned to be a self-descriptive type name, and who fail to think of typedef int MyType.
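
Here is a minimal sketch of that comparison (assuming a typical 64-bit platform, where std::size_t has 64 bits):

#include <iostream>
#include <string>

int main() {
    // length() returns std::size_t, so the usual arithmetic conversions
    // turn -3 into 18446744073709551613 before the comparison,
    // and 2 is certainly less than that.
    std::cout << std::boolalpha
              << (std::string("Hi").length() < -3) << '\n';  // prints true
}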

Summing up, it was not a mistake. It was a decision for then very rational, practical programming reasons. It had nothing to do with transferring expectations from bounds-checked languages like Pascal to C++ (which is a fallacy, but a very very common one, even if some of those who do it have never heard of Pascal).

Cheers and hth. - Alf
  • I don't agree with the "bug attractors" part. C(++) is not the kind of language one should write in carelessly, making assumptions before reading and understanding a good detailed book on the language or the language standard itself. I don't think ignorance is a valid excuse for blaming a language feature. It's there; one must deal with it, whether they want to or not, if they use the language. There are more things about C(++) and other programming languages that are broken. Take floating point, for example. Many start using it with all kinds of assumptions that are only valid in normal math. Is FP a mistake? – Alexey Frunze Apr 17 '12 at 08:41
  • @Alex: I understand your feelings. Yet, the reason that we have strong type checking in C++, to the degree possible while keeping C compatibility, is that humans are fallible. There is even a very well known name for things going wrong when you just make it possible. – Cheers and hth. - Alf Apr 17 '12 at 11:24
  • All good compilers give a warning for `string( "Hi" ).length() < -3`, but not for comparisons between two signed ints; your life wouldn't become easier had `size_t` been defined as signed, you would just make different kinds of errors. – Lie Ryan Oct 06 '12 at 15:47
  • I'd say Java made a mistake by not including unsigned types: it makes things like parsing `0xffffffff` or `0xffffffffffffffff` harder/slower, as is working with unsigned values from the net. They had to introduce functions to support unsigned operations in Java 8. – phuclv Aug 06 '15 at 05:30
  • This is very much a problem on 32-bit systems, too. You don't want to be limited to a 2GB size_t when you can address up to 4GB. – rustyx Jan 29 '17 at 11:58
  • @RustyX: It's not a problem. Only a single `char` array greater than 2GB is ruled out by using a 32-bit signed `ptrdiff_t`. When that's been pointed out, some people have said that they often use such largish (relative to address space) `char` arrays. I don't believe them. Anyway, most 32-bit Windows programs are limited to 2GB. That worked well for a long time. – Cheers and hth. - Alf Jan 29 '17 at 12:07
  • @Cheersandhth.-Alf You can disbelieve all you want, but that doesn't make it false. For one thing, it doesn't need to be an array; other data structures also use `size_t`, and if the metadata:data ratio is small enough, they can also be affected. Anybody who wants to read a file from disk, when the file might be greater than 2 GB (as a lot of files might be, especially if they aren't plain text), wants more than 31 unsigned bits. – Daniel H May 26 '17 at 17:06
  • @DanielH and in how many cases would someone want to load a >2 GB file into an array of `char`s and sit there indexing into it? Would they not, instead, use streams, or pointers or iterators, or block-based loading/processing, or basically anything except array indexing? Note that I'm still not quite convinced by the '_`unsigned` is bad_' argument personally, but nor can I see how yours is a very useful counterargument against it. – underscore_d May 26 '17 at 21:37
  • @underscore_d There are a lot of things you can do if you memory-map a large file, such that it is visible in the program's address space without actually all being read into RAM. This can be useful for anything from processing sound or video files to implementing a tool like `grep`, and is often the simplest method of loading data. This isn't an argument against `signed`, but at least use `ptrdiff_t` so you really do only give up one bit. – Daniel H May 26 '17 at 23:47
  • Where does "17 bits" come from? – aschepler Jun 25 '17 at 14:39
  • @aschepler: Formally, via the required ranges. Regarding rationale, you need at least 17 bits for the signed type (`ptrdiff_t`) in order to directly represent all values of a 16-bit unsigned type (`size_t`). With 32-bit systems and higher it's no longer important to represent all values. – Cheers and hth. - Alf Jun 26 '17 at 17:21
  • Saying it's an accident of history is a cop-out. You acknowledge that embedded programming is still a thing, but also suggest that it's not modern, which is amusing given the renaissance brought by Arduino and other platforms making embedded 16-bit processors more popular and accessible. – Adrian McCarthy Aug 29 '17 at 18:52
  • @AdrianMcCarthy: 9.3% of respondents to SO's 2017 developer survey reported that they're doing embedded work. That's a lot, and very far from being "not modern". I doubt that Arduino development contributes significantly there. Regarding "accident of history", as you write, it was not. But what made sense back then doesn't make so much sense today. Possibly if the language were designed today there would be solutions that wouldn't add costs to desktop development just to support embedded: it's not necessarily either/or. – Cheers and hth. - Alf Aug 29 '17 at 19:55
  • I was attempting to refute _your_ claim that it's unsigned "for historical reasons." I don't see how `size_t` being unsigned adds costs for desktop development (I say this primarily as a desktop developer who also does embedded development). – Adrian McCarthy Aug 29 '17 at 20:58
  • Agree with @AlexeyFrunze and others here. I realize Stroustrup, Sutter, and Meyers are big names in C/C++, but let's keep in mind this guidance is from the 90s, when things were *very* different. In essence their guidance to use signed types is simply to help detect other problems. I'd argue this is the worst that can be done. If a programmer is making mistakes or using ill practices, no amount of checking in the library will help. Let their code fail and they will learn (possibly even that they're not suited to program in C/C++!). C/C++ gives plenty of rope to hang oneself. – codesniffer Dec 16 '18 at 12:55

size_t is unsigned because negative sizes make no sense.

(From the comments:)

It's not so much ensuring, as stating what is. When is the last time you saw a list of size -1? Follow that logic too far and you find that unsigned should not exist at all and bit operations shouldn't be permitted either. – geekosaur

More to the point: addresses, for reasons you should think about, are not signed. Sizes are generated by comparing addresses; treating an address as signed will do very much the wrong thing, and using a signed value for the result will lose data in a way that your reading of the Stroustrup quote evidently thinks is acceptable, but in fact is not. Perhaps you can explain what a negative address should do instead. – geekosaur

geekosaur
  • Isn't that exactly what Stroustrup was addressing when writing "Attempts to ensure that some values are positive by declaring variables unsigned..."? – Jon Apr 16 '12 at 02:44
  • It's not so much ensuring, as stating what *is*. When is the last time you saw a list of size -1? Follow that logic too far and you find that `unsigned` should not exist at all and bit operations shouldn't be permitted either. – geekosaur Apr 16 '12 at 02:50
  • More to the point: addresses, for reasons you should think about, are not signed. Sizes are generated by comparing addresses; treating an address as signed will do very much the wrong thing, and using a signed value for the result will lose data in a way that your reading of the Stroustrup quote evidently thinks is acceptable, but in fact is not. Perhaps you can explain what a negative address should do instead. – geekosaur Apr 16 '12 at 02:56
  • Stroustrup's (and Meyers's) point is that just because a value can never be negative doesn't mean you should make it unsigned. For one, you can no longer detect erroneous negative values passed in interfaces (which are implicitly converted). – Jon Apr 16 '12 at 02:58
  • Again, please demonstrate how to apply this to machine addresses, which are the fundamental reason that `size_t` is `unsigned`. – geekosaur Apr 16 '12 at 03:02
  • Shouldn't that be your answer (size_t exists to compare addresses), rather than "negative sizes make no sense"? The latter seems to be in contradiction to what Stroustrup and Meyers stated. – Jon Apr 16 '12 at 03:15
  • @Jon: "you can no longer detect erroneous negative values" Nonsense. The C++ specification may say that the conversion is fine, but any compiler worth its salt will issue a warning. If you don't fix it, don't complain to the compiler about your screw-up. – Nicol Bolas Apr 16 '12 at 03:16
  • @Nicol, I'm talking about runtime errors. A compiler can't detect those. (See the Meyers link.) – Jon Apr 16 '12 at 03:25
  • *"Follow that logic too far and you find that unsigned should not exist at all"* -- Maybe they shouldn't; they're mostly useless. At the very least, they should be completely avoided unless absolutely necessary. This is not a case where they are necessary. *"and bit operations shouldn't be permitted either"* -- I don't follow your logic here. – Benjamin Lindley Apr 16 '12 at 03:29
  • @Jon: The warning lets you know that the possibility of a runtime error exists and should be fixed. Again, if you fix it (by either making the function take a signed int, or by making sure that negative values cannot be passed in), there's no problem. And if you don't fix it, if you just do a cast to shut the compiler up, then you deserve what you get. – Nicol Bolas Apr 16 '12 at 03:31
  • @geekosaur: Following this answer's logic too far, you get "int dogs = 3;" vs "unsigned int dogs = 3;" Is the signed version wrong because negative dogs make no sense? – Jon Apr 16 '12 at 03:35
  • @NicolBolas: My compiler gives no warning here: `size_t x = 0; for(size_t i=10; i>=x; --i) {}` -- Does yours? – Benjamin Lindley Apr 16 '12 at 03:37
  • Benjamin: "maybe they should not...": At least the authors of Java seem to agree with this. :) – dbrank0 Apr 16 '12 at 07:39
  • "_It's not so much ensuring, as stating what is._" But `sizeof` something is **not** un-signed. It is a **positive** integer. It has a sign. – curiousguy Aug 15 '12 at 01:54
  • @BenjaminLindley: ""and bit operations shouldn't be permitted either" -- I don't follow your logic here". I guess the logic is that since bitwise operations on signed types in C++ are a bad idea, then if you also say unsigned types are a bad idea you're left without any "good" way to do bitwise ops. Using signed types for bitwise ops is all very well in Java, where there's only one permitted representation for negative values. In C++ there are more. Perhaps unsigned types could have been left out if not for non-2's-complement. As it is, bitwise ops are a necessary use of unsigned types. – Steve Jessop Jan 17 '13 at 09:32
  • Just as an example, it is implementation-defined whether or not `~0` has defined behavior. In a ones' complement implementation that doesn't support negative zero, it generates a trap value. So non-2's-complement is a PITA as far as bitwise ops are concerned. – Steve Jessop Jan 17 '13 at 09:36
  • Re "using a signed value for the result [of comparing addresses] will lose data": the result of subtracting one pointer from another is of type `ptrdiff_t`, which *is* a signed integer type. – Cheers and hth. - Alf Jan 29 '17 at 12:15
  • Nit: There have been platforms with signed address spaces. I haven't worked with them, but I think the Inmos Transputer was one. – Adrian McCarthy Aug 29 '17 at 19:00

A reason for making index types unsigned is for symmetry with C and C++'s preference for half-open intervals. And if your index types are going to be unsigned, then it's convenient to also have your size type unsigned.


In C, you can have a pointer that points into an array. A valid pointer can point to any element of the array or one element past the end of the array. It cannot point to one element before the beginning of the array.

int a[2] = { 0, 1 };
int * p = a;  // OK
++p;  // OK, points to the second element
++p;  // Still OK, but you cannot dereference this one.
++p;  // Nope, now you've gone too far.
p = a;
--p;  // oops!  not allowed

C++ agrees and extends this idea to iterators.

Arguments against unsigned index types often trot out an example of traversing an array from back to front, and the code often looks like this:

// WARNING:  Possibly dangerous code.
int a[size] = ...;
for (index_type i = size - 1; i >= 0; --i) { ... }

This code works only if index_type is signed, which is used as an argument that index types should be signed (and that, by extension, sizes should be signed).
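
To see why, here is a minimal sketch of the failure mode when index_type is unsigned (std::size_t stands in for index_type here):

#include <cstddef>

int main() {
    const std::size_t size = 3;
    int a[size] = { 0, 1, 2 };
    // For an unsigned i, the condition i >= 0 is always true: when i
    // reaches 0, --i wraps around to SIZE_MAX, so the loop never
    // terminates and then indexes far out of bounds (undefined behavior).
    // Many compilers warn that this comparison is always true.
    for (std::size_t i = size - 1; i >= 0; --i) {
        (void)a[i];
    }
}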

That argument is unpersuasive because that code is non-idiomatic. Watch what happens if we try to rewrite this loop with pointers instead of indices:

// WARNING:  Bad code.
int a[size] = ...;
for (int * p = a + size - 1; p >= a; --p) { ... }

Yikes, now we have undefined behavior! Ignoring the problem when size is 0, we have a problem at the end of the iteration because we generate an invalid pointer that points to the element before the first. That's undefined behavior even if we never try to dereference that pointer.

So you could argue to fix this by changing the language standard to make it legit to have a pointer that points to the element before the first, but that's not likely to happen. The half-open interval is a fundamental building block of these languages, so let's write better code instead.

A correct pointer-based solution is:

int a[size] = ...;
for (int * p = a + size; p != a; ) {
  --p;
  ...
}

Many find this disturbing because the decrement is now in the body of the loop instead of in the header, but that's what happens when your for-syntax is designed primarily for forward loops through half-open intervals. (Reverse iterators solve this asymmetry by postponing the decrement.)

Now, by analogy, the index-based solution becomes:

int a[size] = ...;
for (index_type i = size; i != 0; ) {
  --i;
  ...
}

This works whether index_type is signed or unsigned, but the unsigned choice yields code that maps more directly to the idiomatic pointer and iterator versions. Unsigned also means that, as with pointers and iterators, we'll be able to access every element of the sequence--we don't surrender half of our possible range in order to represent nonsensical values. While that's not a practical concern in a 64-bit world, it can be a very real concern on a 16-bit embedded processor, or when building an abstract container type for sparse data over a massive range that still provides the same API as a native container.

Adrian McCarthy

On the other hand ...

Myth 1: std::size_t is unsigned because of legacy restrictions that no longer apply.

There are two "historical" reasons commonly referred to here:

  1. sizeof returns std::size_t, which has been unsigned since the days of C.
  2. Processors had smaller word sizes, so it was important to squeeze that extra bit of range out.

But neither of these reasons, despite being very old, is actually relegated to history.

sizeof still returns a std::size_t which is still unsigned. If you want to interoperate with sizeof or the standard library containers, you're going to have to use std::size_t.

The alternatives are all worse: You could disable signed/unsigned comparison warnings and size conversion warnings and hope that the values will always be in the overlapping ranges, so that you can ignore the latent bugs that using different types could potentially introduce. Or you could do a lot of range-checking and explicit conversions. Or you could introduce your own size type with clever built-in conversions to centralize the range checking, but no other library is going to use your size type.
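
As a sketch of the "centralize the range checking" option (the helper name checked_index is invented here, and asserting is just one possible policy):

#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: converts an unsigned size to a signed index,
// putting the range check in exactly one place.
inline std::ptrdiff_t checked_index(std::size_t n) {
    assert(n <= static_cast<std::size_t>(
                    std::numeric_limits<std::ptrdiff_t>::max()));
    return static_cast<std::ptrdiff_t>(n);
}

int main() {
    std::vector<int> v(10);
    for (std::ptrdiff_t i = 0; i != checked_index(v.size()); ++i) {
        v[static_cast<std::size_t>(i)] = static_cast<int>(i);
    }
}

Note that even with the helper, the subscript still needs a cast back to std::size_t; the friction never fully disappears.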

And while most mainstream computing is done on 32- and 64-bit processors, C++ is still used on 16-bit microprocessors in embedded systems, even today. On those microprocessors, it's often very useful to have a word-sized value that can represent any value in your memory space.

Our new code still has to interoperate with the standard library. If our new code used signed types while the standard library continues to use unsigned ones, we make it harder for every consumer that has to use both.

Myth 2: You don't need that extra bit. (A.K.A., You're never going to have a string larger than 2GB when your address space is only 4GB.)

Sizes and indexes aren't just for memory. Your address space may be limited, but you might process files that are much larger than your address space. And while you might not have a string with more than 2 GB, you could comfortably have a bitset with more than 2 Gbits. And don't forget virtual containers designed for sparse data.

Myth 3: You can always use a wider signed type.

Not always. It's true that for a local variable or two, you could use a std::int64_t (assuming your system has one) or a signed long long and probably write perfectly reasonable code. (But you're still going to need some explicit casts and twice as much bounds checking or you'll have to disable some compiler warnings that might've alerted you to bugs elsewhere in your code.)

But what if you're building a large table of indices? Do you really want an extra two or four bytes for every index when you need just one bit? Even if you have plenty of memory and a modern processor, making that table twice as large could have deleterious effects on locality of reference, and all your range checks are now two-steps, reducing the effectiveness of branch prediction. And what if you don't have all that memory?

Myth 4: Unsigned arithmetic is surprising and unnatural.

This implies that signed arithmetic is not surprising or somehow more natural. And perhaps it is, when thinking in terms of mathematics, where all the basic arithmetic operations are closed over the set of all integers.

But our computers don't work with integers. They work with an infinitesimal fraction of the integers. Our signed arithmetic is not closed over the set of all integers. We have overflow and underflow. To many, that's so surprising and unnatural, they mostly just ignore it.

This is a bug:

auto mid = (min + max) / 2;  // BUGGY

If min and max are signed, the sum could overflow, and that yields undefined behavior. Most of us routinely miss these kinds of bugs because we forget that addition is not closed over the set of signed ints. We get away with it because our compilers typically generate code that does something reasonable (but still surprising).

If min and max are unsigned, the sum could still overflow, but the undefined behavior is gone. You'll still get the wrong answer, so it's still surprising, but not any more surprising than it was with signed ints.
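
A common overflow-free rewrite, assuming min <= max, computes the midpoint from the difference instead of the sum; a minimal sketch:

#include <iostream>

int main() {
    int min = 2000000000, max = 2100000000;
    // (min + max) / 2 would overflow where int is 32 bits wide: undefined
    // behavior. max - min always fits when min <= max, so this stays in range.
    int mid = min + (max - min) / 2;
    std::cout << mid << '\n';  // prints 2050000000
}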

The real unsigned surprise comes with subtraction: If you subtract a larger unsigned int from a smaller one, you're going to end up with a big number. This result isn't any more surprising than if you divided by 0.
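
A minimal demonstration of that wraparound:

#include <cstddef>
#include <iostream>

int main() {
    std::size_t a = 2, b = 3;
    // Unsigned arithmetic is modular: 2 - 3 wraps around to the maximum
    // value, 18446744073709551615 where size_t has 64 bits.
    std::cout << a - b << '\n';
}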

Even if you could eliminate unsigned types from all your APIs, you still have to be prepared for these unsigned "surprises" if you deal with the standard containers or file formats or wire protocols. Is it really worth adding friction to your APIs to "solve" only part of the problem?

Adrian McCarthy
  • "*And don't forget virtual containers designed for sparse data.*" And such containers will use a size/index type that's big enough for the data they can store. On a 32-bit system, they should still use 64-bit integers. Just like file APIs long since stopped using `int` for file sizes. Even C++17's Filesystem API doesn't rely on `size_t` for file sizes; it uses a `uintmax_t`. So that's still not a legitimate reason for `size_t` to be unsigned. – Nicol Bolas Jun 25 '17 at 14:45
  • "*Do you really want an extra two or four bytes for every index when you need just one bit?*" How do I know that I only need one bit? If I truly know that my indices will *never* be larger than some size, then I can use an appropriate type. But if I have a table that needs to store any index which can appear in that table, then it needs to be able to store *any index*. Premature optimizations are premature. – Nicol Bolas Jun 25 '17 at 14:46
  • @Nicol Bolas: The virtual containers example is specifically to counter the argument often made by the never-unsigned camp: that you'll never have a container with indexes that cover half of memory. – Adrian McCarthy Jun 26 '17 at 17:37
  • @Nicol Bolas: "If I truly know that my indices will never be larger than some size, then I can use an appropriate type." Correct, and sometimes that appropriate type is unsigned. – Adrian McCarthy Jun 26 '17 at 17:38
  • "*that you'll never have a container with indexes that cover half of memory.*" But that's not the argument. The argument is that you will never have such a container without *knowing* that you are writing such a container. It will never be `vector` or `deque` or whatever; it will always be a specific data structure that is explicitly designed to be gigantic in size. And therefore, you will use an index type that is appropriate to your container's expected size. – Nicol Bolas Jun 26 '17 at 18:20
  • It won't be a vector or a deque, but it might want to provide a compatible API. – Adrian McCarthy Jun 26 '17 at 20:03
  • The votes for this newer answer again show that the C++ community is divided into two parties of approximately equal size: those who say _always use signed integers_ and those (like me) who say _use unsigned integers unless the value has a reason to be negative_. – prapin Sep 01 '23 at 10:23