
I was looking at std::string::max_size and noticed the example:

#include <iostream>
#include <string>

int main ()
{
  std::string str ("Test string");
  std::cout << "max_size: " << str.max_size() << "\n";
  return 0;
}

with the output:

max_size: 4294967291

However, I always thought this limitation is due to the max value of an unsigned integer / size_t - so I kind of expected it to be 2^32 - 1 which would be 4294967295. Why is the max size in this example not using those 4 bytes?

I also tried to run the sample code, and on that machine it was 2^62 - which again confused me, why wouldn't it be 2^64 - 1 instead?

In general I am wondering, for what reasons would an implementation not use all the space?

Julius
  • Implementation dependent, have you tried with different compilers and standard libraries? – Matthieu Brucher Feb 05 '19 at 14:21
  • I'd conjecture it has something to do with the fact that (1) `std::string` allows NUL characters, and (2) `size()` has to be O(1), so the length has to be part of its payload, and that would take 4 bytes for a string of that maximum theoretical length. Alternatively it might be a pointer used for short string optimisation. – Bathsheba Feb 05 '19 at 14:24
  • And clang answers `18446744073709551599`. So ask the different standard library implementers. You will get a better answer. – Matthieu Brucher Feb 05 '19 at 14:26
  • [std::string::max_size() will tell you the theoretical limit imposed by the architecture your program is running under](https://stackoverflow.com/a/3649675/10147399) Meaning it has no problem with growing as long as you have enough RAM – Aykhan Hagverdili Feb 05 '19 at 14:29
  • @Ayxan actually I disagree with that answer. The limit is not imposed by the architecture, but by the library implementation. Sometimes `std::string::max_size()` exceeds available memory, while it is quite possible that on some architectures you have more memory available than the limit given by `max_size` – 463035818_is_not_an_ai Feb 05 '19 at 14:37
  • Maybe practicality and optimization purposes in terms of class sizes? Perhaps, if you have a string half as long as that... You probably wanted a `std::vector` – WhiZTiM Feb 05 '19 at 14:42
  • @user463035818 Yes, it is implementation dependent, but usually it is all the available RAM, as most of the answers [here](https://stackoverflow.com/questions/3649639/limit-on-string-size-in-c) agree. – Aykhan Hagverdili Feb 05 '19 at 14:43
  • @Ayxan what those answers agree on is that the practical limit of a string's size is your available RAM (this is the practical limit because typically `string::max_size` exceeds your RAM by far). The only answer on that Q/A that mentions `string::max_size` is the one you first linked, and the statement it makes is, strictly speaking, wrong – 463035818_is_not_an_ai Feb 05 '19 at 14:47
  • @user463035818 Okay, then you're right. Thanks for correcting me, but I have to say that statements like "[The size of a string is only limited by the amount of memory available to the program, it is more of a operating system limitation than a C++ limitation](https://stackoverflow.com/a/3649691/10147399)" are very misleading indeed. – Aykhan Hagverdili Feb 05 '19 at 14:56
  • @Ayxan not sure why you think it is misleading. The practical maximum size of a string depends on your available resources, something the language should not be concerned about too much, while `std::string::max_size` is a limitation of the language due to the type that is used for that index (minus some reserved values, see the answer) – 463035818_is_not_an_ai Feb 05 '19 at 15:04
  • Looking at old libstdc++ code, it explains a fancy computation for max_size, followed by "In addition, this implementation quarters this amount." without any justification... – Marc Glisse Feb 05 '19 at 18:48

2 Answers


One of the indices, the largest representable to be more specific, is reserved for the std::string::npos value, which represents a "not found" result in some string functions. Furthermore, the strings are internally null terminated, so one position must be reserved for the null termination character.

This brings us to a theoretical maximum of radix^bits - 3 that the standard library could provide (unless those reserved positions could share the same value; I'm not 100% sure that would be impossible). Presumably the implementation has chosen to reserve two more indices for internal use (or I've missed some necessarily reserved position). One potential use for such a reserved index might be an overflow trap that detects out-of-bounds accesses.

From a practical point of view: std::string::size_type is usually the same width as the address space, and under that assumption it's not practically possible to use the entire address space for a single string anyway. As such, the number reported by the library is usually not achievable; it is just an upper bound set by the standard library implementation, and the actual size limit of a string is subject to limitations from other sources, most often the amount of available RAM.
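To make the "reserved values" idea concrete, here is a minimal sketch that counts how many size_type values lie strictly above a reported max_size(). The helper names `values_above` and `gap_on_this_platform` are made up for illustration and are not part of any library; the only library facts relied on are that `std::string::npos` is the largest `size_type` value and that `max_size()` is a member function.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper: how many values of a `bits`-wide unsigned size_type
// are strictly greater than a reported max_size(). The count includes npos,
// which is the largest representable value.
constexpr std::uint64_t values_above(std::uint64_t reported_max, unsigned bits) {
    return ((bits >= 64)
                ? std::numeric_limits<std::uint64_t>::max()
                : (std::uint64_t{1} << bits) - 1)   // largest value, i.e. npos
           - reported_max;
}

// The same gap, measured against whatever standard library this is built with.
inline std::string::size_type gap_on_this_platform() {
    return std::string::npos - std::string().max_size();
}
```

For the 32-bit figure from the question, `values_above(4294967291ULL, 32)` gives 4 (npos plus three more values), and for the clang figure quoted in the comments, `values_above(18446744073709551599ULL, 64)` gives 16. What `gap_on_this_platform()` returns depends entirely on the implementation.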

eerorika
  • I think they are lying actually; on a platform where the stack and heap are in the same address space, you need `sizeof(std::string)` to store the string itself. – Yakk - Adam Nevraumont Feb 05 '19 at 15:07

In addition to what eerorika wrote…

  • Strings can (and in multiple cases do) use "strange" layouts. E.g., prior to GCC 5's C++11-conformant string implementation, a std::string was implemented as a single pointer to a heap block(1). The block held the character data (and possibly a NUL terminator) starting at the pointed-to address, and that character data was preceded by the size, the capacity and a reference count (for copy-on-write, aka COW).
  • In general, there's only one way to know what the specific implementation is doing – looking at its source code.
  • Implementations are required to provide max_size() and incentivized to make max_size appear large enough for practical purposes. However, they often provide values that are unrealistically large. E.g., even the 2^32-5 figure seems absurd from a practical perspective on a 32-bit flat memory model, because it would assume that the entire rest of the program takes up 4 bytes or less (with one byte allotted for the string's NUL terminator). The 2^62 figure on AMD64 is equally absurd because even a hypothetical fully implemented long mode – i.e. requiring a future CPU – will "only" support 2^52 distinct physical addresses (technically, swapping or RAM compression could work, but is this really the intent?). BTW, the reason 2^62 may have been chosen as opposed to, let's say, 2^64 minus some small integer, is that the implementers at least realized that the kernel will always reserve part of the virtual address space for its own purposes.
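As a rough sketch of how a header-before-data layout like the one in the first bullet can feed into max_size, the snippet below models the header and the "fancy computation ... quarters this amount" that Marc Glisse's comment mentions. `RepHeader` and `cow_max_size` are hypothetical names, and the formula is reconstructed from my recollection of the old libstdc++ source comment, so treat it as an assumption and check the actual source before relying on it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of the bookkeeping stored in front of the characters
// in a copy-on-write string's heap block (field names are assumptions).
struct RepHeader {
    std::size_t length;
    std::size_t capacity;
    int refcount;
};

// Assumed computation: take the largest size_type value (npos), subtract the
// header, divide by the character size, leave room for one terminator, and
// then quarter the result.
constexpr std::uint64_t cow_max_size(std::uint64_t npos,
                                     std::uint64_t header_bytes,
                                     std::uint64_t char_bytes) {
    return ((npos - header_bytes) / char_bytes - 1) / 4;
}
```

With a 64-bit npos, a 24-byte header (what `sizeof(RepHeader)` typically is on a 64-bit target) and 1-byte characters, this yields 4611686018427387897, which is just under 2^62. With a 32-bit npos and a 12-byte header it yields 1073741820. If that formula applies to the machine in the question, the quartering alone would account for a value near 2^62.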

Long story short… they have to provide a value, so they do, but they don't care enough to make it accurate and meaningful. At least you can assume that strings longer than max_size() are definitely impossible.
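The one guarantee mentioned above can be checked directly: the standard requires `reserve` to throw `std::length_error` when asked for more than `max_size()`. A minimal sketch, with `reserve_beyond_max_throws` being a made-up helper name:

```cpp
#include <stdexcept>
#include <string>

// Returns true if asking for one character more than max_size() is rejected
// with std::length_error, without any memory being requested from the system.
inline bool reserve_beyond_max_throws() {
    std::string s;
    try {
        s.reserve(s.max_size() + 1);   // max_size() is below npos, so +1 does not wrap
    } catch (const std::length_error&) {
        return true;
    }
    return false;
}
```

Calling this from `main` should return true on a conforming implementation. Requests at or below `max_size()` can still fail, typically with `std::bad_alloc`, which is the practical limit discussed in the comments.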

(1): Well, commonly – the statically allocated empty string being the physically tiny but conceptually big exception.

Arne Vogel