Internal Structure of Class string in C++

Question

What is the underlying structure of std::string in C++ ?

As far as I know, there are two different concepts:

1) Whole string is implemented with a char pointer (char*).

2) Some parts of the string are implemented with a static array. Its size is equal to 40, and if length of the string exceeds 40, dynamic memory is allocated.

Which one is correct ?

The good thing with STL is that it is based on headers, so you have the source and can check yourself how it is implemented on *this specific implementation*. — Werner Henze, Dec 06 '18 at 14:57
Be aware that `static` in C++ does not mean what you probably think it does - if there was a `static` array there would only be one instance of the array, shared between all strings. This obviously would not work. — , Dec 06 '18 at 15:01
Depends on implementation. Libstdc++ and Microsoft uses a static buffer 16 bytes long for short strings, which is overlapped with capacity info. Size of `std::string` is then 32 bytes (16 + string size + data pointer). Libc++ "unions" all string size/capacity/pointer with 24 char array, thus size of `std::string` is 24 bytes and capacity of small string is 22 chars (plus 1 byte null terminator plus 1 byte = bit tag and size of short strings). — Daniel Langr, Dec 06 '18 at 15:03
Is your question about the: [Why does Microsoft's implementation of std::string require 40 bytes on the stack?](https://stackoverflow.com/questions/40393350) — t.niese, Dec 06 '18 at 15:04
@Blaze I don't see how an internal array which is used up to a certain length violates the ContiguousContainer requirement. The container and iterator requirements seem to hold if the array is not dynamic. — Peter - Reinstate Monica, Dec 06 '18 at 15:05
Actually, I heart the second implementation in a community, and also this website presents some proofs about this: https://shaharmike.com/cpp/std-string/. However, There are also some people who say that the first one is the valid implementation. — Goktug, Dec 06 '18 at 15:06
@Goktug as NathanOlivier points out in his answer, `size()` and `capacity` member functions must have constant time complexity. Which implies, you have to store both somehow. Relevant section: http://eel.is/c++draft/string.capacity. — Daniel Langr, Dec 06 '18 at 15:08
@Goktug Keep in mind that *any* implementation which conforms to the standard -- especially the string (and corresponding iterator) API -- is a valid implementation. Even if one wouldn't believe it sometimes (partly because of the unfortunate "templates are headers" idiom), classes were designed to *hide* such details. As far as user code is concerned strings could acquire all their memory by "magic" (Arthur C. Clarke). — Peter - Reinstate Monica, Dec 06 '18 at 15:09

NathanOliver · Accepted Answer · 2018-12-06T15:18:52.560

5

1) Whole string is implemented with a char pointer (char*).

This is not a legal implementation. size() and capacity() and must be constant so you either need to store that information as pointer or integer variables.

2) Some parts of the string are implemented with a static array. Its size is equal to 40, and if length of the string exceeds 40, dynamic memory is allocated.

That array isn't a static member but this is legal since C++11 and is called small/short string optimization. One common way to implement this is

struct _internal
{
    char * start;
    char * end;
    char * cap;
};
union guts
{
    _internal ptrs;
    char arr[sizeof(_internal)];
}

and the string would be a wrapper around guts. This lets the array take up no more space than the pointer version but allows you to use the array untill you have more than sizeof(_internal) - 1 characters.

edited Dec 06 '18 at 15:18

answered Dec 06 '18 at 15:03

NathanOliver

171,901
28
288
402

Wait. What exactly prohibits a C++03 implementation to provide short string optimization? – Armen Tsirunyan Dec 06 '18 at 15:08
1

@ArmenTsirunyan See: https://stackoverflow.com/questions/22322487/c11-and-c03-differs-in-support-for-small-string-optimization-for-stdstring – NathanOliver Dec 06 '18 at 15:09
Are there typos in your fist paragraph or is it me? I can't really make sense of it – 463035818_is_not_an_ai Dec 06 '18 at 15:17
@user463035818 Just spotted it. Fixed. – NathanOliver Dec 06 '18 at 15:19
1

Just a note: your exemplary solution would need pointer tagging, which might be not easy to implement in a portable way. Libc++ therefore uses capacity member variable of integral type and uses its least significant bit (on little endian) as a tag of short/long string. If it's 0, then the string is short, which implies that the capacity of long strings is always odd. Remaining bits of LSB are used for short string size. Interesting approach is with `folly::FBString`, which stores remaining capacity of short strings and therefore LSB can serve as a null character terminator itself. – Daniel Langr Dec 06 '18 at 15:24

score 0 · Answer 2 · answered Dec 06 '18 at 15:08

I am quite sure that no implementation uses a static array, since that would not work, if two strings are allocated.

To use an fixed size array to improve memory handling is called short string optimization, but the c++ standard does only specified the interface and not the implementation, so it may vary.

The best you could do, is to take a look at your compilers implementation of std::string.

Internal Structure of Class string in C++

2 Answers2