
I tried to understand the internal structure of std::string using GDB, and I want to check whether I understand it correctly.

I have a std::string object that contains the string AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA (32 A's).

When I look at it in GDB, I see the header: 0x00000020 0x00000020 0x00000000

data: 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141

And when that object is released with std::string::~string, I see the same data, but the header is:

0x00000000 0x00000020 0xffffffff

Is that right? Is 0x20 the size of the string (and why do I see it twice?), and when the std::string object is released, is 0x00000000 replaced with 0xffffffff?

I don't fully understand this, so please explain.

python3.789
  • Your understanding **may** be correct. There is no defined action or memory state after an object is deleted; the memory is in an undefined state. – Thomas Matthews May 09 '23 at 20:07
  • Which standard library implementation are you looking at? Every implementation can of course implement it however it wants as long as the standard's specification for `std::string` is satisfied. C++ doesn't have a reference implementation as e.g. Python does, nor is there typically any specification of memory layouts of standard library classes. – user17732522 May 09 '23 at 20:08
  • You might want to watch this CPPCON presentation: https://www.youtube.com/watch?v=kPR8h4-qZdk – alagner May 10 '23 at 04:08

2 Answers


Depending on your library implementation, std::string is likely more complicated than expected because of short string optimization.

Paraphrasing from that answer, a simple implementation might store the size, capacity and a pointer to the data.

Simple

class string {
    // ...
private:
    std::unique_ptr<char[]> m_data;  // heap buffer holding the characters
    size_type m_size;                // current length
    size_type m_capacity;            // allocated length
};

A more performant implementation might store short strings as a size and a buffer directly in the object, and longer strings as a size, a capacity and a pointer to the data.

Optimized

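class string {
    // ...
private:
    size_type m_size;
    union {
        // struct rather than class, so the enclosing class can access the members
        struct {
            std::unique_ptr<char[]> m_data;
            size_type m_capacity;
        } m_large;                                  // used for long strings
        std::array<char, sizeof(m_large)> m_small;  // used for short strings
    };
};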

It may be difficult to unravel all of this in a debugger.

You can also read the code directly, depending on your library, e.g. libc++.

RandomBits

I look at the internal structure of library types, especially container objects, in order to understand more about how a compiler performs its magic. My standard approach is to copy the object to a std::array of the same size, then print the array in hex. This is useful for exploring exactly what happens when a container object is "moved", as well as for learning how the different library authors implemented the container.

Here's the basic code adapted to std::string. It examines how the object changes between an empty string, a string with the maximum SSO contents, and a string long enough that its characters must be stored on the heap.

With SSO, a pointer to any character of a short string lies between the start and the end of the string object itself. The code uses this to detect when a string no longer fits in the object.

#include <string>
#include <array>
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <cstring>

void print_string_object(std::string& s)
{
    // Check that the size of a string object is a multiple of a pointer size
    static_assert(sizeof(uintptr_t) * (sizeof(s) / sizeof(uintptr_t)) == sizeof(s));

    // Create an array of uintptr_t that is the same size as a string
    using s_obj = std::array<uintptr_t, sizeof(s) / sizeof(uintptr_t)>;
    // Copy the raw bytes with memcpy; dereferencing a reinterpret_cast pointer
    // to an unrelated type would violate the strict aliasing rules
    s_obj s_ptrs;
    std::memcpy(s_ptrs.data(), &s, sizeof(s));

    // Print details of string object in hex
    std::cout << "Address of Object\n  " << std::setfill('0') << std::setw(2*sizeof(uintptr_t)) << std::hex << &s << "\nObject\n";
    for (auto x : s_ptrs)
        std::cout << "  " << std::setfill('0') <<  std::setw(2 * sizeof(uintptr_t)) << std::hex << x << '\n';
}

int max_SSO(std::string &s)
{
    // return the maximum string stored in a string object (SSO)
    // and set s with bytes 0 1 2 3 ... until SSO is maxed out
    std::string s0;
    uintptr_t base = reinterpret_cast<uintptr_t>(&s0);
    uintptr_t top = reinterpret_cast<uintptr_t>(&s0) + sizeof(s0);
    for (int i = 0;; i++)
    {
        s0 += static_cast<char>(i);
        if (reinterpret_cast<uintptr_t>(&s0[0]) < base || reinterpret_cast<uintptr_t>(&s0[0]) >= top)
            return i;
        s += static_cast<char>(i);
    }
}

int main()
{
    std::string s;
    std::cout << "Capacity of empty string=" << s.capacity() << '\n';
    std::cout << "Empty string\n";
    print_string_object(s); // print details of null string 
    std::cout << "\nFull SSO string length=" << std::dec <<  max_SSO(s) << "\n";
    print_string_object(s); // print details of max SSO string 
    s += "0";
    std::cout << "\nDynamic memory string\n";
    print_string_object(s); // print details of dynamic allocated string 
}

And here's a link to compiler explorer for clang and gcc

MSVC output x64 is:

Capacity of empty string=15
Empty string
Address of Object
  000000AF535AF840
Object
  0000000000000000
  0000000000000000
  0000000000000000
  000000000000000f

Full SSO string length=15
Address of Object
  000000AF535AF840
Object
  0706050403020100
  000e0d0c0b0a0908
  000000000000000f
  000000000000000f

Dynamic memory string
Address of Object
  000000AF535AF840
Object
  00000235a757e8a0
  000e0d0c0b0a0908
  0000000000000010
  000000000000001f

For MSVC, the first 16 bytes are used to store the SSO chars. This allows for a string length of 15 plus the required terminating null char. When dynamic memory is required for longer strings, the first 8 bytes are a pointer to the chars stored on the heap. The last 2 entries are the current string size and the capacity, i.e. the maximum string size before a new memory allocation is needed. GCC and Clang have somewhat different layouts. Clang, in particular, allows SSO string sizes up to 22 chars and its object size is 8 bytes less! Very efficient.

I've found the approach very useful for quickly understanding what is actually going on in library container code.

doug