2

Recently I was reading about the small string optimization (SSO) in the question "What are the mechanics of short string optimization in libc++?". As we know, a string typically consists of 3 pointers, which is 24 bytes on a 64 bit system. The linked answer says that in libc++'s implementation, the very first bit of the first pointer indicates whether the string is in "long" or "short" mode, i.e. heap allocation and external storage versus internal storage of up to 22 characters.

This assumes, however, that the first bit of the first pointer can never meaningfully be part of the address, because whenever the string is in "long" mode, that bit will always be set (or unset, depending on which convention was chosen). This seems reasonable on its face, since 64 bit pointers allow 2^64 addresses, which is more than a 1 followed by 19 zeroes in bytes, or more than 16 billion gigabytes.

So this is reasonable, though not certain. My question is: is this guaranteed somewhere? And if it is guaranteed, where? By the architecture spec, or by something else? To take it a step further: how many bits is it safe to do this with? I have a vague recollection of reading somewhere that only 48 bits are used, but I'm not sure.

If some number of bits, e.g. 8 or 16, are guaranteed to be untouched, that could certainly be leveraged in some interesting ways. It would be nice to exploit this, but not at the cost of having code mysteriously fail on some machine.

Nir Friedman
  • Itanium and Alpha are very different platforms. About the best you can do is make statements about the memory layout used by the OS. – Flexo Jan 13 '16 at 20:32
  • @Flexo Well, that is already a partial answer; if the Itanium spec has something to say about it, then that's a start. – Nir Friedman Jan 13 '16 at 20:39
  • Already a vote to close? Can I ask why? – Nir Friedman Jan 13 '16 at 20:39
  • It's more likely to be the *least* significant bit, not the *most*. The long string buffer will be allocated from the heap, and will almost certainly be 8-byte aligned (at least), so at least the bottom three bits are zero. This avoids making any assumptions about overall memory layout. – Martin Bonner supports Monica Jan 13 '16 at 20:41
  • You're mistaken in multiple ways. The short string flag doesn't overlap with a pointer at all, it overlaps with the field that holds the *size* of a long string. Even if it did overlap with a pointer, it would be the low bit of the pointer, meaning that memory *range* doesn't come into it, only alignment. – hobbs Jan 13 '16 at 20:42
  • @MartinBonner Thanks for pointing out the size. As for the alignment, I don't think this works out. std::string is a typedef of basic_string, so SSO is implemented in basic_string. basic_string is allocator aware, so it must work with any allocator. While the standard allocator will return 8-byte aligned memory, conforming allocators are not required to. Since it's allocating chars, it can return 1-byte aligned memory. – Nir Friedman Jan 13 '16 at 20:55
  • @hobbs See my comment above. – Nir Friedman Jan 13 '16 at 20:58
  • @NirFriedman still, it means that there's nothing germane in your question. – hobbs Jan 13 '16 at 21:07
  • @hobbs Well, actually what it means is that the motivating example was flawed. The question itself still stands, but obviously you are entitled to your opinion. – Nir Friedman Jan 13 '16 at 21:09

2 Answers

4

"As we know, a string typically consists of 3 pointers, which is 24 bytes on a 64 bit system."

This is not true with libc++. The __long structure, used for "long" strings, is defined as:

struct __long
{
    size_type __cap_;
    size_type __size_;
    pointer   __data_;
};

The short flag therefore goes into the capacity field, making the whole thing moot.

As for pointer tagging, there is no universal guarantee about how many bits of a pointer are actually used. On x86_64, the page tables that the CPU uses for virtual address translation currently cover only 48 bits of a virtual address (physical addresses can be up to 52 bits wide), so the upper 16 bits of a virtual address carry no additional information. Additionally, most operating systems map their kernel into every process and reserve some amount of the high end of the address space for it, so in practice, user-mode pointers are even more restricted. On Windows, the most significant hardware-usable bit of a pointer tells whether it belongs to kernel-space or user-space.
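
To make the 48-bit limit concrete, here is a minimal sketch, assuming 48-bit virtual addresses on x86_64, where a valid ("canonical") address has bits 48 to 63 all equal to bit 47. The names kVirtualBits, is_canonical and canonicalize are hypothetical helpers for illustration, not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 48-bit virtual addresses (x86_64 with 4-level paging).
constexpr unsigned kVirtualBits = 48;

// Hypothetical helper: true when bits 63..47 are all zero or all one,
// i.e. the upper bits are a sign extension of bit 47.
constexpr bool is_canonical(std::uint64_t addr) {
    return (addr >> (kVirtualBits - 1)) == 0 ||
           (addr >> (kVirtualBits - 1)) == 0x1FFFF;  // 17 one bits
}

// Hypothetical helper: discards whatever was stored in the upper 16 bits
// and rebuilds them by sign-extending bit 47.
inline std::uint64_t canonicalize(std::uint64_t addr) {
    const std::uint64_t low  = addr & ((std::uint64_t{1} << kVirtualBits) - 1);
    const std::uint64_t sign = std::uint64_t{1} << (kVirtualBits - 1);
    const std::uint64_t result = (low ^ sign) - sign;  // unsigned sign extension
    assert(is_canonical(result));
    return result;
}
```

For instance, canonicalize(0x1234800000000000) yields 0xFFFF800000000000. Any scheme that hides data in the upper bits would have to run a step like this before every dereference, and it stops being valid as soon as the platform uses more than 48 address bits.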

These limits can change in the future and will vary across platforms, so it would be bad form to use them in a platform-independent standard library. In general, it's much better practice to use the least-significant bits for pointer tagging, since your application is in control of these.
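
As an illustration of tagging the least-significant bit, here is a minimal sketch. TaggedPtr is a hypothetical class written for this answer. It assumes that the pointed-to type has an alignment of at least 2 and that converting a pointer to std::uintptr_t yields its numeric address, which is implementation-defined but holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper: keeps one boolean flag in bit 0 of a pointer.
// Assumption: alignof(T) >= 2, so bit 0 of any valid T* is always zero.
template <typename T>
class TaggedPtr {
    static_assert(alignof(T) >= 2, "need at least one spare low bit");
    std::uintptr_t bits_ = 0;

public:
    TaggedPtr(T* p, bool flag) {
        const auto raw = reinterpret_cast<std::uintptr_t>(p);
        // Guard against pointers from an allocator that under-aligns.
        assert((raw & 1u) == 0 && "pointer is not sufficiently aligned");
        bits_ = raw | (flag ? 1u : 0u);
    }

    // Mask the flag off before handing the pointer back.
    T* get() const { return reinterpret_cast<T*>(bits_ & ~std::uintptr_t{1}); }

    bool flag() const { return (bits_ & 1u) != 0; }
};
```

The static_assert and the runtime assert are the point of the design: the spare bit exists only because the application controls the alignment, so the code checks that assumption instead of relying on the address-space layout.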

zneak
  • Actually, x86-64 has canonical addresses, so the upper 16 bits are copies of bit 47, so they *can* be set. – EOF Jan 13 '16 at 20:59
  • @EOF is this guaranteed then? If this is the case, then it sounds like it is actually possible to use those bits for something, and simply set the top 16 bits before dereferencing. – Nir Friedman Jan 13 '16 at 21:13
  • @NirFriedman: Yes, canonical pointers are enforced by hardware, you get an exception if you try to dereference a non-canonical address. However, if the virtual address-space is expanded in future processor generations, code relying on those bits being unused will not work anymore. I'm pretty sure canonical addresses are supposed to discourage this use. – EOF Jan 13 '16 at 21:17
3

The "long-bit" isn't part of a pointer, but of the capacity:

struct __long
{
    size_type __cap_;
    size_type __size_;
    pointer   __data_;
};

The "trick" is that if you always allocate an even number of characters and reserve one for the nul terminator, the resulting capacity will always be an odd number. And you get the 1-bit for free!
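
The arithmetic behind the trick can be sketched as follows. This only illustrates the even/odd reasoning above and is not libc++'s actual code; allocation_size, usable_capacity and is_long are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of chars to allocate for a string of `chars`
// characters plus its nul terminator, rounded up to an even count.
inline std::size_t allocation_size(std::size_t chars) {
    const std::size_t needed = chars + 1;           // +1 for the nul terminator
    return (needed + 1) & ~std::size_t{1};          // round up to even
}

// Capacity excludes the terminator: even - 1 is always odd.
inline std::size_t usable_capacity(std::size_t chars) {
    return allocation_size(chars) - 1;
}

// Assumption for this sketch: a set low bit in the capacity field
// marks the string as "long".
inline bool is_long(std::size_t cap_field) {
    return (cap_field & 1u) != 0;
}
```

A request for 22 characters needs 23 with the terminator, is rounded up to an allocation of 24, and so has a capacity of 23, whose lowest bit is set.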

Bo Persson
  • Thanks for pointing this out, I stupidly assumed that string used 3 pointers like vector usually does, definitely a brain fart. I'm accepting zneak's answer as he added more information about the heart of my question, but thanks to you as well! – Nir Friedman Jan 13 '16 at 21:00