Partially Prefixed Null Terminated Strings

Question

Whether to use NULL Terminated strings (SZ) or Length Prefixed strings (LPS) seems to be a hot button topic. Indeed, we even have a question on that topic here.

So something occurred to me. Could we use a bit of both? I mean, obviously, you can't make a LPS & SZ (LPSZ?) string without removing some of the great things about one or the other.

However, the biggest complaint about SZ seems to be the time required for operations, due to string length measurements taking so long. I had an idea about that:

Make all strings SZ. However, we could also store the length of the string mod 256 (ie: len % 256) in a single byte preceding the string. While it would not reduce the complexity of the operation, it could potentially increase the speed substantially, at the cost of a single extra byte.

Here's what the main advantages of this scheme would be:

Not restricted to any particular size
Faster than normal SZ (upto 256x)
Compatible across all different memory sizes
Small strings are still quite space efficient (no wasted bytes in len)
No endianness issues
Portable across machines (see 3 & 5)

This is what strlen () would look like under this scheme (obviously, you would name it something else, because you would be making a different library):

size_t ppsz_strlen (const char *s) {
    // get LPS
    size_t = ((uint8_t *) s)[-1];


    // check for SZ across every 256 byte interval
    for (x = x; s[x]; x += 256)
        continue;

    return x;
}

A suitable name might be PPSZ (Partially Prefixed Null Terminated Strings).

To me, this seems to be a reasonable tradeoff: one byte for a fairly large acceleration. Of course, someone might ask: why not two, or four, or eight bytes? My answer would be that most strings in programs don't get too big, where skipping ahead by 65536, 16777216, 2 ** 32, or 2 ** 64 becomes too valuable. With a few of those cases, it might actually be a good time to consider splitting up strings. Especially if one string is overflowing beyond the size of a 64-bit addressing space.

Anyways though, I was wondering if anyone else had some ideas regarding the concept. I'm positive there's probably something I'm missing out on as to why I haven't seen the concept in practice before.

Thanks for any suggestions or problems found!

EDIT: removed the background information to make the intent of the question more obvious (are there any potential flaws I may have missed, etc. in my algorithm)

note: this algorithm is mainly intended to help with the speed issues in SZ

You missed some - null-terminated arrays cannot be strings because they cannot contain a complete character set, (eg. nul char in ASCII). Prefixed string types have the option of storing other stuff in the prefix, eg. a reference count, read-only flag and/or a mutex/CS object for thread-safety. — Martin James, Sep 30 '14 at 10:58
Just looking through the 'sockets' tag demonstrates the apalling mess that can be generated by misunderstanding C 'strings'. Calling strlen() and/or printf(%s..) on char buffers that are not guaranteed to be null-terminated happens ~every day in that tag. — Martin James, Sep 30 '14 at 11:04
I understand null termination perfectly. C doesn't do strings per se, rather just arrays of integers. The string related functions are only intended for use with SZ, and I can see how not getting that might cause issues. My main idea here was to just enhance SZ a tiny bit - and I wanted to see if there are any other issues I might not have seen with my idea. — haneefmubarak, Sep 30 '14 at 11:08
Thoughts 1) `size_t = ((unsigned char *) s)[-1];` not rely on `uint8_t` existence and to take advantage of larger `char`. Then use `x += 1u << CHAR_BIT` 2) It is at least a novel idea. +1 for that 3) think you wanted `for (x = 0;` 4) IMO, The weakness of not allowing all `char` such as a `'\0'` mid-way in a C string is not solved here and since this change, though helpful does not solve that, renders this approach an insufficient improvement. — chux - Reinstate Monica, Sep 30 '14 at 16:24
I'd suggest a better idea would be to have a prefix which contains bytes with the MSB set followed by one with the MSB clear. The allocation length would be half the value represented the lower bits in each byte, interpreted as a base-128 number. The very bottom bit would indicate either that the string length and allocation length were equal, or that the difference between actual and allocated length was stored in the last byte(s) of the buffer, in similar format. A leading-byte value of "1" [zero characters allocated, but not all used] would indicate that... — supercat, Mar 05 '15 at 20:50
... a "string pointer" actually pointed to `struct INDIRECT_STRING {char header; char *data; size_t allocated_length; size_t current_length;}`. Such an approach would, without tweaking, be efficient on systems with under 1K of RAM, or with thousands of gigabytes. — supercat, Mar 05 '15 at 20:58

Partially Prefixed Null Terminated Strings

EDIT: removed the background information to make the intent of the question more obvious (are there any potential flaws I may have missed, etc. in my algorithm)

note: this algorithm is mainly intended to help with the speed issues in SZ

0 Answers0