
Hopefully improved and more focused version of my question:

For reasons that would be misleading to explain (see below), I have to store several UTF-8 encoded strings within a single string. (String means a C++ std::string here)

My approach is to join the strings with one of the illegal UTF-8 octets (0xC0, 0xC1, 0xF5-0xFF) as a delimiter, since these octets can never appear within a valid UTF-8 sequence. (As 0x00 is a valid UTF-8 octet, I think it's not appropriate for my intended misuse.)

All considerations regarding performance aside, are there any problems with this approach I'm not aware of? Is there any reason to prefer one of the illegal octets?
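For illustration, here is a minimal sketch of the joining and splitting I have in mind (the choice of 0xFF as the delimiter and the helper names are just for this example):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// 0xFF can never occur inside a valid UTF-8 sequence, so it is safe as a delimiter.
constexpr char kDelim = '\xFF';

std::string joinStrings(const std::vector<std::string>& parts) {
    std::string joined;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) joined += kDelim;
        joined += parts[i];
    }
    return joined;
}

std::vector<std::string> splitStrings(const std::string& joined) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        std::size_t pos = joined.find(kDelim, start);
        if (pos == std::string::npos) {
            parts.push_back(joined.substr(start));  // last (or only) element
            break;
        }
        parts.push_back(joined.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```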

..

In my original question I tried to provide more context, but that led to several questions about performance issues and the intended trade-off. But my question is not about those trade-offs, it's just about whether my approach is technically feasible and valid.

z80crew
  • Are your strings dynamically modified or are they read-only? – Alexis Wilke May 30 '19 at 16:31
  • @AlexisWilke The strings can be modified, that is, replaced by a modified string. I've extended my pseudo code showing how I intend to do that. – z80crew May 30 '19 at 16:41
  • @z80crew: "*where I need fast access to single strings by their indices.*" That's not going to happen with your data structure as is. Your `split` function will have to iterate through the *entire* storage space (O(n)) to find the N-th element. That will not be "fast access". – Nicol Bolas May 30 '19 at 16:47
  • To be honest I'm surprised this trade-off is worthwhile. Do you _really_ save that much memory? Meanwhile your new strings are _much, much, much_ harder to look up (unless you're indexing their start positions, again requiring more memory - and making the delimiter pointless, so I guess not)! Plus modifying them will be a ballache. I think the question is too broad to answer directly, but the premise seems flawed. – Lightness Races in Orbit May 30 '19 at 16:48
  • By the way, `"abc" + 0xC0 + "def" + 0xC0 + "ghi"` isn't string concatenation, but you may know that already and may be ignoring it for the purposes of exposition – Lightness Races in Orbit May 30 '19 at 16:49
  • @z80crew: "*for some important use cases it creates too much memory overhead*" can you clarify what those "important use cases" are which created "too much memory overhead"? – Nicol Bolas May 30 '19 at 16:51
  • @z80crew: "*two-dimensional mutable array*" How exactly is this 2D array mutable? Is the dimensionality of the 2D array mutable, or is it just the strings *within* the array which are mutable? – Nicol Bolas May 30 '19 at 16:53
  • IMO it would be better if you explained what your application does (please edit the question). Then we could validate your approach or propose a better solution. Right now it looks like an [XY problem](http://xyproblem.info/). I'm not sure if this is really a question. – Marek R May 30 '19 at 17:02
  • Also, instead of using some crazy illegal UTF-8 sequence, it is simpler, faster and cleaner to just use a zero `0` as the separator. – Marek R May 30 '19 at 17:05
  • Yeah, [zero's fine](https://stackoverflow.com/a/6907327/560648) (as long as you don't actually want to encode nulls!) – Lightness Races in Orbit May 30 '19 at 17:16
  • I've of course tested the solution before asking, and yes, it saves a lot of memory. In one use case memory usage declined from 7.5 GB to 3 GB; that's the difference between being able to run on an 8 GB machine or not. So my question is not about the trade-off and whether it's worth it, it's about possible problems in obscure situations that I'm possibly overlooking. – z80crew May 30 '19 at 22:16
  • `std::string` isn't going to care. It's just managing a sequence of `char`s; it will happily hold binary data if needed. Whatever processing you plan to perform on that string might care. – Igor Tandetnik May 31 '19 at 03:31
  • The only issue I see is that if you use an illegal code, you will have to write the code that steps over it. Standard UTF-8 parsers will be very unhappy! While '\0' is a legal code, the same is true of ASCII at a fundamental level; generally, in both, \0 is reserved as a delimiter anyway. The other option may be to consider one of the control codes from \001 to \037. Some have specific purposes that you might encounter in real string data, but many are "junk" in common usage. – Gem Taylor May 31 '19 at 21:48

1 Answer


As others mentioned, any delimiter byte that works in your situation will be stored just fine in an std::string. However, if your strings do not otherwise contain '\0', it may be cleaner to use that rather than an illegal UTF-8 byte.
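For what it's worth, a quick (untested) illustration that std::string preserves an embedded '\0' without trouble:

```cpp
#include <cassert>
#include <string>

int main() {
    std::string joined = "abc";
    joined += '\0';                      // embedded NUL as the delimiter
    joined += "def";
    assert(joined.size() == 7);          // length is tracked explicitly
    assert(joined.find('\0') == 3);      // the delimiter is searchable
}
```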

If your implementation is satisfactory in terms of speed, then I would imagine that's that. Otherwise, you could look into how databases manage their storage: they use buffers of a fixed size. The big advantage is that you do not break memory into many small chunks and run into memory allocation problems later. Speed-wise, you allocate those blocks once and re-use them many times. The malloc() and free() functions are expensive, especially if you have tons of objects (the new and delete operators call those functions).
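A very rough sketch of that fixed-block idea (the 64 KiB size and the class name are arbitrary choices for the example):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Append strings into fixed 64 KiB blocks instead of allocating one
// heap object per entry; each block is allocated once and then filled.
class BlockStore {
public:
    static constexpr std::size_t kBlockSize = 64 * 1024;

    void append(const std::string& s) {
        if (blocks_.empty() || blocks_.back().size() + s.size() > kBlockSize) {
            blocks_.emplace_back();
            blocks_.back().reserve(kBlockSize);  // one allocation per block
        }
        blocks_.back() += s;  // oversized strings simply get their own block
    }

private:
    std::vector<std::string> blocks_;
};
```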

Now, to save even more memory, since it sounds like that is the main goal, and if possible in your situation, you could consider compressing your strings with zlib. I would use the fastest compression mode and check whether the resulting buffer is smaller; if yes, use it, otherwise keep the uncompressed string. This requires you to save a size (4 bytes) per string; you can set the size to 0 when the buffer is not compressed.
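A sketch of how that could look with zlib's compress2() (the helper name is mine, the "keep only if smaller" policy is as described above; link with -lz):

```cpp
#include <zlib.h>
#include <cstdint>
#include <string>
#include <vector>

// Try the fastest compression level; keep the compressed form only if it is
// actually smaller. The caller stores originalSize next to the returned
// buffer; 0 means "stored uncompressed".
std::string maybeCompress(const std::string& in, std::uint32_t& originalSize) {
    uLongf destLen = compressBound(in.size());
    std::vector<Bytef> buf(destLen);
    int rc = compress2(buf.data(), &destLen,
                       reinterpret_cast<const Bytef*>(in.data()), in.size(),
                       Z_BEST_SPEED);
    if (rc == Z_OK && destLen < in.size()) {
        originalSize = static_cast<std::uint32_t>(in.size());
        return std::string(reinterpret_cast<const char*>(buf.data()), destLen);
    }
    originalSize = 0;  // not compressed
    return in;
}
```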

One other thing I'd like to mention is that using an illegal byte may be confusing to a future programmer maintaining the code base. No matter how many comments you leave, they will probably not read them anyway... you know... programmers tend to read the code, not so much the comments. If that is something you are worried about, you could store your concatenated strings in a vector of char instead. Your split function would then take a vector of char as input and return a vector of strings as its result.
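For example (the delimiter value and the function name are placeholders), such a split could look like:

```cpp
#include <string>
#include <vector>

// Split a flat buffer of chars into separate strings at each delimiter byte.
std::vector<std::string> split(const std::vector<char>& data, char delim) {
    std::vector<std::string> out;
    std::string current;
    for (char c : data) {
        if (c == delim) {
            out.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    out.push_back(current);  // last element has no trailing delimiter
    return out;
}
```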

Another possibility is to make use of memory that can be swapped out through mmap(). This can be tedious, though, when handling dynamic data; this is where a database-like scheme helps very much. You would allocate blocks (e.g. 64 KiB at a time) and manage your data on a per-block basis. When a string grows too big for the current block, move it to a new block... The advantage of this technique is that the data remains in memory unless the OS decides that it needs some of the RAM your software is using, in which case it can swap it out at any time. To you, that swapping is totally transparent, and it is much faster than hitting the default swap, which has to manage your memory in a much less efficient manner.
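One way to read the mmap() suggestion is to back each block with an anonymous mapping (POSIX-only sketch, error handling kept minimal):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Each block is an anonymous mapping that the OS can page out on its own
// when RAM gets tight; the application keeps using the pointer as usual.
void* allocBlock(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

void freeBlock(void* p, std::size_t size) {
    munmap(p, size);
}
```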

Alexis Wilke