9

What is the internal structure of std::wstring? Does it include the length? Is it null terminated? Both?

Jonathan Allen

3 Answers

14

Does it include the length

Yes. It's required by the C++11 standard.

§ 21.4.4

size_type size() const noexcept;
1. Returns: A count of the number of char-like objects currently in the string.
2. Complexity: constant time.

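Note, however, that size() is unaware of Unicode: it counts code units (wchar_t objects for a wstring), not code points or user-perceived characters.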
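To make the code-unit point concrete, here is a minimal sketch. The function names are made up for illustration. It uses std::u16string for the deterministic case, because the width of wchar_t (and therefore the result for std::wstring) differs between platforms.

```cpp
#include <cstddef>
#include <string>

// U+1F600 is a single code point outside the Basic Multilingual Plane.
// UTF-16 encodes it as a surrogate pair, so size() reports two code units.
std::size_t units_in_u16string() {
    std::u16string s = u"\U0001F600";
    return s.size();
}

// For std::wstring the answer depends on the platform's wchar_t:
// typically 2 where wchar_t is 16 bits wide, 1 where it is 32 bits wide.
std::size_t units_in_wstring() {
    std::wstring s = L"\U0001F600";
    return s.size();
}
```

Either way, size() never tells you how many characters a reader would see on screen.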


Is it null terminated

Yes. The C++11 standard also requires std::basic_string::c_str to return a pointer that is valid over the whole range [0,size()], and my_string[my_string.size()] must be valid and yield a null character. The element at index size() is therefore always a null terminator.

§ 21.4.7.1

const charT* c_str() const noexcept;
const charT* data() const noexcept;
1. Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
2. Complexity: constant time.
3. Requires: The program shall not alter any of the values stored in the character array.
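The quoted guarantee can be checked directly. This is a small sketch assuming a C++11 (or later) standard library; is_null_terminated is a hypothetical helper name, not part of any library.

```cpp
#include <string>

// Returns true if the element one past the last character is a null
// character, seen both through operator[] and through c_str(), and if
// c_str() + size() points at that same element.
bool is_null_terminated(const std::wstring& s) {
    const wchar_t* p = s.c_str();
    return s[s.size()] == L'\0'
        && p[s.size()] == L'\0'
        && p + s.size() == &s[s.size()];
}
```

Calling is_null_terminated(L"hello") or is_null_terminated(std::wstring()) should return true on a conforming C++11 implementation.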

Rapptz
  • Well I learned C# by reading the spec, I might as well do the same for C++. Where can I get a copy of it? – Jonathan Allen Jul 30 '13 at 07:15
  • 2
    @JonathanAllen I wouldn't learn from the standard, it's full of standardese so it'd be hard to read. However, you can find the C++14 CD [here](http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3690.pdf), along with multiple drafts and the standardisation process. – Rapptz Jul 30 '13 at 07:18
  • @JonathanAllen The best cost-free option is [open-std.org](http://www.open-std.org/JTC1/SC22/WG21/); you can't get the official version there, but you can get the latest draft which only differs in print-styling. It takes some searching to find which one is the newest-but-older-than-published, though. – Angew is no longer proud of SO Jul 30 '13 at 07:24
  • 1
    I don't know what you are comparing it to, but what you just posted is a heck of a lot easier to understand than the documentation on MSDN. – Jonathan Allen Jul 30 '13 at 07:26
  • 2
    @JonathanAllen If you want a good human-readable documentation I definitely support [this one](http://en.cppreference.com/w/Main_Page) – Rapptz Jul 30 '13 at 07:28
  • 1
    Not bad for a quick reference, but when learning something for the first time I prefer to read a book. Especially one that covers all of the nasty details. – Jonathan Allen Jul 30 '13 at 07:33
  • @JonathanAllen http://stackoverflow.com/questions/388242/the-definitive-c-book-guide-and-list has a list of books for learning C++. – Rapptz Jul 30 '13 at 07:34
  • There is no need for a string to have a null terminator until you call c_str(). – DanielKO Jul 30 '13 at 15:32
  • @DanielKO, how is c_str going to return a pointer to a null terminated string in constant time if that string doesn't already exist? It can't add a null terminator when c_str is called and it can't copy the original value in constant time. Therefore wstring must be null terminated internally. – Jonathan Allen Jul 30 '13 at 17:11
  • It could keep a buffer large enough to append a '\0' when needed. – DanielKO Jul 30 '13 at 17:12
  • Rule 3, "The program shall not alter any of the values stored in the character array." (Plus that's just silly. What would the buffers be initialized to? Zeros.) – Jonathan Allen Jul 30 '13 at 17:15
  • You are misreading rule 3. It says you, the user of the c_str() method, are not allowed to change anything in there (as in, can't const_cast it to charT*). And it's not silly, you might never need c_str() in a pure C++ program, so why waste time writing the null terminator? In particular, if you have SSO, and all your strings are 1 or 2 characters long. – DanielKO Jul 31 '13 at 19:36
  • The standard doesn't require that the length be stored as a field. For example, if you just stored `wchar_t` head and tail pointers and took the difference, you could get the length in constant time without having to store the length – SheetJS Aug 20 '13 at 14:36
  • @Nirk: The length is still included as part of the information in the structure, even if that information is encoded as the difference between a pair of pointers. What it's illegal for an implementation to do is, say, use a null-terminated string and then use `strlen`. – Puppy Aug 20 '13 at 18:27
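The head-and-tail-pointer idea from the last two comments can be sketched as follows. WideRange and make_range are hypothetical names invented for this illustration; this is not how any particular standard library lays out std::wstring.

```cpp
#include <cstddef>

// Hypothetical non-owning range over a wide character buffer.
// No length field is stored, yet the length is still encoded in the object.
struct WideRange {
    const wchar_t* head;  // first character
    const wchar_t* tail;  // one past the last character

    // Constant time: a pointer subtraction, no scan for a terminator.
    std::size_t size() const {
        return static_cast<std::size_t>(tail - head);
    }
};

// Builds a range over a string literal, excluding its trailing null.
template <std::size_t N>
WideRange make_range(const wchar_t (&literal)[N]) {
    return WideRange{literal, literal + N - 1};
}
```

Because size() never looks at the characters, make_range(L"a\0b").size() is 3 even though the buffer contains an embedded null, which a strlen-style scan would get wrong.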
11

We don't know. It's completely up to the implementation. (At least up until C++03; apparently C++11 requires the internal buffer to be 0-terminated.) If the C++ standard library implementation you are using is open source, you can look at its source code.


Apart from that, I'd find it logical if it was NUL-terminated and it stored an explicit length as well. This is good because then it takes constant time to return the length and a valid C string:

// Both accessors only return a stored member, so each runs in constant time.
size_t length() const
{
    return m_length;
}

const wchar_t *c_str() const
{
    return m_cstr;
}

If it didn't store an explicit length, then size() would have to count the characters up to the NUL in O(n), which is wasteful if you can avoid it.

If, however, the internal buffer wasn't NUL-terminated, but it only stored the length, then it would be tedious to create a proper NUL-terminated C string: the string would have to either reallocate its storage and append the 0 (and reallocation is an expensive operation), or it would have to copy the entire buffer over, which is again an O(n) operation.

(Warning: shameless self-promotion - in a C language project I am currently working on, I've taken exactly this approach to implement flexible string objects.)

  • 4
    It is [guaranteed to be null-terminated in c++11](http://stackoverflow.com/questions/6077189/will-stdstring-always-be-null-terminated-in-c11). – Jesse Good Jul 30 '13 at 06:39
0

basic_string (of which wstring is a typedef) has no need for terminators.

Yes, it manages its own lengths.

If you need a null-terminated (aka C string) version of a string/wstring, call c_str(). Note that the string itself can contain embedded null characters, in which case pretty much every C function that handles C strings will stop at the first one and fail to see the entire string.
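A short sketch of the embedded-null case, with hypothetical helper names. The string object knows its own length, while std::wcslen only sees the part before the first null.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Builds the 5-character string a, b, null, c, d. The count is passed
// explicitly so the constructor does not stop at the embedded null.
std::wstring with_embedded_null() {
    return std::wstring(L"ab\0cd", 5);
}

// Length as a C function sees it: characters before the first null.
std::size_t length_seen_by_c(const std::wstring& s) {
    return std::wcslen(s.c_str());
}
```

Here with_embedded_null().size() is 5, while length_seen_by_c(with_embedded_null()) is 2.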

DanielKO
  • 1
    I'm afraid this doesn't answer the question. OP is asking about the **internal implementation** of the string; he is presumably well aware of the `.c_str()` member function and knows why and when to use it. Also, I hope you know about the wide-string handling functions in the C standard library, such as `wcslen()`. –  Jul 30 '13 at 06:27
  • Actually I'm a journalist trying to write about how Platform::StringReference works in conjunction with wchar_t* and wstring. Apparently StringReference "requires a null terminated string of type (wchar_t* or wstring)" to work without creating a copy. Or perhaps he means "requires a (null terminated string of type wchar_t*) or wstring". Too bad spoken words don't have parens. – Jonathan Allen Jul 30 '13 at 07:13
  • 1
    Yes, I so didn't answer his question, the chosen answer gave the exact same three answers as me, one hour later. Maybe I should just write a long prose without addressing the question, or just quote the standard while misinterpreting it. – DanielKO Jul 30 '13 at 15:27