
I want to implement a C++ library, and like many other libs I need to take string arguments from the user and give strings back. The current standard defines std::string and std::wstring (I prefer wstring). In theory I have to implement each method with string arguments twice:

virtual void foo(std::string &) = 0;  // convert internally from a previously defined charset to Unicode
virtual void foo(std::wstring &) = 0; // native wide string

C++0x doesn't make life easier; for char16_t and char32_t I also need:

virtual void foo(std::u16string &) = 0; // UTF-16
virtual void foo(std::u32string &) = 0; // UTF-32

Handling such different types internally - for example, putting them all into a private vector member - requires conversions, wrappers... it's horrible.

Another problem arises if a user (or I) wants to work with custom allocators or customized traits classes: every combination results in a completely new type. For example, to write a custom codecvt specialization for a multibyte charset, the standard says I have to introduce a custom state_type - which requires a custom traits class, which in turn yields a new std::basic_ifstream<> type - and that type is completely incompatible with interfaces expecting std::ifstream& as an argument.
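
To illustrate (a sketch with hypothetical names my_state and my_traits; the stream is widened to wchar_t, as a multibyte-to-Unicode codecvt would require):

#include <fstream>
#include <string>

// hypothetical shift state for a stateful multibyte encoding
struct my_state { int shift_level = 0; };

// the standard keys codecvt's state_type through char_traits,
// so a new state type forces a new traits class...
struct my_traits : std::char_traits<wchar_t> {
    using state_type = my_state;
};

// ...which in turn produces a brand-new stream type
using my_wifstream = std::basic_ifstream<wchar_t, my_traits>;

void consume(std::wifstream&); // an interface expecting the standard type
// consume(my_stream);         // won't compile: my_wifstream is a different type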

One -possible- solution is to make each library class a template that manages the value_type, traits and allocators specified by the user. But that's overkill, and it makes abstract base classes (interfaces) impossible.

Another solution is to specify one type (e.g. u32string) as the default, so every user must pass data using this type. But now think about a project that uses 3 libraries, where the first lib uses u32string, the second lib u16string and the third lib wstring -> HELL.

What I really want is to declare a method just as void foo(put_unicode_string_here) - without introducing my own UnicodeString or UnicodeStream class.

cytrinox
  • Within your application you should use **ONE** string class consistently (everything should be that type, one of the above). This is your internal string representation. The interface into your application may take multiple different types of input, but this input is always converted into the internal string representation before it is passed out of the interface layer (see the sketch after these comments). – Martin York Oct 17 '10 at 18:51
  • I am not sure what you are referring to in relation to the codecvt. It seems relatively straightforward to write them. See here for a simple usage pattern: http://stackoverflow.com/questions/207662/writing-utf16-to-file-in-binary-mode/208431#208431 – Martin York Oct 17 '10 at 18:56
  • If you have to use the state_type (which is mbstate_t in your example) in your codecvt implementation, you will see that you can't use mbstate_t, because its implementation is only known to the developer of your C++ STL. You need to introduce your own state_type and then specialize codecvt for it. That's what the standard tells you to do for custom codecvts. Deriving from std::codecvt was described in Stroustrup's book, but that does not cover situations where state_type must be used. – cytrinox Oct 17 '10 at 19:26
  • @Martin, in response to your first comment: the problem is the conversion from the input to the internal representation. How do you easily convert between basic_string<...> and std::wstring? How do you convert between basic_ifstream<...> and std::wifstream? And remember, some methods may want a reference to std::wstring & Co. That means that the user (or the lib developer, if conversion is done internally) has to manage both the original object and the converted object (which is passed to the lib). – cytrinox Oct 17 '10 at 19:42
  • @cytrinox: I still don't understand what the problem is with codecvt. As I said, it is relatively straightforward even when you use the state type. Maybe it would be best if you ask a question exactly about this subject so you can get advice on implementing it correctly. – Martin York Oct 17 '10 at 21:35
  • With streams I still don't see the problem. You have defined the internal representation. DONE. No arguments. One type for strings throughout your code. Now you have the interface between your code and the rest of the world. The interface takes objects of the correct type, or objects that can be converted to the correct type; it converts them and passes them to the internal code. Easy. Done. – Martin York Oct 17 '10 at 21:38
  • Well, I don't need support for building custom codecvts. But if you tell me that using mbstate_t is easy, then please tell me how mbstate_t is defined (standard reference, page no., or anything else). – cytrinox Oct 17 '10 at 21:50
  • @Martin: The codecvt interface is indeed broken. `mbstate_t` could be `char` for all the user knows. For proper extensibility, it needs to be an abstract base class. The `char_traits` interface is indeed pretty broken. Essentially the language designers threw it together without respect for individual use-cases. Ideally "one size fits all," but if that were so, there would be no need for traits! – Potatoswatter Apr 29 '11 at 10:13
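
A minimal sketch of the pattern Martin York describes above, assuming the library settles on std::wstring internally (library, to_internal and store are made-up names):

#include <string>
#include <utility>
#include <vector>

using internal_string = std::wstring; // the ONE internal type

// hypothetical helpers converting at the interface boundary
inline internal_string to_internal(const std::wstring& s) { return s; }
inline internal_string to_internal(const std::string& s)
{
    // naive char-by-char widening, good enough for ASCII; a real
    // library would decode UTF-8 here (e.g. with a codecvt facet)
    return internal_string(s.begin(), s.end());
}

class library {
public:
    // the public surface accepts several types...
    void foo(const std::string& s)  { store(to_internal(s)); }
    void foo(const std::wstring& s) { store(to_internal(s)); }
private:
    // ...but everything past the boundary sees exactly one type
    void store(internal_string s) { data_.push_back(std::move(s)); }
    std::vector<internal_string> data_;
};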

2 Answers


There is always a choice to be made if you don't want to support everything, but I personally feel restricting input to UTF-8 is the easiest of all. Just use plain old std::string and everyone's happy. In practice, the user of your library will only have to convert to UTF-8 if he's on Windows, and there is a plethora of ways to do that simple task.
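
For instance, one C++11 way (among many) to get from a wide string to UTF-8, assuming wchar_t holds UTF-16 as it does on Windows:

#include <codecvt>
#include <locale>
#include <string>

// wide (UTF-16) -> UTF-8; note <codecvt> was later deprecated in C++17
std::string to_utf8(const std::wstring& w)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(w);
}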

UPDATE: on the other hand, you could template all of your code and keep std::basic_string<T> as a template parameter throughout. This only gets messy if you do different things depending on the size of the template argument, as in the sketch below.
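
A sketch of that templated approach; the branching on sizeof(CharT) is exactly where it starts to get messy:

#include <string>

template <typename CharT, typename Traits, typename Alloc>
void foo(const std::basic_string<CharT, Traits, Alloc>& s)
{
    (void)s; // sketch only
    if (sizeof(CharT) == 1) {
        // treat s as a narrow/UTF-8 string
    } else if (sizeof(CharT) == 2) {
        // treat s as UTF-16, mind the surrogate pairs
    } else {
        // treat s as UTF-32
    }
}

// foo(std::string("x")); foo(std::u32string(U"x")); // both deduce cleanly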

rubenvb
  • The first solution can't handle custom traits or allocators, and the second one is total overkill. – cytrinox Oct 19 '10 at 15:15
  • See @Martin's responses, he hits the nail on the head... You can't cater for *every* possible type of string variation, just stick to the standard, no user will care what internal representation you use as far as I can see... – rubenvb Oct 19 '10 at 16:08
  • The "problem" is not to specify one internal representation. The problem is the convertion to the internal representation from various basic_strings and basic_streams. How to design/write an interface which accepts both templates with various arguments and convert them internally to the internal representation? – cytrinox Oct 19 '10 at 17:55
  • @cytrinox: accepting all sorts of strings with all kinds of allocators and traits and byte-sizes is going to be cumbersome and IMHO should be left up to the user of your library. – rubenvb Oct 19 '10 at 19:20
  • Okay, how can the user easily do this? Say I define a method that requires a std::wstreambuf* as an argument and puts some data into it - and the user has a basic_streambuf<...> object? – cytrinox Oct 19 '10 at 19:33

char_traits is indeed a hopelessly awful wastebin of random traits. Should every string pre-specify the largest supported file size, case-sensitivity, and (ugh) state type of the encoding mechanism itself? NO.

However, what you ask is impossible even with well-designed traits. string and wstring are meaningfully different because the size of the internal character type differs. To run any kind of algorithm, you need to query the object for its character type, and that requires RTTI or virtual functions, because basic_string doesn't (and shouldn't) maintain that information at runtime.

> One -possible- solution is to make each library class a template that manages the value_type, traits and allocators specified by the user. But that's overkill, and it makes abstract base classes (interfaces) impossible.

This is the only complete solution. Templates actually do play well with abstract base classes: a number of template instantiations can derive from a non-template abstract base, or the base itself can be templated. In practice, however, it is difficult, if not untenable, because writing perfectly generic code is both delicate and tedious.
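
For example, a sketch of the base-class arrangement, with hypothetical names (widget, basic_widget):

#include <cstddef>
#include <string>

// non-template abstract interface the rest of the program can hold
class widget {
public:
    virtual ~widget() {}
    virtual std::size_t length() const = 0;
};

// one template stamps out an implementation per string flavour
template <typename StringT>
class basic_widget : public widget {
public:
    explicit basic_widget(StringT s) : s_(s) {}
    std::size_t length() const { return s_.size(); }
private:
    StringT s_;
};

widget* w = new basic_widget<std::u32string>(U"hello"); // any flavour behind one interface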

> Another solution is to specify one type (e.g. u32string) as the default, so every user must pass data using this type. But now think about a project that uses 3 libraries, where the first lib uses u32string, the second lib u16string and the third lib wstring -> HELL.

This is why I'm scared by C++11's "improved" Unicode support. It simplifies direct interaction with file data and discourages abstraction to a common wchar_t internal format. It would have been better to require specific codecvts for UTF-16 and UTF-32 and specify that wchar_t must be at least 21 bits. Whereas before there were only "dumb" char and "smart" wchar_t libraries among clean C++ interfaces, we may have to contend with additional widths — and char16_t is just an instant red flag.

But, that's down the road.

If you really do end up using a number of incompatible libraries, and the problem is shuttling data between functions requiring different formats, then write a ScopeGuard-style utility that converts from and back to your chosen common format, such as wstring. This utility can be a template with an explicit specialization for each incompatible format you need, or a non-templated set of classes.
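
A rough sketch of such a guard, assuming wstring as the common format and u16string as one library's format; to_lib/from_lib are naive code-unit copies standing in for real transcoding:

#include <string>

// stand-ins for real conversions; naive copies for illustration only
inline std::u16string to_lib(const std::wstring& s)
{ return std::u16string(s.begin(), s.end()); }
inline std::wstring from_lib(const std::u16string& s)
{ return std::wstring(s.begin(), s.end()); }

// ScopeGuard-style shuttle: converts on entry, converts back on exit
class u16_shuttle {
public:
    explicit u16_shuttle(std::wstring& common)
        : common_(common), buf_(to_lib(common)) {}
    ~u16_shuttle() { common_ = from_lib(buf_); }
    std::u16string& get() { return buf_; } // the library's native type
private:
    std::wstring& common_;
    std::u16string buf_;
};

// usage, given some library function void lib_foo(std::u16string&):
//   { u16_shuttle s(text); lib_foo(s.get()); } // changes flow back into text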

Potatoswatter