4

I'm implementing several APIs in C and C++ and wonder what techniques are available to keep clients from getting the encoding wrong when they receive strings from the framework or pass them back. For instance, imagine a simple plugin API in C++ that customers can implement to influence translations. It might feature a function like this:

const char *getTranslatedWord( const char *englishWord );

Now, let's say that I'd like to enforce that all strings are passed as UTF-8. Of course I'd document this requirement, but I'd like the compiler to enforce the right encoding, maybe by using dedicated types. For instance, something like this:

class Word {
public:
  static Word fromUtf8( const char *data ) { return Word( data ); }
  const char *toUtf8() const { return m_data; }

private:
  Word( const char *data ) : m_data( data ) { }

  const char *m_data;
};

I could now use this specialized type in the API:

Word getTranslatedWord( const Word &englishWord );

Unfortunately, it's easy to make this very inefficient. The Word class lacks proper copy constructors, assignment operators, etc., and I'd like to avoid unnecessary copying of data as much as possible. I also see the danger that Word gets extended with more and more utility functions (like length, fromLatin1, or substr), and I'd rather not write Yet Another String Class. I just want a small container that prevents accidental encoding mix-ups.

I wonder whether anybody else has some experience with this and can share some useful techniques.

EDIT: In my particular case, the API is used on Windows and Linux using MSVC 6 - MSVC 10 on Windows and gcc 3 & 4 on Linux.

Frerich Raabe

3 Answers

4

You could pass around a std::pair instead of a char*:

struct utf8_tag_t{} utf8_tag;
std::pair<const char*,utf8_tag_t> getTranslatedWord(std::pair<const char*,utf8_tag_t> englishWord);

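Note that the generated machine code is not guaranteed to be identical to the plain char* version: std::pair keeps first and second as ordinary data members, so the empty tag usually still occupies storage and the empty base class optimization does not apply. A small hand-written wrapper that holds only the pointer avoids that overhead.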
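A hand-written tagged wrapper is one way to make sure the tag costs no storage. The following is only a minimal sketch: EncodedPtr, latin1_tag_t, Utf8Ptr and Latin1Ptr are hypothetical names, the body of getTranslatedWord is a stand-in that echoes its input, and the wrapper is non-owning, so the caller is assumed to keep the buffer alive. It sticks to C++98 so that older compilers have a chance of accepting it.

```cpp
// Hypothetical names throughout; none of these come from an existing library.
struct utf8_tag_t {};
struct latin1_tag_t {};

// Non-owning pointer labelled with an encoding tag. The tag is only a
// template parameter, so the object stores nothing but the pointer.
template <typename EncodingTag>
class EncodedPtr {
public:
    explicit EncodedPtr(const char *data) : m_data(data) {}
    const char *get() const { return m_data; }

private:
    const char *m_data;  // assumption: the caller keeps the buffer alive
};

typedef EncodedPtr<utf8_tag_t> Utf8Ptr;
typedef EncodedPtr<latin1_tag_t> Latin1Ptr;

// Stand-in for the plugin function from the question; it echoes its input.
inline Utf8Ptr getTranslatedWord(Utf8Ptr englishWord) {
    return englishWord;
}

// getTranslatedWord(Latin1Ptr("caf\xE9"));  // does not compile: wrong tag
// getTranslatedWord("hello");               // does not compile: explicit ctor
```

Callers have to write Utf8Ptr("...") explicitly, which is the point: the encoding claim becomes visible at the call site. The type still cannot prove that the bytes really are UTF-8.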


I wouldn't bother with this, though. I'd just use char* and document that the input has to be UTF-8. If the data could come from an untrusted source, you're going to have to check the encoding at runtime anyway.
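For the runtime check, something along these lines could be used. It is a sketch rather than a vetted implementation: isValidUtf8 is a hypothetical helper name, and it only checks well-formedness (correct lead and continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF), not whether the text is meaningful.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the first `len` bytes at `s` form
// well-formed UTF-8.
inline bool isValidUtf8(const char *s, std::size_t len) {
    const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
    std::size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        std::size_t extra;  // number of continuation bytes expected
        unsigned long cp;   // code point decoded so far
        unsigned long min;  // smallest code point allowed for this length
        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > len - i - 1) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += extra + 1;
    }
    return true;
}
```

A plugin host could call this once at the API boundary, for example isValidUtf8(word, std::strlen(word)), and reject or log anything that fails.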

JoeG
1

I suggest that you use std::wstring.

Check out this other question for details.

radman
  • Yes, std::wstring looks like a candidate. However, I was wondering whether there is maybe something which doesn't require people to link their plugins against the standard C++ library. At least with Visual Studio 2009 it's not all inline template magic as far as I can see. – Frerich Raabe May 21 '10 at 13:23
  • Using std::wstring isn't a good idea. It's a sequence of wchar_t, which is a 16-bit integer type on Microsoft compilers and a 32-bit integer type on gcc. So a std::wstring could reasonably contain UTF-16LE, UTF-16BE, UTF-32BE or UTF-32LE. – JoeG May 21 '10 at 14:11
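The platform difference mentioned in the comment above is easy to observe. A tiny sketch (wcharBits is a hypothetical helper name) that reports the width of wchar_t on whatever compiler builds it:

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the current platform.
inline std::size_t wcharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Because the result depends on the compiler and its settings, a std::wstring crossing a plugin boundary says nothing by itself about which encoding it holds.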
0

The ICU project provides a Unicode support library for C++.

jopa