How can I avoid encoding mixups of strings in a C/C++ API?

Question

I'm implementing several APIs in C and C++ and wonder what techniques are available to keep clients from getting the encoding wrong when they receive strings from the framework or pass them back. For instance, imagine a simple plugin API in C++ which customers can implement to influence translations. It might feature a function like this:

const char *getTranslatedWord( const char *englishWord );

Now, let's say that I'd like to enforce that all strings are passed as UTF-8. Of course I'd document this requirement, but I'd like the compiler to enforce the right encoding, maybe by using dedicated types. For instance, something like this:

class Word {
public:
  static Word fromUtf8( const char *data ) { return Word( data ); }
  const char *toUtf8() const { return m_data; }

private:
  Word( const char *data ) : m_data( data ) { }

  const char *m_data;
};

I could now use this specialized type in the API:

Word getTranslatedWord( const Word &englishWord );

Unfortunately, it's easy to make this very inefficient. The Word class lacks proper copy constructors, assignment operators, etc., and I'd like to avoid unnecessary copying of data as much as possible. Also, I see the danger that Word gets extended with more and more utility functions (like length, fromLatin1, or substr), and I'd rather not write Yet Another String Class. I just want a small container that prevents accidental encoding mixups.

I wonder whether anybody else has some experience with this and can share some useful techniques.

EDIT: In my particular case, the API is used on Windows and Linux using MSVC 6 - MSVC 10 on Windows and gcc 3 & 4 on Linux.

@Anders: I updated my question to answer your comment. – Frerich Raabe May 21 '10 at 10:30

JoeG · Accepted Answer · 2010-05-21T14:13:42.360

You could pass around a std::pair instead of a char*:

struct utf8_tag_t{} utf8_tag;
std::pair<const char*,utf8_tag_t> getTranslatedWord(std::pair<const char*,utf8_tag_t> englishWord);

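The generated machine code should be close to identical on a decent modern compiler, since utf8_tag_t is an empty type. Note that std::pair itself stores both members, so the tag can still cost some padding; a pair that applies the empty base class optimization (for example boost::compressed_pair) avoids that.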
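To make the tagged-pair idea concrete, here is a minimal sketch of how a caller might use it. The names Utf8Ptr and fromUtf8 are hypothetical helpers that are not part of the answer, and the body of getTranslatedWord is a toy stand-in for a real plugin. The code sticks to C++98 so that it has a chance of building on the older compilers mentioned in the question.

```cpp
#include <cassert>
#include <cstring>
#include <utility>

struct utf8_tag_t {};

// Hypothetical shorthand for the tagged pointer type used by the API.
typedef std::pair<const char*, utf8_tag_t> Utf8Ptr;

// Hypothetical helper: the single place where a caller asserts
// "these bytes are UTF-8". Nothing is copied; only the pointer is wrapped.
inline Utf8Ptr fromUtf8(const char* data) {
    assert(data != 0);
    return Utf8Ptr(data, utf8_tag_t());
}

// Toy stand-in for the plugin function: it knows exactly one word.
inline Utf8Ptr getTranslatedWord(Utf8Ptr englishWord) {
    if (std::strcmp(englishWord.first, "hello") == 0)
        return fromUtf8("hallo");
    return englishWord;  // unknown words are returned unchanged
}
```

A bare const char* does not convert implicitly to the pair, so getTranslatedWord("hello") is rejected by the compiler, while getTranslatedWord(fromUtf8("hello")).first yields the translated bytes. The tag only records the caller's claim; it does not verify that the bytes really are UTF-8.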

I wouldn't bother with this, though. I'd just use char* and document that the input has to be UTF-8. If the data could come from an untrusted source, you're going to have to check the encoding at runtime anyway.
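As an illustration of what such a runtime check could look like, the following is a rough sketch of a validator for NUL-terminated strings. The function name isValidUtf8 is an assumption made for this example, not something from the answer, and a production system might prefer a well-tested library routine instead.

```cpp
// Returns true if the NUL-terminated byte string is well-formed UTF-8.
// Rejects overlong forms, UTF-16 surrogates and values above U+10FFFF.
// (Hypothetical helper, written in C++98 for older compilers.)
inline bool isValidUtf8(const char* s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p) {
        unsigned long cp;   // code point being decoded
        int extra;          // number of continuation bytes expected
        unsigned long min;  // smallest code point allowed for this length
        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        ++p;
        for (; extra > 0; --extra, ++p) {
            // A terminating NUL fails this test too, so we never read past it.
            if ((*p & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
    }
    return true;
}
```

A check like this catches the most common mixup, Latin-1 text passed where UTF-8 is expected, because a lone byte such as 0xE9 is not a complete UTF-8 sequence. Pure ASCII input passes, since ASCII is valid in both encodings.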

score 1 · Answer 2 · edited May 23 '17 at 12:00


I suggest that you use std::wstring.

Check out this other question for details.

answered May 21 '10 at 11:35 by radman

Yes, std::wstring looks like a candidate. However, I was wondering whether there is maybe something which doesn't require people to link their plugins against the standard C++ library. At least with Visual Studio 2009 it's not all inline template magic as far as I can see. – Frerich Raabe May 21 '10 at 13:23

Using std::wstring isn't a good idea. It's a sequence of wchar_t, which is a 16-bit integer type on Microsoft compilers and a 32-bit integer type on gcc. So a std::wstring could reasonably contain UTF-16LE, UTF-16BE, UTF-32BE or UTF-32LE. – JoeG May 21 '10 at 14:11
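The width difference mentioned in the comment above is easy to observe directly. This small sketch, using a hypothetical helper named wcharBits, reports how many bits one wchar_t occupies on the compiler at hand; the exact value depends on the platform.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in one wchar_t on the current compiler and platform.
// (Hypothetical helper for illustration only.)
inline std::size_t wcharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Because the result differs between toolchains, an API that exchanges std::wstring across a plugin boundary would still have to document which encoding and byte order it expects.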

score 0 · Answer 3 · answered May 21 '10 at 11:55


The ICU project provides a Unicode support library for C++.

answered by jopa

True, but I'd rather not pull in a whole new library. – Frerich Raabe May 21 '10 at 13:17
Unless you need other functions it provides… – Steven R. Loomis May 24 '10 at 17:43
