
When it comes to internationalization & Unicode, I'm an idiot American programmer. Here's the deal.

#include <string>
using namespace std;

typedef basic_string<unsigned char> ustring;

int main()
{
    static const ustring my_str = "Hello, UTF-8!"; // <== error here
    return 0;
}

This emits a not-unexpected complaint:

cannot convert from 'const char [14]' to 'std::basic_string<_Elem>'

Maybe I've had the wrong portion of coffee today. How do I fix this? Can I keep the basic structure:

ustring something = {insert magic incantation here};

?

Patryk
John Dibling
  • Doesn't answer your question, but read this article on i18n: http://www.joelonsoftware.com/articles/Unicode.html – Starkey Sep 30 '10 at 20:36
  • You probably need to provide your own `char_traits` specialization. AFAIK, `<string>` only provides specializations for `char` and `wchar_t`. – Praetorian Sep 30 '10 at 20:44
  • Would there be an issue with using std::string instead? I gather you are using UTF-8, so individual characters could end up negative (so to speak). If you eliminate the const and cast the string to unsigned char*, it will allow the assignment, but it doesn't look pretty. – Daryl Hanson Sep 30 '10 at 20:56
  • @Daryl: I'm using libxml which passes around a bunch of `unsigned char*`, so I think std::string is a no-go – John Dibling Sep 30 '10 at 21:19

2 Answers


Narrow string literals have type `const char[N]`, and there are no unsigned string literals[1], so you'll have to cast:

ustring s = reinterpret_cast<const unsigned char*>("Hello, UTF-8");

Of course you can put that long thing into an inline function:

inline const unsigned char *uc_str(const char *s){
  return reinterpret_cast<const unsigned char*>(s);
}

ustring s = uc_str("Hello, UTF-8");

Or you can just use basic_string<char> and get away with it 99.9% of the time you're dealing with UTF-8.

[1] Unless char is unsigned, but whether it is or not is implementation-defined, blah, blah.

Steve M
  • @Steve, I know this is old, but I'm curious, when does `basic_string` not work for storing UTF-8 encoded strings? It is just storing a sequence of bytes which has never failed me yet. Is there a corner case I'm not aware of? – Matthew Sep 13 '17 at 19:57

Using different character types for different encodings has the advantage that the compiler barks at you when you mix them up. The downside is that you have to convert manually.

A few helper functions to the rescue:

inline ustring convert(const std::string& sys_enc) {
  return ustring( sys_enc.begin(), sys_enc.end() );
}

template< std::size_t N >
inline ustring convert(const char (&array)[N]) {
  return ustring( array, array + N - 1 ); // N includes the terminating '\0'
}

inline ustring convert(const char* pstr) {
  return ustring( reinterpret_cast<const ustring::value_type*>(pstr) );
}

Of course, all these fail silently and fatally when the string to convert contains anything other than ASCII.

sbi
  • Somehow I cannot use the third overload of `convert`. I get the following compile error: `error: cast from 'const char*' to 'std::__cxx11::basic_string::value_type {aka unsigned char}' loses precision [-fpermissive] return ustring( reinterpret_cast(pstr) );`. [coliru link](http://coliru.stacked-crooked.com/a/66b1d6c08a1ad63e) – Patryk Feb 22 '16 at 15:25
  • @Patryk: I believe I've fixed this now. Sorry I got this wrong so long ago. – sbi Feb 22 '16 at 15:28
  • That's what we have SO for :) – Patryk Feb 22 '16 at 15:33