1

I want to use strings encoded in UTF-8 (I'm sorry if that is bad wording; please correct me so I understand what the proper term is). Also, I want my program to be cross-platform.

IIUC, the proper way to do so is to use std::wstring and then convert it to UTF-8. The trouble is that I think that on Linux std::string is already encoded in UTF-8 (I may be wrong about that).

So what is the best way to create a UTF-8 representation of std::{w}string with the least possible conditional code?

The strings are constants; they are hard-coded, and they will be used in SQLite queries.

P.S.: I am going to try with XCode 5, hoping that it is C++11 compliant.

Igor
  • What do you mean by "use"? – 一二三 Jan 31 '16 at 22:42
  • The encoding of a string is determined by the code that creates that string. Where are you getting these strings that you want to "use" in some unspecified fashion? And exactly how do you plan to "use" them? – Nicol Bolas Jan 31 '16 at 23:26
  • @一二三, the SQLite API will accept the query string encoded as a UTF-8 string, in order to support non-English table and database names. – Igor Jan 31 '16 at 23:38
  • @igor: you didn't answer nicol's question: where do these strings come from? User input? Command-line arguments? Hard-coded as string literals? Something else? – rici Feb 01 '16 at 00:11
  • Unfortunately, there is no useful Unicode support in standard C++. I believe the most common way to handle Unicode in C++ is ICU. – Baum mit Augen Feb 01 '16 at 00:28
  • @rici, they are hard coded. – Igor Feb 01 '16 at 00:42
  • @Igor: Then put that in your question. – Nicol Bolas Feb 01 '16 at 01:03
  • @Igor: It's generally not nice to change a question *fundamentally* after an answer has been posted. You risk invalidating existing answers, and indeed that's what you were now put up to by Nicol Bolas. Better to post a new question. – Cheers and hth. - Alf Feb 01 '16 at 01:14

3 Answers

4

they are hard coded.

If all of the strings in question are hard-coded string literals, then you don't need anything special.

Using the u8 prefix when declaring such strings ensures that they are encoded in UTF-8, on every platform that supports this feature of C++11. The type of such a string literal is const char [], just like a regular string literal:

const char my_utf8_literal[] = u8"Some String.";

Of course, these can be stored in std::string (not wstring) as well:

std::string my_utf8_string = u8"Some String.";

You said that your goal was to use them in SQLite queries and commands. In that case, it should be pretty easy to make everything work. You would be using SQLite's string formatting commands to build queries, and while they are blind to UTF-8, so long as all of your inputs are UTF-8, the outputs will also be valid UTF-8. So there shouldn't be any problems.

Nicol Bolas
  • if I use `std::string my_utf8_string = u8"Some String";`, I will still be able to use `my_utf8_string.c_str()`, right? SQLite has a C interface, so... – Igor Feb 01 '16 at 01:16
  • @Igor: Yes. Only the machinery that *interprets* the string in some way (e.g. character classification, filenames, I/O) is affected. – Cheers and hth. - Alf Feb 01 '16 at 01:17
0

For UTF-8 processing there is a library called tiny-utf8. It provides a drop-in replacement for std::string, or more specifically std::u32string (its ::value_type is char32_t, but the data representation is UTF-8 with chars). That is more or less the easiest way to handle UTF-8 in C++11.

The strings are constants, they are hard coded and they will be used in the SQLite queries.

If you have hard-coded strings, you would just have to change the encoding of your source file to UTF-8 and prepend the U prefix to your string literals, from which you can then construct a utf8_string to work with.

So what is the best way to create a UTF-8 representation of std::{w}string with the least possible conditional code?

IMHO, if you are able to, don't work with wchar_t and std::wstring, since they are probably the most vaguely specified and platform-specific things in the C++ string library.

I hope this helped at least a little bit.

Cheers, Jakob

Jakob Riedle
-2

The question was changed after this answer was posted, adding that the strings are hard-coded literals to be used in SQL queries. For that, u8 string literals are a simple solution, and parts of what is answered here become irrelevant. I'm not going to chase the question through this or further changes.

Re

I want to use strings encoded in UTF-8 (I'm sorry if that is bad wording; please correct me so I understand what the proper term is). Also, I want my program to be cross-platform.

Then you're plain out of luck.

Microsoft's documentation explicitly states that their setlocale does not support UTF-8:

MSDN docs on setlocale:

The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.


Heads-up: in spite of the fact that It Does Not Work™, and is explicitly documented as not working, there are numerous web sites and blogs, probably even books, that recommend the approach, in a sort of ostrich-like way. They often look authoritative. But the info is rubbish.


Re

what is the best way to create a UTF-8 representation of std::{w}string with the least possible conditional code?

That depends on what you have available. The standard library offers std::codecvt (and, since C++11, the std::wstring_convert helper). It's been asked about and answered before, e.g. (Convert wstring to string encoded in UTF-8).

Cheers and hth. - Alf
  • @CheersandhthAlf, how portable is std::codecvt? Also is it C++11 only? – Igor Jan 31 '16 at 22:36
  • @Igor: `std::codecvt` has been there since C++98, but the UTF-8 support wasn't added until C++11. That's very portable, since it's part of the standard library. – Cheers and hth. - Alf Jan 31 '16 at 22:56
  • What exactly does Windows not supporting the UTF-8 locale have to do with using UTF-8 in strings? It seems to me that this would only matter to code that's passing those strings to the Windows API, and cross-platform code, *by definition*, is not doing so. – Nicol Bolas Jan 31 '16 at 23:27
  • @NicolBolas: `setlocale` is part of the standard C++ library, not the Windows API. The `setlocale` documented by Microsoft is the one supplied by their C and C++ runtime library, not Windows. `setlocale` affects a host of other standard library functions. – Cheers and hth. - Alf Jan 31 '16 at 23:33
  • @anonymous downvoter: please explain your downvote, so that it can more easily be ignored by readers. thank you. – Cheers and hth. - Alf Jan 31 '16 at 23:36
  • OK, but that doesn't really change the question. Using UTF-8 strings has nothing to do with locales. Not unless you *want* them to. Indeed, the OP *never* even mentioned locales. – Nicol Bolas Jan 31 '16 at 23:37
  • @NicolBolas: People who are competent in this area are familiar with `setlocale`. You have just demonstrated that you're not. Yet you have strong opinions. – Cheers and hth. - Alf Jan 31 '16 at 23:39
  • That doesn't answer my question. I use UTF-8 just fine in projects that never even consider using C++'s crappy locale support. – Nicol Bolas Jan 31 '16 at 23:39
  • @Cheersandhth.-Alf, do you know if XCode 5 is C++11 compliant? Also, do you know if Linux has string as UTF8 encoded by default? – Igor Jan 31 '16 at 23:52
  • @Igor: Sorry, I don't know about XCode 5, but I should think so. Apple has pretty good engineers. :) Re Linux that's an in-practice thing, and for the in-practice the answer is "yes". I once asked about that on the [Ubuntu StackExchange](http://unix.stackexchange.com/questions/24529/most-common-encoding-for-strings-in-c-in-linux-and-unixc), nobody would commit to plain "yes", but it's there. However, in order to make wide streams work in Linux, if you want that, you'll have to call `setlocale(LC_ALL, "")`, or explicitly set an UTF-8 locale. – Cheers and hth. - Alf Feb 01 '16 at 00:15
  • @Cheersandhth.-Alf, sorry about that and thank you. You answer has a lot of useful information. Now some strings might not be hard coded in which case I will be using suggestion you gave. Thank you. – Igor Feb 01 '16 at 01:18