
I want to develop an application on Linux. I want to use wstring because my application should support Unicode, and I don't want to use UTF-8 strings.

On Windows, using wstring is easy, because every ANSI API has a Unicode counterpart. For example, there are two CreateProcess APIs: the first is CreateProcessA and the second is CreateProcessW.

wstring app = L"C:\\test.exe";
CreateProcess
(
  app.c_str(), // EASY!
  ....
);

But it seems that working with wstring on Linux is complicated! For example, there is an API on Linux called parport_open (it's just an example),

and I don't know how to pass my wstring to this API (or to APIs like parport_open that accept a string parameter).

wstring name = L"myname";
parport_open
(
  0, // or a valid number. It is not important in this question.
  name.c_str(), // Error: because the type of this parameter is char*, not wchar_t*
  ....
);

My question is: how can I use wstrings with Linux APIs?

Note: I don't want to use UTF-8 strings.

Thanks

Amir Saniyan
  • Your initial example is wrong, `CreateProcess()` cannot be called with a `wchar_t*`, but only with a `_tchar*`. Also, wide characters don't immediately have anything to do with Unicode, and further, Windows wide characters are a crime against humanity. Perhaps [this rant of mine](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) explains a bit about wide characters and what to do with them. – Kerrek SB Sep 04 '11 at 14:16
  • Actually, the `CreateProcess` function does not exist; it's a macro for `CreateProcessW`/`CreateProcessA`, depending on the setting of the `_UNICODE` macro. – Matteo Italia Sep 04 '11 at 14:18
  • Unicode noob here: wouldn't the easiest way to hack a working version together be to make a helper function that uses `wcstombs` and returns a `std::string` with the following usage `parport_open(0, toutf8(name).c_str(), ....);`? – user786653 Sep 04 '11 at 14:33
  • @user786653: and you would perform such conversion (that - by the way - needs to allocate memory every time) for each syscall just because you don't want to use UTF-8? – Matteo Italia Sep 04 '11 at 14:37
  • @Kerrek Windows wide characters are a crime against humanity? Well maybe, but are you suggesting not using them when coding on Windows? And how would you have implemented text on Windows NT? – David Heffernan Sep 04 '11 at 14:38
  • @David: I'm not saying that Windows NT isn't complicit in the crime :-) I understand the historic reasons for it, but ultimately it's lead to the propagation of a very troublesome idiom that's a lot of headache. I refer to my linked post: Use Unicode internally, and convert to wide-string or multibyte string at well-defined interfaces as needed, that should keep your code relatively clean, maintainable and portable. – Kerrek SB Sep 04 '11 at 14:52
  • @Kerrek: isn't it what is usually done - using internally UTF-8/16 (the one your platform likes most) and then converting to a definite encoding when it's needed? – Matteo Italia Sep 04 '11 at 14:56
  • @Matteo: I wouldn't say so: I'd either use UTF32 internally if I need to know the encoding, or `wchar_t` if I don't. I'd only use UTF-8 to interface with the environment, and UTF16 (explicitly named) never. (And I'd interface with the Windows API through wchars, without encoding.) – Kerrek SB Sep 04 '11 at 14:58
  • See [Why is UTF-8 used with Unix/Linux](http://stackoverflow.com/questions/164430/why-is-it-that-utf-8-encoding-is-used-when-interacting-with-a-unix-linux-environm/). – Jonathan Leffler Sep 04 '11 at 15:33
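The helper function floated in the comments could look roughly like this. This is only a sketch: `to_narrow` is a hypothetical name, and it assumes `setlocale(LC_ALL, "")` has been called once at program start so the environment's locale (UTF-8 on most modern Linux systems) drives the conversion.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Converts a wide string to a narrow string in the current locale.
// std::wcstombs returns (size_t)-1 if a wide character has no
// representation in the current locale's encoding.
std::string to_narrow(const std::wstring& ws)
{
    std::string out(ws.size() * 4 + 1, '\0'); // 4 bytes per char covers UTF-8
    std::size_t n = std::wcstombs(&out[0], ws.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("wide string not convertible in this locale");
    out.resize(n);
    return out;
}
```

With a helper like that, the call in the question would become `parport_open(0, to_narrow(name).c_str(), ....)`, as suggested in the comments, at the cost of an allocation per call.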

2 Answers


Linux APIs (on recent kernels and with the correct locale settings) use UTF-8 strings by default¹ on almost every distribution. You should use them in your code, too. Resistance is futile.

wchar_t (and thus wstring) was convenient on Windows only while Unicode was limited to 65536 characters (i.e. while wchar_t held UCS-2); now that Windows' 16-bit wchar_t is used for UTF-16, the advantage of 1 wchar_t = 1 Unicode character is long gone, so you have the same disadvantages as with UTF-8. Nowadays, IMHO, the Linux approach is the most correct one. (Another answer of mine on UTF-16 and why Windows and Java use it)
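The loss of the 1:1 mapping is easy to demonstrate with C++11's char16_t, which mirrors what a 16-bit Windows wchar_t actually stores:

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 encodes
// it as a surrogate pair: two 16-bit code units for one character.
std::u16string smiley = u"\U0001F600";
// smiley.size() == 2, even though it is a single Unicode code point
```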

By the way, neither string nor wstring is encoding-aware, so you can't reliably use either of them to manipulate Unicode code points. I've heard that wxString from the wxWidgets toolkit handles UTF-8 nicely, but I've never done extensive research on it.
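A quick illustration of that encoding-unawareness (the byte values below are the UTF-8 encoding of U+00E9):

```cpp
#include <string>

// std::string sees bytes, not characters: the single character "é"
// (U+00E9) occupies two bytes in UTF-8, so size() reports 2, and no
// std::string member function can count code points for you.
std::string e_acute = "\xC3\xA9";
// e_acute.size() == 2
```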


  ¹ Actually, as pointed out in the comments below, the kernel aims to be encoding-agnostic, i.e. it treats strings as opaque sequences of (NUL-terminated?) bytes (and that's why encodings that use "wider" code units, like UTF-16, cannot be used). On the other hand, wherever actual string manipulation is done, the current locale setting is used, and by default on almost any modern Linux distribution it is set to UTF-8 (which is a reasonable default to me).
Matteo Italia
  • Arguing about whether UTF-8 is better than UTF-16 or not is somewhat moot (UTF-8 is better). Ultimately all that matters is that Windows uses UTF-16 and Linux uses UTF-8 and the rest of us have to follow suit. – David Heffernan Sep 04 '11 at 14:21
  • If the requirement is explicitly "Unicode", rather than "my platform's character set", you shouldn't use wide chars at all and instead switch to a modern compiler and use `char32_t`. – Kerrek SB Sep 04 '11 at 14:21
  • David: Actually, where does Linux "use" UTF-8? File systems, as one of the most important application, don't define any encodings at all, they just use null-terminated byte strings. I think it's only NTFS that for some reason uses null-terminated 16-bit strings (but also with no encoding semantics!). – Kerrek SB Sep 04 '11 at 14:23
  • @David: what I'm saying is that, in my opinion, if Windows were rewritten today it would probably make more sense to just use UTF-8 and forget about UTF-16 and the `W`/`A` stuff. But I'm with you that arguing about things that can't be changed is useless. – Matteo Italia Sep 04 '11 at 14:24
  • @Kerrek: AFAIK the Linux *kernel* tries to be encoding-agnostic, as do file systems. Where encoding matters, the current locale setting should be used, and currently the default is UTF-8 almost everywhere. – Matteo Italia Sep 04 '11 at 14:25
  • @Matteo: Are there any "Linux API" functions though that require knowledge of encodings? What did you have in mind? – Kerrek SB Sep 04 '11 at 14:30
  • @Matteo Yes if Windows was re-written it would be done with char* and UTF8 encodings. What many people don't realise is that Windows supported Unicode before UTF-8 existed! – David Heffernan Sep 04 '11 at 14:31
  • @David: I know, and I actually wrote an answer about that. :) The point is that when these decisions were made in the NT project (early nineties, or even before?) Unicode was limited to the BMP, so it was a sensible decision to just use 2-byte `wchar_t`s, that had the advantage of 1 `wchar_t`=1 Unicode character - and UTF-8 was born in 1992 to accommodate the needs of UNIces. IIRC the extension to 2^32 characters was standardized in about 1994-1995 (NT4, Windows 95), so Microsoft never really had a chance to do things differently. – Matteo Italia Sep 04 '11 at 14:34
  • @Matteo: Nitpick: Unicode has room for 2^21 characters at present. UTF-8 can represent up to 2^36 values. – Kerrek SB Sep 04 '11 at 14:36
  • @Kerrek Instead of nitpicking and raging against wide char crimes against humanity, and confusing the heck out of the poor asker of the question, how about some practical advice? – David Heffernan Sep 04 '11 at 14:41
  • @Kerrek: just checked, you're right, only 17 planes are actually in use. – Matteo Italia Sep 04 '11 at 14:41
  • @David: Advice is in the post of mine that I linked :-) With the sheer quantity of confusion that surrounds encodings and wide characters, I think a bit of nitpicking is erring on the safe side... – Kerrek SB Sep 04 '11 at 14:46
  • @Kerrek: as currently defined, UTF-8 only uses 1-4 bytes to support the range of Unicode. That the coding scheme for UTF-8 could be extended further is undeniable, but the formal definition of UTF-8 precludes extended schemes (the bytes 0xF5..0xFF can never appear in valid UTF-8, any more than 0xC0 or 0xC1 can). See [Unicode FAQ on UTF-8, UTF-16, UTF-32 and BOM](http://www.unicode.org/faq//utf_bom.html) and links from it. – Jonathan Leffler Sep 04 '11 at 15:11
  • @Jonathan: You're right, I should have said that the Thompson scheme can be used for up to 36 bits, but for up to 21 bits it requires at most 4 code units. Thanks! – Kerrek SB Sep 04 '11 at 15:14

> I don't want to use UTF-8 strings.

Well, you will need to overcome that reluctance, at least when calling the APIs. Linux uses single-byte string encodings, invariably UTF-8. Clearly you should use a single-byte string type, since you obviously can't pass wide characters to a function that expects char*. Use string rather than wstring.
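For example (a sketch; the file name is arbitrary), a UTF-8 encoded std::string can be handed straight to any char*-taking Linux call:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// The kernel treats the name as an opaque NUL-terminated byte
// sequence, so UTF-8 bytes pass through untouched.
std::string path = "/tmp/r\xC3\xA9sum\xC3\xA9.txt"; // "résumé" in UTF-8
int fd = open(path.c_str(), O_CREAT | O_WRONLY, 0644);
// ... use fd, then close(fd) ...
```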

David Heffernan