16

Is it possible to have char* strings work with UTF-8 encoding in C++ (VC2010)?

For example, if my source file is saved in UTF-8 and I write something like this:

const char* c = "aäáéöő";

Is it possible to make it UTF-8 encoded? And if so, how can I use

char* c2 = new char[strlen("aäáéöő")];

for dynamic allocation if characters can be variable length?

sekmet64
  • Does VS2010 support C++0x? If so, you can try `u8`, described here: http://en.wikipedia.org/wiki/C%2B%2B0x#New_string_literals – Naveen May 20 '11 at 13:12
  • seems like it isn't implemented or maybe I would need some compiler parameters. – sekmet64 May 20 '11 at 13:14
  • I don't think it supports those literals. It only implements choice features. – Skurmedel May 20 '11 at 13:16
  • No, VS2010 implements 5 features from C++0x: lambdas, r-value references, auto/decltype (type inference), and nullptr. This was done well before the C++11 draft was finalized (in the last months). – Klaim May 20 '11 at 13:16
  • Ah, yes the lambdas and nullptr. There is some nice new library stuff though. – Skurmedel May 20 '11 at 13:19
  • Those characters are not variable length. If they were, and one byte happened to be 00, strlen wouldn't work. – Bo Persson May 20 '11 at 13:26
  • @Bo, UTF-8 encoding guarantees that no encoding matches \0. – Andy Finkenstadt May 20 '11 at 13:40
  • possible duplicate of [How to initialize a const char* and/or const std::string in C++ with a sequence of UTF-8 character?](http://stackoverflow.com/questions/3881264/how-to-initialize-a-const-char-and-or-const-stdstring-in-c-with-a-sequence-o) – Nemanja Trifunovic May 20 '11 at 13:40
  • @Andy Finkenstadt The C++ standard requires that for any encoding used in C++, no byte can be `'\0'`. UTF-8 guarantees this, but so do most other multibyte encodings. – James Kanze May 23 '11 at 08:21
  • Technically a UTF-8 encoded string can contain a `\0`, but only if the text itself contains a `\0`, which doesn't make sense for most "text". – Mooing Duck Sep 29 '13 at 19:40

5 Answers

16

The encoding for narrow character string literals is implementation defined, so you'd really have to read the documentation (if you can find it). A quick experiment shows that both VC++ (VC8, anyway) and g++ (4.4.2, anyway) actually just copy the bytes from the source file; the string literal will be in whatever encoding your editor saved it in. (This is clearly in violation of the standard, but it seems to be common practice.)

C++11 has UTF-8 string literals, which would allow you to write u8"text", and be assured that "text" was encoded in UTF-8. But I don't really expect it to work reliably: the problem is that in order to do this, the compiler has to know what encoding your source file has. In all probability, compiler writers will continue to ignore the issue, just copying the bytes from the source file, and achieve conformance simply by documenting that the source file must be in UTF-8 for these features to work.
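
For illustration, here is a minimal sketch of what that looks like under a compiler that does implement u8 literals (not VC2010); note that strlen still counts bytes, not characters:

#include <cstdio>
#include <cstring>

int main() {
    // The standard guarantees UTF-8 bytes here, provided the compiler
    // decoded the source file correctly in the first place.
    const char* s = u8"aäáéöő";
    std::printf("%u bytes\n", (unsigned)std::strlen(s)); // 11 bytes for 6 characters
}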

James Kanze
  • The notion that a working program can be made buggy by changing the encoding scheme of the source file from UTF-8 to UTF-16 reinforces my impression that C++ is pure chaos! Someone please tell me that this isn't true. :( – Jeffrey L Whitledge May 20 '11 at 13:39
  • @Whitledge The notion that any program which reads text will have problems coping with input without knowing its encoding doesn't seem particularly surprising to me; I don't see how it can be otherwise. The C++ standard is quite clear about what should happen for a given sequence of input characters (although the two most widely used compilers ignore the standard in that regard), but it can't have much to say about how the compiler interprets the encoding of the input. (Most platforms, for example, don't support UTF-16.) – James Kanze May 20 '11 at 14:17
  • @James Kanze - Obviously *not knowing* the encoding would cause problems. That was not what I was talking about. It sounded to me like a program would compile with either UTF-8 or UTF-16, but the behavior of the compiled program would be different depending on the encoding. If most platforms don't support UTF-16, then I guess this isn't the case at all. It sounds like C++ source files are not really text files at all, but binary files with some similarity to text. If that is the case, then the "encoding" would certainly make a difference, since it isn't really a text encoding. – Jeffrey L Whitledge May 20 '11 at 14:27
  • If I understand correctly, the only reason a C++ source file could be understood as UTF-8 (in an editor that knows about encodings) is because (1) UTF-8 is binary-compatible with ASCII for characters in the (7-bit) ASCII range and (2) C++ is transparent with regard to characters outside of the ASCII range. As someone with little C++ experience, that is an interesting way to think about source code! – Jeffrey L Whitledge May 20 '11 at 14:35
  • I'm not sure about the standards violation. Clearly, a compiler that defined the charset as UTF8 is compliant. One that defines it as UTF8 or Latin1 depending on a `/charset:` switch would also be compliant. One that defines it as UTF8 or Latin1 depending on the input charset would be, too - the standard doesn't mandate a lot in this regard, as long as it's documented. – MSalters May 20 '11 at 14:48
  • @Jeffry - The idea is that C++ should **allow** a wide range of implementations by not specifying all the tiny details down to the bit level. Some C++ programs have the source encoded in EBCDIC and are still valid. No one expects UTF-16 encoded string literals in those programs! – Bo Persson May 20 '11 at 15:01
  • @Bo Persson - My prior expectation was that the encoding of the source file and the encoding of the target platform were distinct concepts, and string literals would become string data in whatever target encoding was specified or implied for that literal, regardless of the encoding of the source (so long as a proper mapping between the two was available). This would imply that a source code file could be converted from one encoding to another without altering the resulting compiled binaries – Jeffrey L Whitledge May 20 '11 at 15:29
  • (if the encoding of the source were correctly indicated, and a mapping existed between all of the characters, of course). But my prior expectations are clearly incorrect. It sounds like the C++ compiler does a binary copy of the string literals from the source file into the object file, implying that C++ source files are not text files at all, but they actually encode binary data. – Jeffrey L Whitledge May 20 '11 at 15:29
  • @Jeffry All files on your computer are binary. It's only by specific conventions that some bit patterns are understood as character encodings. Those conventions, however, are far from universal, and **every** program which reads text (and not just the C++ compiler) has to deal with it. C++ actually solves the problem better than most, at least as far as the standard is concerned. (Java's solution here, for example, is copied almost verbatim from the C++ standard.) – James Kanze May 20 '11 at 15:54
  • @MSalters The compiler is allowed a great deal of freedom with regards to what it accepts as input, and how it interprets it; the C++ standard definitely doesn't impose a specific encoding for the input. The standard does make two important restrictions, however. The first is that it document what it does, and the second is that once it has translated the input into Unicode, it generate the same (documented) encoding for string literals, regardless of the input encoding. Neither g++ nor VC++ do either. – James Kanze May 20 '11 at 15:58
  • @Jeffrey Re your response to Bo. What you expect is what the C++ standard requires, up to a certain point. If the compiler is generating code for a different platform, the binary image (including the encoding of characters in a string literal) may change---you want the compiler to generate the normal native encoding. And a compiler is not required to accept all encodings: it could, for example, say that only UTF-8 is supported. In which case, the compiler must document this fact, and generate an error if it encounters something that wouldn't be legal UTF-8. – James Kanze May 20 '11 at 16:01
  • @Jeffrey I think that is a quite nice summary. Yes, C++ source files are binary files (in contrast to e.g. Python files, where the encoding defaults to ASCII/UTF-8 and has to be specified if different), and C++ (before C++0x) has no real string data type at all—`char` arrays are just byte arrays. There are some conventions beyond that: for example, Windows fixes `wchar_t` to be a UTF-16 code unit, and thus arrays of `wchar_t` are always UTF-16 strings. C++0x cleans up the mess a bit by introducing UTF-8, UTF-16 and UTF-32 code units, string data types and literals. – Philipp May 21 '11 at 14:48
  • @Philipp All files on a computer are binary, not just C++ source files. The question is how a given program (in this case the C++ compiler) interprets them. The C++ standard imposes certain behaviors, while leaving others up to the compiler: which input encodings are accepted and how you specify them, for example, is implementation defined. But just copying the source file bytes into a string literal is not allowed---even if that's what g++ and VC++ do. – James Kanze May 23 '11 at 08:24
  • @James Kanze - The distinction between text and binary is not meaningless. There are a class of programs that take text files as input and interpret them according to a specified character encoding (usually from an operating system setting, an HTTP header, command line switch, or some other source). Among these programs are search tools, text editors/word processors, browsers, FTP clients, source control systems, and some compilers. It sounds like C++ compilers are not generally among these programs. – Jeffrey L Whitledge May 23 '11 at 21:08
  • @Jeffrey L Whitledge The distinction between text and binary is one of the conventions used by the application code reading the file; there is no difference in the files themselves. And there is text and text; XML is text, but it is also more. Some text formats, like XML, do have provisions for indicating the encoding. Others, like C++ sources (but also the sources for all other languages I know), don't have such a provision. In this regard, C++ is like all other programming languages. – James Kanze May 24 '11 at 08:21
  • "The distinction between text and binary is one of the conventions used by the application code reading the file; there is no difference in the files themselves." This is exactly why I said, "There are a class of programs..." rather than "There are a class of files...". The Microsoft C# compiler will accept files in any Windows code page, indicated by the "/codepage:" switch. If the source were converted from one codepage to another (assuming the characters used exist in both codepages), then the object files that result will be identical. – Jeffrey L Whitledge May 24 '11 at 15:27
  • Just now, I converted a C# source file from ASCII to EBCDIC and compiled it with the /codepage:37 switch. It worked just fine. The character encoding is completely transparent. If C++ does a *binary* copy of bytes from the source file, then this same trick will not work in C++. – Jeffrey L Whitledge May 24 '11 at 15:29
  • @Jeffrey Sure, and there's nothing to prevent a C++ compiler from doing likewise. And I'm sure that both will act up badly if you lie about the encoding. (To be fair, the issue is more complex in C++, because of include files. Compiling the system headers as if they were EBCDIC is definitely not a good idea. But solutions for this exist as well, if the compiler implementors wanted to use them.) – James Kanze May 24 '11 at 15:55
  • @Jeffrey And just to remind you (since I've already pointed it out): a C++ compiler which simply does a binary copy of bytes from the source file (for string literals) is *not* compliant. The C++ standard requires more than that. – James Kanze May 24 '11 at 15:57
4

If the text you want to put in the string is in your source code, make sure your source code file is in UTF-8.

If that doesn't work, try using \u1234, where 1234 is the character's code point in hexadecimal.

You can also try the UTF8-CPP library.

Take a look at this answer: Using Unicode in C++ source code
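
A minimal sketch of the \u approach (the code points below are my own mapping of the characters from the question; the bytes you actually get still depend on the compiler's execution character set):

#include <cstdio>

int main() {
    // Universal character names: ä = U+00E4, á = U+00E1, é = U+00E9, ö = U+00F6, ő = U+0151
    const char* s = "a\u00E4\u00E1\u00E9\u00F6\u0151";
    std::printf("%s\n", s);
}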

Klaim
1

It is possible: save the file in UTF-8 encoding, without a BOM signature.

// Save as UTF-8 without BOM signature
#include <stdio.h>
#include <string.h>
#include <windows.h>

int main() {
    SetConsoleOutputCP(65001);            // switch console output to the UTF-8 code page
    const char *c1 = "aäáéöő";            // bytes copied verbatim from the UTF-8 source file
    char *c2 = new char[strlen(c1) + 1];  // strlen counts bytes; +1 for the terminating null
    strcpy(c2, c1);
    printf("%s\n", c1);
    printf("%s\n", c2);
    delete[] c2;
    return 0;
}

Result:

 D:\Debug>program
aäáéöő
aäáéöő

Redirecting the program's output to a file really does produce a UTF-8 encoded file.
This is a compiler-independent answer (compiled on Windows).
(A similar question.)

vladasimovic
  • Undefined behavior code, as you allocate enough memory for 6 characters in c2, while you copy 7 characters from c1 to c2. – neo5003 Jun 04 '19 at 13:50
1

See this MSDN article, which talks about converting between string types (it should give you examples of how to use them). The string types that are covered include char*, wchar_t*, _bstr_t, CComBSTR, CString, basic_string, and System.String:

How to: Convert Between Various String Types
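
As one concrete example (my own sketch, not code from the article; the helper name ToUtf8 is made up): converting a wide UTF-16 string to a UTF-8 encoded std::string on Windows with WideCharToMultiByte:

#include <windows.h>
#include <string>

std::string ToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    // First call asks for the required buffer size in bytes.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                    NULL, 0, NULL, NULL);
    std::string utf8(bytes, '\0');
    // Second call performs the actual conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                        &utf8[0], bytes, NULL, NULL);
    return utf8;
}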

yasouser
1

There is a hotfix for VisualStudio 2010 SP1 which can help: http://support.microsoft.com/kb/980263.

The hotfix adds a pragma that lets you override the character encoding Visual Studio uses for char-based literals:

#pragma execution_character_set("utf-8")

Without the pragma, char*-based literals are interpreted using the default code page (typically Windows-1252).
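
A minimal sketch of how the pragma is used (assuming the hotfix is installed and the source file itself is saved as UTF-8):

#pragma execution_character_set("utf-8") // requires the KB980263 hotfix on VS2010 SP1

// With the pragma, the bytes of this narrow literal are emitted as UTF-8
// instead of being converted to the default code page.
const char* c = "aäáéöő";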

This should all be superseded eventually by the new string literal prefixes specified by C++0x (u8, u, and U for UTF-8, UTF-16, and UTF-32 respectively), which ideally will be supported in the next major version of Visual Studio after 2010.

Zoner