3

I am trying to make a Java application and a VS C++ application communicate and send different messages to each other using sockets. The only problem I have so far is that I am absolutely lost in their encodings.

By default Java uses UTF-8, which, as far as I am concerned, is a Unicode charset. In my VS project I have the settings set to Unicode. Yet for some reason, when I debug my code, I always see my strings encoded as CP1252 in memory. Furthermore, if I try to use CP1252 in Java it works fine for English letters, but whenever I try some Russian letters I get a 3f byte for every letter. If, on the other hand, I try to use UTF-8 in Java, each English letter is 1 byte long, but every Russian one is 2 bytes long. Isn't it a multibyte encoding?
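What I'm observing can be reproduced with something like this (a minimal sketch; the class name and the sample letter Я are just illustrations):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String russian = "Я"; // U+042F, Cyrillic capital Ya

        // UTF-8: characters above U+007F take more than one byte
        byte[] utf8 = russian.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 length: " + utf8.length); // 2

        // CP1252 cannot represent Cyrillic; the encoder substitutes '?' (0x3F)
        byte[] cp1252 = russian.getBytes(Charset.forName("windows-1252"));
        System.out.printf("CP1252 byte: %02x%n", cp1252[0]); // 3f
    }
}
```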

Some C++ docs say that std::string (char) uses the UTF-8 code page, and std::wstring (wchar_t) uses UTF-16. When I debug my application I see CP1252 encoding for both of them, though wstring has empty bytes between the letters.

Could you please explain how encodings behave in both Java and C++, and how my two apps should communicate?

Hannele
black
  • Can't help with the Java part, but in VC++, try going to Project->Properties->Configuration Properties->General->Character Set and making the value "Use Multi-Byte Character Set" – Proxy Feb 03 '14 at 21:08

3 Answers


UTF-8 has a variable length per character. Common characters take less space because they are encoded in fewer bytes; less common characters take more space because they have to be encoded in more bytes. Since most of this was invented in the US, guess which characters are shorter and which are longer?
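A quick sketch of how the byte count grows for less common characters (the class name and sample characters are just illustrative choices):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // ASCII encodes to 1 byte, Cyrillic to 2, CJK to 3
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length); // 1
        System.out.println("Я".getBytes(StandardCharsets.UTF_8).length); // 2
        System.out.println("中".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```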

If you want sockets to work, you will have to get both sides to agree on the encoding. Otherwise, you are fighting a losing battle.

CodeChimp
  • What would be easier then? Using CP1252 in Java or UTF-8 in C++ ? – black Feb 03 '14 at 20:24
    First, I don't think it matters what encoding the VC++ app uses in memory, it's what it sends over the Socket. So, the question is: Is it easier to change your Java app to send CP1252, or is it easier to change the VC++ app to send UTF-8. I would prefer the latter, as I am a *nix head and hate everything Windows. But, I think that is a complete and total opinion based solely on my hatred. – CodeChimp Feb 03 '14 at 20:29

It's not true that Java does UTF-8 encoding. You can write your source code in UTF-8 and compile it, even with some weird signs in identifiers (sometimes really annoying).

The internal representation of strings in Java is UTF-16 (see What is the Java's internal represention for String? Modified UTF-8? UTF-16?).
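You can see the UTF-16 representation from the String API itself: length() counts 16-bit char units, not characters. (A minimal sketch; the class name and the choice of U+1D11E, a character outside the Basic Multilingual Plane, are just illustrations.)

```java
public class Utf16Internal {
    public static void main(String[] args) {
        // U+1D11E (musical G clef) needs a surrogate pair in UTF-16
        String s = "\uD834\uDD1E";
        System.out.println(s.length());                      // 2 (two char units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (one character)
    }
}
```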

user1363989
  • I am talking about someString.getBytes(Charset.defaultCharset()); – black Feb 03 '14 at 20:26
  • @black Your default character set in Java might be UTF-8, but this is dependent on your environment. Java may default to something different on another machine. You can avoid this uncertainty by using UTF-8 encoding regardless of the default encoding. – Peter Lawrey Feb 03 '14 at 20:33

Unicode is a character set; UTF-8 and UTF-16 are encodings of Unicode. For English (actually ASCII) characters, UTF-8 results in the same bytes as CP1252, while UTF-16 adds a zero byte. As you want to use Russian (Cyrillic), you can use UTF-8, UTF-16 or CP1251, but both applications must agree on the encoding.

For example, if you agreed on UTF-8, the following will convert a Java String s to an array of bytes using UTF-8:

byte[] b = s.getBytes("UTF-8");

Then:

outputStream.write(b);

will send the data on the socket.
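On the receiving end, the same charset turns the bytes back into a string; the C++ side would likewise read the raw bytes and interpret them as UTF-8. A minimal round-trip sketch (the class name is illustrative, and a ByteArrayOutputStream stands in for the socket stream):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        String s = "Привет";

        // Sender: encode with the agreed-upon charset
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // stand-in for the socket
        out.write(b);

        // Receiver: decode with the same charset
        String received = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(received.equals(s)); // true
    }
}
```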

Jonathan Rosenne