
I call a web service that gives me back a response XML with UTF-8 encoding; I checked that in Java using the getAllHeaders() method.

In my Java code, I take that response, do some processing on it, and later pass it on to a different service.

I googled a bit and found out that the default encoding for strings in Java is UTF-16.

In my response XML, one of the elements contained the character É, and it got mangled in the post-processing request that I make to the other service.

Instead of sending É, it sent gibberish. Now I want to know: is there really a big difference between these two encodings? And if I want to know what É converts to from UTF-8 to UTF-16, how can I find that out?

Kraken
  • How do you read and write your XML? JAXB? StAX? Can you show the code where you create the reader and writer? – Puce Mar 14 '14 at 12:08

4 Answers


Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character occupies a minimum of 8 bits, while in UTF-16 a character occupies a minimum of 16 bits.

Main UTF-8 pros:

  1. Basic ASCII characters like digits, Latin letters with no accents, etc. occupy one byte that is identical to their US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  2. No null bytes, which allows the use of null-terminated strings; this provides a great deal of backwards compatibility too.

Main UTF-8 cons:

  1. Many common characters have different lengths, which slows down indexing by code point and calculating string length terribly.

Main UTF-16 pros:

  1. Most common characters, including Latin, Cyrillic, Chinese, and Japanese, can be represented with 2 bytes. Unless really exotic characters (outside the Basic Multilingual Plane) are needed, the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.

Main UTF-16 cons:

  1. Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.

In general, UTF-16 is usually better for an in-memory representation, while UTF-8 is extremely good for text files and network protocols.
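To see concretely what É (U+00C9) looks like in each encoding, here is a minimal sketch using the standard java.nio.charset.StandardCharsets constants; the toHex helper is just for printing and is not part of any library:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "É"; // U+00C9

        // UTF-8 encodes U+00C9 as two bytes: C3 89
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8:    " + toHex(utf8));   // C3 89

        // UTF-16 (big-endian) also uses two bytes here: 00 C9
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println("UTF-16BE: " + toHex(utf16));  // 00 C9

        // Decoding the UTF-8 bytes with the right charset restores the string
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.equals(decoded));            // true
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }
}
```

If the gibberish you saw looked like Ã‰, that is the classic symptom of the UTF-8 bytes C3 89 being re-decoded as a single-byte charset such as windows-1252.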

Arjun Chaudhary
    Nice reply. Can you kill my curiosity and perhaps name a practical use for UTF-32? For the life of me I can't think of a reason why it exists. A simple Google gets me no further than "speed optimization". – Gimby Mar 14 '14 at 12:26
  • I have a question, maybe a very trivial one. Take the example of a simple notepad. Let's say I call some service that returns me UTF-8 encoded data, which is basically all ASCII or maybe some other encoding. Now I have a character from the web service, say 'A'. This A will be mapped to something in UTF-8, for example 00000000 (8 bits). Now when notepad interprets this, it converts it into 0000 (4 bits). Won't that screw up everything for me? – Kraken Mar 14 '14 at 12:31
  • Check my answer below. – Arjun Chaudhary Mar 14 '14 at 12:37
  • UTF-32 is arguably the most human-readable of the Unicode Encoding Forms, because its big-endian hexadecimal representation is simply the Unicode Scalar Value without the “U+” prefix and zero-padded to eight digits – Arjun Chaudhary Mar 14 '14 at 12:40
  • Umm, maybe I am not sure of the question that I want to ask. Maybe I'll frame it well sometime later and ask it in a separate thread. – Kraken Mar 14 '14 at 12:48

There are two things:

  • the encoding in which you exchange data;
  • the internal string representation of Java.

You should not be preoccupied with the second point ;) The thing is to use the appropriate methods to convert from your data (byte arrays) to Strings (char arrays, ultimately), and to convert from Strings back to your data.

The most basic classes you can think of are CharsetDecoder and CharsetEncoder, but there are plenty of others: String.getBytes() and all the Readers and Writers are but two possible mechanisms, and there are all the static methods of Character as well.

If you see gibberish at some point, it means you failed to correctly decode or encode between the original byte data and Java strings. But again, the fact that Java strings use UTF-16 is not relevant here.

In particular, you should be aware that when you create a Reader or Writer, you should specify the encoding; if you fail to do so, the default JVM encoding will be used, and it may, or may not, be UTF-8.
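A minimal sketch of that last point; the file names response.xml and request.xml are hypothetical, and the important part is passing the charset explicitly on both sides:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetCopy {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names; the point is naming the charset
        // explicitly instead of relying on the JVM default.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new FileInputStream("response.xml"), StandardCharsets.UTF_8));
             Writer writer = new OutputStreamWriter(
                     new FileOutputStream("request.xml"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write(System.lineSeparator());
            }
        }
    }
}
```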

fge

This website provides UTF-to-UTF conversion:

http://www.fileformat.info/convert/text/utf2utf.htm

UTF-32 is arguably the most human-readable of the Unicode Encoding Forms, because its big-endian hexadecimal representation is simply the Unicode scalar value without the “U+” prefix, zero-padded to eight digits. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.

HOWEVER

UTF-32 is the same as the old UCS-4 encoding and remains fixed-width. Why can this remain fixed-width? Because UTF-16 is now the format that can encode the fewest code points, it sets the limit for all formats: it was defined that 1,112,064 is the total number of code points that will ever be defined by either Unicode or ISO 10646. Since Unicode is only defined from 0 to 10FFFF, UTF-32 sounds a bit like a pointless encoding now, as it is 32 bits wide but only about 21 bits are ever used, which makes it very wasteful.
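For the curious, that 1,112,064 figure falls out of the UTF-16 surrogate mechanism: 17 planes of 65,536 code points each, minus the 2,048 code points reserved as surrogates, which can never stand for characters on their own:

17 × 65,536 − 2,048 = 1,114,112 − 2,048 = 1,112,064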

Arjun Chaudhary

UTF-8: Generally speaking, you should use UTF-8. Most HTML documents use this encoding.

It uses at least 8 bits of data to store each character. This can lead to more efficient storage, especially when the text contains mostly English ASCII characters. But higher-order characters, such as non-ASCII characters, may require up to 32 bits (4 bytes) each!

UTF-16: This encoding uses at least 16 bits to encode characters, including lower-order ASCII characters and higher-order non-ASCII characters.

If you are encoding text consisting mostly of non-English or non-ASCII characters, UTF-16 may result in a smaller file size. But if you use UTF-16 to encode mostly ASCII text, it will use up more space.
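A quick sketch that demonstrates the trade-off; the sample strings are arbitrary, and UTF_16BE is used so that no byte-order mark is counted:

```java
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) {
        String ascii = "Hello, world"; // 12 ASCII characters
        String cjk = "こんにちは世界";     // 7 Japanese characters

        // ASCII text: UTF-8 uses 1 byte per character, UTF-16 uses 2
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 12
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 24

        // CJK text: UTF-8 uses 3 bytes per character here, UTF-16 uses 2
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);      // 21
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length);   // 14
    }
}
```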

Ashutosh gupta