Questions tagged [utf-16]

UTF-16 is a variable-width character encoding that represents each Unicode code point as one or two 16-bit code units, that is, as either two or four bytes. Code points up to U+FFFF take a single code unit; code points above U+FFFF take a pair of code units called a surrogate pair.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
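
As a rough illustration of the encoding step that RFC 2781 describes, the Python sketch below turns one code point into its UTF-16 code units. The function name `encode_utf16_units` is made up for this example and is not part of any library; byte order is deliberately left out, since the result is a list of 16-bit integers rather than bytes.

```python
def encode_utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (16-bit integers) for one code point.

    Hypothetical helper written for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    # Code points below U+10000 are stored as a single code unit.
    if code_point < 0x10000:
        return [code_point]

    # Otherwise subtract 0x10000, leaving a 20-bit value, and split it
    # into a high surrogate (top 10 bits) and a low surrogate (bottom 10 bits).
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]


if __name__ == "__main__":
    for cp in (0x20AC, 0x12345, 0x1F600):
        units = encode_utf16_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

For real work, the built-in codecs (`str.encode("utf-16-le")` and friends) are likely the better choice; the sketch only makes the arithmetic visible.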

There are three flavors of UTF-16: little-endian (UTF-16LE), big-endian (UTF-16BE), and UTF-16 with a byte order mark (BOM) that signals which byte order follows.
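
The difference between these flavors can be seen with Python's built-in codecs. This is a small sketch, assuming the standard codec names `utf-16-le`, `utf-16-be`, and `utf-16`; the sample string is arbitrary.

```python
import codecs

text = "A\u20ac"  # 'A' followed by the euro sign, U+20AC

little = text.encode("utf-16-le")  # no BOM, least significant byte first
big = text.encode("utf-16-be")     # no BOM, most significant byte first
with_bom = text.encode("utf-16")   # BOM first, then the platform's native byte order

print(little.hex(" "))    # 41 00 ac 20
print(big.hex(" "))       # 00 41 20 ac
print(with_bom.hex(" "))  # starts with ff fe or fe ff, depending on the platform

# When decoding with the generic "utf-16" codec, the BOM tells the decoder
# which byte order to use, and it is not included in the decoded text.
for bom, payload in ((codecs.BOM_UTF16_LE, little), (codecs.BOM_UTF16_BE, big)):
    assert (bom + payload).decode("utf-16") == text
```

Without a BOM, the reader has to know the byte order from some other source, which is why the explicit `-le` and `-be` codec names exist.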

1193 questions
641 votes, 14 answers

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32? I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
user60456
483 votes, 9 answers

What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me. In VSS, when doing a file comparison, sometimes there is a message saying the two files have…
SoftwareGeek
  • 15,234
  • 19
  • 61
  • 78
187 votes, 7 answers

What is a "surrogate pair" in Java?

I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?
Raymond
  • 2,004
  • 2
  • 13
  • 10
170 votes, 10 answers

Can I make git recognize a UTF-16 file as text?

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16. Can git be taught to recognize that this file…
skiphoppy
  • 97,646
  • 72
  • 174
  • 218
151 votes, 5 answers

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16? Why do we need these? MessageDigest md = MessageDigest.getInstance("SHA-256"); String text = "This is some text"; md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed byte[] digest =…
theJava
  • 14,620
  • 45
  • 131
  • 172
109 votes, 7 answers

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I…
timpone
  • 19,235
  • 36
  • 121
  • 211
94 votes, 4 answers

Deprecated header <codecvt> replacement

A bit of foreground: my task required converting UTF-8 XML file to UTF-16 (with proper header, of course). And so I searched about usual ways of converting UTF-8 to UTF-16, and found out that one should use templates from <codecvt>. But now when it…
login_not_failed
  • 1,121
  • 2
  • 11
  • 19
86 votes, 5 answers

What's the point of UTF-16?

I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then…
dsimcha
  • 67,514
  • 53
  • 213
  • 334
77 votes, 5 answers

Difference between Big Endian and little Endian Byte order

What is the difference between big-endian and little-endian byte order? Both of these seem to be related to Unicode and UTF-16. Where exactly do we use this?
web dunia
  • 9,381
  • 18
  • 52
  • 64
76 votes, 3 answers

Why does .net use the UTF16 encoding for string, but uses UTF-8 as default for saving files?

From here Essentially, string uses the UTF-16 character encoding form But when saving vs StreamWriter : This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM), I've seen this sample (broken link…
Royi Namir
  • 144,742
  • 138
  • 468
  • 792
75 votes, 10 answers

grepping binary files and UTF16

Standard grep/pcregrep etc. can conveniently be used with binary files for ASCII or UTF8 data - is there a simple way to make them try UTF16 too (preferably simultaneously, but instead will do)? Data I'm trying to get is all ASCII anyway (references…
taw
  • 18,110
  • 15
  • 57
  • 76
62 votes, 3 answers

Byte and char conversion in Java

If I convert a character to byte and then back to char, that character mysteriously disappears and becomes something else. How is this possible? This is the code: char a = 'È'; // line 1 byte b = (byte)a; // line 2 char c =…
user1883212
  • 7,539
  • 11
  • 46
  • 82
58 votes, 2 answers

Unicode in C++11

I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood…
Tristan Brindle
  • 16,281
  • 4
  • 39
  • 82
57 votes, 5 answers

Java Unicode String length

I am trying hard to get the length of a Unicode string and have tried various options. It looks like a small problem, but I am stuck in a big way. Here I am trying to get the length of the string str1. I am getting it as 6. But actually it is 3. moving the cursor…
user1611248
  • 708
  • 3
  • 7
  • 13
56 votes, 3 answers

Manually converting unicode codepoints into UTF-8 and UTF-16

I have a university programming exam coming up, and one section is on unicode. I have checked all over for answers to this, and my lecturer is useless so that’s no help, so this is a last resort for you guys to possibly help. The question will be…
RSM
  • 14,540
  • 34
  • 97
  • 144