5

I printed some UTF-16 encoded characters and tried to display them in Firefox, but each one showed up as �.

So I went to Tools->Encoding and changed the encoding from UTF-8 to UTF-16 (I also tried changing the charset directly in the HTML). However, when I did that, my page was completely flooded with symbols:

਍ℼ佄呃偙⁅瑨汭ാ㰊瑨汭ാഊ㰊敨摡ാ †ഠ †㰠楴汴㹥楬畮⁸‭楆敲潦⁸楤灳慬獹朠牡慢敧挠慨慲瑣牥⁳湩氠敩⁵景眠扥 瀠条⁥‭畓数⁲獕牥⼼楴汴㹥਍††氼湩敲㵬猢潨瑲畣⁴捩湯•牨晥∽瑨灴⼺振湤献瑳瑡捩渮瑥猯灵牥獵牥椯杭是癡捩湯椮潣㸢਍††氼湩敲㵬愢灰敬琭畯档椭潣≮栠敲㵦栢瑴㩰⼯摣⹮獳慴楴⹣敮............

How can web browsers display UTF-16 characters without wrecking the page?

allenylzhou

4 Answers

6

The “flooded with symbols” excerpt looks like an HTML document that is UTF-8 encoded but treated as if it were UTF-16 encoded. Or it might contain mostly UTF-8 data with some UTF-16 encoded data thrown in, which won’t work.
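One way to check this diagnosis is to reproduce it. The sketch below uses Python purely for illustration: it takes the ASCII text `<!DOCTYPE html` as UTF-8 bytes and decodes those bytes as little-endian UTF-16. The output should match the symbols near the start of the excerpt in the question.

```python
# ASCII markup encoded as UTF-8: one byte per character.
raw = "<!DOCTYPE html".encode("utf-8")

# Wrongly decoding those bytes as little-endian UTF-16 fuses each
# pair of bytes into a single 16-bit code unit.
garbled = raw.decode("utf-16-le")

print(len(raw), "bytes became", len(garbled), "characters")
print(garbled)
```

Each pair of Latin letters collapses into one unrelated character, which is why the whole page turns into a wall of symbols.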

If you save your data as properly UTF-16 encoded and declare the encoding in HTTP headers and/or meta tags, then some browsers will display it OK, some won’t. Search engines generally fail to process UTF-16, and UTF-16 is mostly not used and should not be used on the web, except by mutual agreement between consenting well-informed partners.

Jukka K. Korpela
3

Firefox could not figure out the correct charset for your document. For web pages, a meta tag in the head should be used to indicate the content's charset. It should be placed at the beginning of the HTML file, indicating which charset the browser should use for the rest of the file.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

So the browser is charset blind until it reads that line. With utf-8 that is no problem, because every character up to that point is encoded in utf-8 the same way it would be in ASCII (the same goes for latin-1 and others). That's not the case in utf-16.
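To make the difference concrete, here is a small Python sketch (the language is chosen only for illustration, and the `prefix` string is just an example). It compares the bytes of an ASCII-only document prefix in both encodings.

```python
# The ASCII-only start of a document, before any charset declaration is seen.
prefix = '<html><head><meta http-equiv="Content-Type"'

utf8_bytes = prefix.encode("utf-8")
utf16_bytes = prefix.encode("utf-16-le")

# In UTF-8 the bytes are identical to plain ASCII.
same_as_ascii = utf8_bytes == prefix.encode("ascii")

# In UTF-16 every ASCII character is padded with a zero byte,
# so an ASCII-based scan for the meta tag does not see it.
utf16_has_nul = b"\x00" in utf16_bytes

print(same_as_ascii, utf16_has_nul)
print(utf8_bytes[:6], utf16_bytes[:6])
```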

W3C says:

There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32. Of these three, only UTF-8 should be used for Web content.

So you should use utf-8. But if you still want to try something with utf-16, use the BOM at the beginning of your file. That gives your browser a better chance of figuring out the encoding and properly decoding the content.
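As a rough illustration of what the BOM buys you, the Python sketch below prepends the two possible UTF-16 BOMs to the same text and lets the generic `utf-16` codec pick the byte order from them. This shows how one decoder behaves and is not a description of Firefox's internals.

```python
import codecs

text = "<!DOCTYPE html>"

# The same text in both byte orders, each prefixed with its BOM.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, selects the byte order,
# and drops the BOM from the decoded result.
decoded_le = le_with_bom.decode("utf-16")
decoded_be = be_with_bom.decode("utf-16")

print(le_with_bom[:2], be_with_bom[:2])
print(decoded_le == text, decoded_be == text)
```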

This other answer is very succinct about utf-16 usage.

Joel, meanwhile, gives a full lesson on character encoding and on why HTML puts its declaration inside the content and not in header information.

Miguel Silva
1

Sending UTF-16 data as a Web page to browsers is an XSS risk in older browsers. (See another answer.) Don’t do it. Instead, convert the data to UTF-8 on the server and send UTF-8 over HTTP.
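A minimal sketch of that conversion step in Python, assuming the UTF-16 input starts with a BOM (without one, Python's `utf-16` codec falls back to the platform's native byte order). The helper name `utf16_to_utf8` is hypothetical, not part of any framework.

```python
def utf16_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: re-encode BOM-prefixed UTF-16 bytes as UTF-8."""
    text = raw.decode("utf-16")   # the BOM selects the byte order and is dropped
    return text.encode("utf-8")   # the plain "utf-8" codec adds no BOM


# Example input: Python's "utf-16" encoder writes a BOM first.
sample = "caf\u00e9 \u2713".encode("utf-16")
converted = utf16_to_utf8(sample)
print(converted)
```

The response would then be sent with a `Content-Type` of `text/html; charset=utf-8`, matching the converted bytes.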

hsivonen
0

The way to make this work is for the page to say what encoding it's in. In the case of UTF-16, it also helps to include a BOM. The "flooded with Chinese" effect is most likely because your page is UTF-16LE but the browser treated it as UTF-16BE or vice versa...
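The byte-order mix-up is easy to reproduce. In this Python sketch (for illustration only), a short ASCII string is encoded as UTF-16LE and then decoded as UTF-16BE, which swaps the two bytes of every code unit and lands in the CJK range.

```python
text = "hi"

# Little-endian UTF-16: low byte first, so "h" becomes 0x68 0x00.
as_le = text.encode("utf-16-le")

# Read with the wrong byte order, 0x68 0x00 is taken as U+6800.
swapped = as_le.decode("utf-16-be")

# Read with the right byte order, the text round-trips.
correct = as_le.decode("utf-16-le")

print(as_le, repr(swapped), repr(correct))
```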

Boris Zbarsky