
I'm testing how Firefox encodes characters in URLs, but the results confuse me.

HTML code:

<html lang="zh-CN">
<head>
<title>some Chinese character</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<img src="http://localhost/xxx" />
</body>
</html>

Here xxx stands for some Chinese characters. These characters must be percent-encoded (into the form %xx) before they can be sent over HTTP.
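As a rough illustration of what that percent-encoding step produces, the Python sketch below encodes the same text once as UTF-8 and once as GBK. The sample text 中文 is an assumption standing in for the "xxx" placeholder, and urllib.parse.quote is used only to show the byte-level result, not how Firefox implements it.

```python
from urllib.parse import quote

# Hypothetical sample text standing in for the "xxx" placeholder.
text = "中文"

# Percent-encode the UTF-8 bytes of the text (3 bytes per character here).
utf8_encoded = quote(text, encoding="utf-8")

# Percent-encode the GBK bytes of the same text (2 bytes per character).
gbk_encoded = quote(text, encoding="gbk")

print(utf8_encoded)  # %E4%B8%AD%E6%96%87
print(gbk_encoded)   # %D6%D0%CE%C4
```

Comparing the request line seen by the server against these two strings is one way to tell which encoding the browser actually used.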

First, I saved the source file as UTF-8 and opened the HTML file in Firefox. The img tag sent a request in which the "xxx" characters were encoded as UTF-8.

  • (source file saved as UTF-8, charset=utf-8: the browser encodes the URL as UTF-8)

Then I changed the meta tag to <meta http-equiv="Content-Type" content="text/html; charset=gbk">, but nothing changed.

  • (source file saved as UTF-8, charset=gbk: the browser encodes the URL as UTF-8)

Second, I saved the source file as ANSI, which is probably GBK or GB2312.

With charset=gbk, the characters were still encoded as UTF-8.

  • (source file saved as GBK, charset=gbk: the browser encodes the URL as UTF-8)

BUT with charset=utf-8, the characters were encoded as GBK. By the way, the other Chinese characters on the page did not display correctly either, e.g. the string in the title.

  • (source file saved as GBK, charset=utf-8: the browser encodes the URL as GBK)
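The garbled title in the last case can be reproduced outside the browser. The Python sketch below, again using the made-up sample text 中文, stores the text as GBK bytes and then reads them back as UTF-8, which is what a page saved as GBK but declared as charset=utf-8 asks the browser to do. It only illustrates the decoding mismatch and does not attempt to model how Firefox then builds the request URL.

```python
# Hypothetical sample text; any Chinese string would behave similarly.
text = "中文"

# Bytes as they would sit on disk in a file saved as GBK.
gbk_bytes = text.encode("gbk")

# A strict UTF-8 decode of those bytes fails, because they are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode substitutes U+FFFD replacement characters instead,
# which is the kind of garbage a reader would see in the title.
lenient = gbk_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the file was really saved in recovers the text.
round_trip = gbk_bytes.decode("gbk")

print(strict_ok)
print(lenient)
print(round_trip)
```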

How can I control the browser's URL-encoding behavior?

HUA Di
  • I ran this test because I hit a problem when trying to force the browser to URL-encode in UTF-8: I changed the charset, but nothing happened. So I wonder whether there is something else I don't understand about how the browser encodes URLs. – HUA Di Dec 22 '12 at 08:25

1 Answer


UTF-8 is the standard encoding for URLs. If you physically save your source file as GBK but declare utf-8 in the Content-Type, you are simply lying to the browser and will get inconsistent or non-working results.

From RFC 3986, section 2.5:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
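The three examples in the quoted passage can be checked with a short Python sketch. This is only an illustration using the standard library's urllib.parse.quote; the dictionary name rfc_examples is made up for this example.

```python
from urllib.parse import quote

# Characters from the quoted passage, mapped to the encodings it states.
rfc_examples = {
    "A": "A",                  # unreserved character, left as-is
    "\u00C0": "%C3%80",        # LATIN CAPITAL LETTER A WITH GRAVE
    "\u30A2": "%E3%82%A2",     # KATAKANA LETTER A
}

# Encode each character as UTF-8 octets, then percent-encode every octet
# that is not an unreserved character (safe="" keeps nothing else literal).
results = {
    char: quote(char, safe="", encoding="utf-8") for char in rfc_examples
}

for char, encoded in results.items():
    print(repr(char), "->", encoded)
```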

Esailija
  • I changed my approach and now print the already-encoded URL directly into the web page. Thanks for your answer. – HUA Di Dec 26 '12 at 07:12
  • Well, strictly speaking RFC 3986 only uses "should" when talking about using UTF-8, so the standard does allow exceptions. However, in practice at least all modern browsers will encode URLs as UTF-8. – sleske Feb 08 '16 at 13:54
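The approach mentioned in the comments, writing the already percent-encoded URL into the page, might look like the Python sketch below. The function name build_img_tag, the base URL, and the sample name 中文 are all hypothetical; the asker's actual server code is not shown in the thread.

```python
from html import escape
from urllib.parse import quote, unquote


def build_img_tag(base_url: str, name: str) -> str:
    """Return an img tag whose src is already percent-encoded as UTF-8."""
    # Encode the non-ASCII name on the server, so the markup is pure ASCII.
    encoded_name = quote(name, safe="", encoding="utf-8")
    # Escape the attribute value in case the URL contains &, < or quotes.
    return '<img src="{}" />'.format(escape(base_url + encoded_name, quote=True))


tag = build_img_tag("http://localhost/", "中文")
print(tag)  # <img src="http://localhost/%E4%B8%AD%E6%96%87" />

# The receiving side can recover the original text by decoding as UTF-8.
decoded = unquote("%E4%B8%AD%E6%96%87", encoding="utf-8")
print(decoded)
```

Because the emitted URL contains only ASCII characters, its bytes are the same whether the page is saved as UTF-8 or GBK, so the result no longer depends on the declared charset.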