5

I validated the HTML of my webpage with Firefox's HTML Validator add-on, and it complains about the href attribute of a link that contains Unicode characters which are not URL-encoded.

The current URL is:

<a href='/اخبار'>Persian News</a>

However, the validator wants it to be:

<a href='/%D8%A7%D8%AE%D8%A8%D8%A7%D8%B1'>Persian News</a>

I've tested this link in almost every browser (even back to IE6), and it works just fine. So what is the problem here? Why should I encode it? Is the validator out of date? What problems might I encounter if I don't URL-encode Unicode characters inside the href attribute of an <a> tag?

Saeed Neamati

4 Answers

5

URLs can only be sent over the Internet using the ASCII character-set.

Usually, the browser does the percent-encoding for you before it sends the request.
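As a rough illustration of what that encoding amounts to, here is a minimal Python sketch. Python is used only for demonstration, and the sketch assumes the path is encoded as UTF-8, which matches the escaped form the validator asked for in the question.

```python
from urllib.parse import quote, unquote

path = "/اخبار"

# Percent-encode the UTF-8 bytes of every non-ASCII character.
# quote() leaves "/" untouched by default.
encoded = quote(path)
print(encoded)  # /%D8%A7%D8%AE%D8%A8%D8%A7%D8%B1

# Decoding restores the original path, so both forms name the same resource.
decoded = unquote(encoded)
print(decoded == path)  # True
```

Each Persian letter here takes two bytes in UTF-8, which is why five letters become ten `%XX` escapes.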

hoymkot
  • See also RFC2396 (http://www.ietf.org/rfc/rfc2396.txt): "A URI is always in an 'escaped' form". In HTML5, UTF-8 is allowed. See answer by @Ixgr – koppor Nov 12 '13 at 13:11
3

It depends on what standard you want your page to conform to:

  • For (X)HTML5, URIs containing non-ASCII characters (i.e., IRIs) are valid, as long as your document is encoded in UTF-8 or UTF-16 and the MIME headers are sent accordingly.

  • In HTML4/XHTML1 documents, all non-ASCII characters always have to be escaped.

See also the answer to Are IRIs valid as HTML attribute values?.
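If you target HTML4/XHTML1, a build step could escape such links for you. The following is only a sketch under assumptions: the helper names `needs_escaping` and `escape_href` are hypothetical, and it assumes UTF-8 as the encoding behind the percent-escapes.

```python
from urllib.parse import quote

# Characters that carry meaning in a URL, plus "%" so existing escapes survive.
URL_DELIMITERS = "/:?#[]@!$&'()*+,;=%"

def needs_escaping(href: str) -> bool:
    """Return True if the href contains any non-ASCII character."""
    return not href.isascii()

def escape_href(href: str) -> str:
    """Percent-encode everything except unreserved characters and the
    delimiters above; non-ASCII characters are encoded as UTF-8 bytes."""
    return quote(href, safe=URL_DELIMITERS)

print(needs_escaping("/اخبار"))  # True
print(escape_href("/اخبار"))     # /%D8%A7%D8%AE%D8%A8%D8%A7%D8%B1
```

Keeping `%` in the safe set means an already escaped href passes through unchanged instead of being double-encoded.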

lxgr
3

Browsers that do not support this language (encoding) will not be able to open the URL. You should encode it to make sure that everybody is able to use all functionality of your website.

dwalldorf
  • I think that nowadays IE6 is the oldest and buggiest browser. If it supports the links, then the validator may be out of date. – Saeed Neamati Nov 01 '11 at 08:21
  • It might look different if the language is not known to your system. Have you tried it on Mac OS and Linux? It also might be scary for crawlers and screen readers. Maybe this helps you: http://en.wikipedia.org/wiki/Percent-encoding – dwalldorf Nov 01 '11 at 08:30
2

Yes. I tested this with <meta http-equiv="Content-type" content="text/html; charset=windows-1251"/>, which is meant for Russian, and the link just turned into <a href="/?????">Persian News</a>. So you need a proper charset and encoding to make it work.
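One plausible way those question marks arise can be reproduced in a few lines of Python. This is an assumption for illustration only, since the exact tool or step that replaced the characters is not stated above.

```python
href = "/اخبار"

# windows-1251 (Cyrillic) has no Arabic-script letters, so each one is
# replaced with "?" when the text is forced into that charset.
as_cp1251 = href.encode("cp1251", errors="replace")
print(as_cp1251)  # b'/?????'

# windows-1256 (Arabic) can represent these letters, so nothing is lost.
as_cp1256 = href.encode("cp1256")
print(as_cp1256.decode("cp1256") == href)  # True

# UTF-8 can represent any Unicode text.
print(href.encode("utf-8").decode("utf-8") == href)  # True
```

A percent-encoded href is pure ASCII, so it survives any of these charsets unchanged.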

Fadli Saad
  • Didn't understand what you mean. Sorry. – Saeed Neamati Nov 01 '11 at 08:24
  • Isn't the default `Content-Type` set to `utf-8`? Then you need to encode it only when you limit yourself to a specific charset. However, almost nobody uses a Russian charset to write Persian. They can simply use an Arabic or Persian charset. – Saeed Neamati Nov 01 '11 at 08:31
  • It's just an example. When you mean to reach a global audience, there are a lot of limitations. I am working on a multilanguage website (21 languages, to be exact), and it's really painful to be fully compliant across all the languages. It's a validator; don't rely on it 100%. Have you tried setting the source file encoding to Windows-1256? – Fadli Saad Nov 01 '11 at 08:45