11

I just stumbled upon the following article:

http://www.josscrowcroft.com/2011/code/utf-8-multibyte-characters-in-url-parameters-%E2%9C%93/

The article talks about using UTF-8 characters in URLs.

I would like to know whether this is safe to do.

I have basically the same setup (browser + OS) as the guy who wrote the article. So I can't really test it.

So... is it safe to use UTF-8 characters in URLs?

And the bonus question: if it's safe, how come not many websites use it?

Dagg Nabbit
PeeHaa
  • possible duplicate of [Unicode characters in URLs](http://stackoverflow.com/questions/2742852/unicode-characters-in-urls) – Rafa Viotti Apr 17 '15 at 03:15

3 Answers

7

Unicode characters in the URL (I'm not talking about the domain name) are safe to use. There is no security risk if you use them on your site. (There are some risks to the end user if they visit a fraudulent site that uses Unicode on the page, as Oded said.)

The only real problem is how older browsers (and OSs) show them. Browsers that don't support them will show those ugly percent-encoded characters in the URL. You should probably also percent-encode the URLs inside the HTML, in case an older browser doesn't encode them for you and the user can't follow the link (which is bad). Modern browsers show the decoded URL in the address bar but send the encoded version in the request, so the user always sees the pretty Unicode characters.
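As a rough illustration of the percent-encoding described above, here is a minimal Python sketch using the standard library's `urllib.parse`. The path is only an example modelled on the article's URL, and the variable names are made up for illustration; a real site would do this in whatever layer generates its links.

```python
from urllib.parse import quote, unquote

# Example path with a check mark, modelled on the linked article's URL.
pretty_path = "/code/utf-8-multibyte-characters-in-url-parameters-✓/"

# quote() encodes each non-ASCII character as UTF-8 bytes and then
# percent-encodes every byte; "/" is left alone by default.
encoded_path = quote(pretty_path)

# unquote() reverses the process, decoding the bytes as UTF-8 by default.
decoded_path = unquote(encoded_path)

print(encoded_path)  # the form that goes on the wire and into href attributes
print(decoded_path)  # the form a modern browser shows in the address bar
```

Putting the encoded form in the HTML means the link does not depend on the browser doing the encoding itself.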

Gerben
  • I agree with this answer. As for browser support this question is related: http://stackoverflow.com/questions/7962110/browser-support-unicode-url – enyo Feb 21 '12 at 12:53
  • 5
    It's definitely not 'safe'. There is no standard for using URL-encoded UTF-8 characters, and there is no way of specifying a character set for non-ASCII characters. You are free to use whatever URL-encoded characters you like, but there is absolutely no guarantee that any browser will interpret or display them in any particular way, so not surprisingly, YMMV. – Synchro Jan 13 '14 at 08:14
  • 1
    Good point. I can indeed find nothing in the URL, URI, or IRI specs about character encodings other than ASCII. – Gerben Jan 13 '14 at 14:37
  • There is one comment of use in RFC3986: "...a URI is assumed to be in the same character encoding as the surrounding text" - so you could have UTF-8 in URLs within a UTF-8 HTML document, and a user agent would know what to do with them, but as soon as it's not in that document (e.g. you text it to someone) it loses that contextual metadata and can only be in ASCII. It also says that a protocol can define an encoding explicitly, but HTTP doesn't do that (well it does, but it's ASCII). – Synchro Jan 15 '14 at 16:06
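The ambiguity raised in these comments can be shown with a small Python sketch: a percent-encoded URL only carries bytes, so the text you get back depends on which character encoding the reader assumes. The sample character and variable names are illustrative assumptions, not anything from the specs.

```python
from urllib.parse import quote, unquote

text = "é"

# The same character yields different percent-encoded bytes
# depending on the encoding chosen by whoever builds the URL.
utf8_form = quote(text, encoding="utf-8")      # two bytes
latin1_form = quote(text, encoding="latin-1")  # one byte

# Nothing in the URL says which encoding was used, so a reader that
# guesses wrong gets garbage instead of the original character.
misread_as_latin1 = unquote(utf8_form, encoding="latin-1")
misread_as_utf8 = unquote(latin1_form, encoding="utf-8")  # invalid UTF-8, replaced with U+FFFD

print(utf8_form, latin1_form)
print(repr(misread_as_latin1), repr(misread_as_utf8))
```

This is why both ends need to agree on UTF-8 out of band; the URL itself cannot declare it.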
1

It is possible with any browser that supports IDN.

However, IDN is not well supported across web servers, proxies and other internet infrastructure, so most sites can't support it and still be sure people can get to them...

And, as @Rook alludes to, there are still security issues with using UTF-8 this way (XSS, for example).

Oded
  • Do you have any examples regarding UTF-8 and XSS?! – Chris Jul 08 '11 at 13:35
  • How could the infrastructure of proxies have problems with IDN? Domain names are converted to Punycode, which is 100% compatible with older domain names. Unicode characters inside the path are percent-encoded byte by byte. Percent-encoding is as old as HTTP, so it should work on every system. Some web servers might have trouble with it, though, especially since they map URLs to a file system that might not support Unicode filenames. – Gerben Jul 08 '11 at 14:00
  • 1
    This question is only indirectly to do with IDN. IDN only applies to domain names, not any other elements in the URL. There is no particular relationship between IDN and URL-encoded characters. – Synchro Jan 13 '14 at 08:10
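The distinction drawn in these comments (Punycode for the host, percent-encoding for everything else) can be sketched in Python. The host name and path below are purely illustrative, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat this as a demonstration of the idea rather than of what any particular browser does.

```python
from urllib.parse import quote

# Illustrative host and path; neither is meant as a real endpoint.
host = "www.öbb.at"
path = "/fahrplan/zürich"

# The domain name goes through IDNA/Punycode: each non-ASCII label
# becomes an ASCII label starting with "xn--".
ascii_host = host.encode("idna").decode("ascii")

# The path is handled separately: UTF-8 bytes, percent-encoded one by one.
ascii_path = quote(path)

# Both results are plain ASCII, which is what travels over DNS and HTTP.
print("http://" + ascii_host + ascii_path)

# The Punycode form can be turned back into the Unicode host name.
round_trip_host = ascii_host.encode("ascii").decode("idna")
```

Because both forms are plain ASCII, intermediaries that predate Unicode URLs never see anything unusual.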
-8

UTF-8 still has a long, long way to go... definitely not safe.

And culturally, I like it that way. I cannot imagine writing or remembering a URL made of Chinese characters, or Chinese users doing the same with ours.

Rook
  • When do you ever have to remember the query part of a URL? He's not talking about UTF-8 domain names (which also exist, by the way: http://www.öbb.at) - so it's safe to assume that nobody will ever have to type the characters manually. – enyo Feb 21 '12 at 12:49