
In 2010, would you serve URLs containing UTF-8 characters in a large web portal?

Unicode characters are forbidden by the RFC on URLs (RFC 3986); they would have to be percent-encoded to be standards compliant.

The whole point, though, is to serve the characters unencoded for the sake of nice-looking URLs, so percent encoding is out.

All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:

  • URLs getting copied and pasted into text files, e-mails, even web sites with a different encoding
  • HTTP Client libraries
  • Exotic browsers, RSS readers

Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?

Is there some magic way of serving nice-looking URLs in HTML

http://www.example.com/düsseldorf?neighbourhood=Lörick

that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?

Pekka
  • For its part, Firefox displays the Unicode characters in its URL bar but sends them to the server percent-encoded. Moreover, when a user copies the URL from the URL bar, Firefox ensures that the percent-encoded URL is copied to the clipboard. – Siddhartha Reddy Apr 30 '10 at 07:50

7 Answers


Use percent encoding. Modern browsers will take care of display & paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문

Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
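
For illustration, a minimal PHP sketch (PHP only because it is the language used elsewhere on this page; it assumes the source file is saved as UTF-8) that builds the percent-encoded form of the question's example URL:

    <?php
    // Illustrative only: percent-encode the path segment and the query value
    // of the URL from the question before putting it into a link.
    $path  = rawurlencode('düsseldorf');                      // d%C3%BCsseldorf (UTF-8 bytes)
    $query = http_build_query(['neighbourhood' => 'Lörick']); // neighbourhood=L%C3%B6rick
    echo "http://www.example.com/$path?$query";
    // -> http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick

A modern browser will then display the %C3%BC and %C3%B6 sequences as ü and ö in its address bar.
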

Tgr
  • Wow, actually you're right! If you cut'n'paste a %-encoded URL Firefox will turn it into the correct thing for display. – Dean Harding Apr 30 '10 at 07:44
  • Wow, I wasn't aware of this. Chances are this is the best solution! – Pekka Apr 30 '10 at 07:45
  • @Dean that's a fairly recent change - in 2005 all international wikipedias looked like a real %6D%65%73%73. – Roman Starkov Jan 09 '11 at 14:48
  • You can use the unencoded UTF-8 URLs, namely [IRIs](http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier), in [HTML5](http://www.w3.org/html/wg/drafts/html/CR/infrastructure.html#urls) documents by now. If you do that, all major browsers will understand it and display it correctly in their address bar. – Oliver Oct 23 '13 at 12:54
  • What bytes do modern browsers send to servers in the request line `GET /images/logo.png HTTP/1.1`? Do they always percent-encode the URL? – Flimm Sep 11 '15 at 16:34
  • [RfC 3986](https://tools.ietf.org/html/rfc3986) has the details but basically alphanumeric and `_.-` are never encoded, `!$'*,:~@` might or might not be encoded (they don't need to be but some implementations do it anyway), `/?#[]&=+` might or might not be encoded and (depending on which part of the URL it happens in) it might change the meaning of the URI when they are (e.g. a web server might interpret `images/logo.png` as the `logo.png` file in the `images` directory and `images%2Flogo.png` as a file called `images/logo.png` in the root directory), everything else should always be encoded. – Tgr Sep 11 '15 at 17:50
  • Browsers usually get the URI from the link you click on so they don't need to encode anything, but if you type the address in manually, I believe they usually do the least amount of encoding that's necessary. – Tgr Sep 11 '15 at 17:52
  • Is there a way to guess if a browser supports this human-readable form to fall back to de-accented URLs for browsers that don't? – Neme Aug 28 '17 at 13:57
  • Probably not short of looking it up in some kind of browser support database, but pretty much everything supports it these days. (Percent-encoded URL fragments are a different matter altogether, though.) – Tgr Aug 28 '17 at 21:21
  • To clarify, the URL in the answer would look like: `http://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8` – palswim Nov 05 '18 at 21:40

What Tgr said. Background:

http://www.example.com/düsseldorf?neighbourhood=Lörick

That's not a URI. But it is an IRI.

You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.

To encode an IRI into a URI, take the path and query parts, UTF-8-encode them, then percent-encode the non-ASCII bytes:

http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick

If there are non-ASCII characters in the hostname part of the IRI, e.g. http://例え.テスト/, they have to be encoded using Punycode instead.
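
A rough PHP sketch of that split (assuming the intl extension for idn_to_ascii(); the calls are illustrative, not part of the original answer):

    <?php
    // Hostname: Punycode via idn_to_ascii() (requires ext-intl).
    // Path: UTF-8 bytes, percent-encoded.
    $host = idn_to_ascii('例え.テスト');  // the xn-- (Punycode) form of the hostname
    $path = rawurlencode('düsseldorf');   // d%C3%BCsseldorf
    echo "http://$host/$path";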

Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia has been using this for years, e.g.:

http://en.wikipedia.org/wiki/ɸ

The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...

...well, you know.

bobince
  • I know. One day, somebody has to take a big club and smack those Lynx developers on the head. Thanks for the excellent background info. – Pekka Apr 30 '10 at 11:48
  • @bobince And the one bot (fast forward to 2013) that also cannot handle non IRI URIs is... ...well, you know: bingbot! Go figure. – Tom Harrison May 28 '13 at 20:54
  • HTML5 finally supports IRIs. More info on the subject can be found in [this answer to a related question](http://stackoverflow.com/a/19542940/177710). – Oliver Oct 25 '13 at 12:46
  • Re: IE not always displaying pretty IRIs - they are protecting users from homograph-based phishing attacks. Check out http://www.w3.org/International/articles/idn-and-iri/ (specifically the section 'Domain names and phishing') and http://blogs.msdn.com/b/ie/archive/2006/07/31/684337.aspx – codingoutloud Feb 15 '14 at 14:40
  • Domain names have nothing to do with this. All browsers disallow a wide range of characters to prevent phishing. Displaying non-ASCII characters in the path or query string part does not create a similar vulnerability. IE simply didn't bother to implement it. (And Firefox is the only one that implemented it for the fragment part as well.) – Tgr Jul 04 '15 at 00:38

Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:

http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:

http://stackoverflow.com/questions/2742852/これは、これを日本語のテキストです

So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...
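
A hypothetical PHP sketch of such a route handler (load_question() is a made-up stand-in for your data layer; this is not how Stack Overflow actually implements it):

    <?php
    // Hypothetical routing for /questions/{id}/{slug}: only {id} is used for the lookup,
    // so a garbled or re-encoded slug cannot break the link.
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
    if (preg_match('#^/questions/(\d+)(?:/|$)#', $path, $m)) {
        $question = load_question((int) $m[1]); // stand-in: fetch by numeric ID only
        // whatever follows the numeric ID is ignored
    }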

Dean Harding
  • Hmmm, *very* clever thinking! It could still be that some clients choke on the characters no matter where they are located in the string, but it *would* eliminate all the problems with ordinary garbling when copy+pasting a URL, which I think is the most important part. Hadn't looked at SO's URL that way yet. Thanks! – Pekka Apr 30 '10 at 07:34
  • well, this still leaves the word "questions" untranslated, plus there is stuff after the hash #, which follows the entire URL; very nice trick though!! – Evgeny Apr 30 '10 at 16:39
  • You made that Japanese URL with a machine translator, didn't you? – Glutexo Aug 12 '16 at 11:58

Not sure if it is a good idea, but as mentioned in other comments and as I interpret it, many Unicode chars are valid in HTML5 URLs.

E.g., the href docs (http://www.w3.org/TR/html5/links.html#attr-hyperlink-href) say:

The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.

Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:

ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:

If c is not a URL code point and not "%", parse error.

Also, the validator http://validator.w3.org/ passes URLs like "你好" and does not pass URLs with characters like spaces ("a b").
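
As a rough check against that list (a sketch only: it merges the supplementary planes into a single range, so it accepts a few per-plane noncharacters that the spec excludes):

    <?php
    // Rough test that every character of $s is a "URL code point" from the list quoted above.
    function is_url_code_points(string $s): bool {
        return preg_match(
            '/^[A-Za-z0-9!$&\'()*+,\-.\/:;=?@_~' .
            '\x{00A0}-\x{D7FF}\x{E000}-\x{FDCF}\x{FDF0}-\x{FFFD}' .
            '\x{10000}-\x{10FFFD}]*$/u',
            $s
        ) === 1;
    }
    var_dump(is_url_code_points('你好')); // bool(true)
    var_dump(is_url_code_points('a b')); // bool(false) - space is not a URL code point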

Related: Which characters make a URL invalid?

Ciro Santilli OurBigBook.com
  • But both URLs (`"你好"` and `"a b"`) have to be percent encoded when making the HTTP request right? – Utku Aug 08 '16 at 04:53
  • @Utku for `"a b"` I'm pretty sure yes, since space is not in the allowed list above. For `"你好"`, it is definitely the better idea to percent encode, but I don't know if it is just a question of "the implementations are not good enough" or "the standard says so". The HTML standard seems to allow those characters. But I think this is specified by the HTTP standard, not HTML. See also: http://stackoverflow.com/questions/912811/what-is-the-proper-way-to-url-encode-unicode-characters – Ciro Santilli OurBigBook.com Aug 08 '16 at 07:00
  • Yes, I was thinking of the HTTP standard, not HTML. – Utku Aug 08 '16 at 07:51

While all of these comments are true, you should note that, since ICANN has approved Arabic (Persian) and Chinese characters for registration as domain names, all of the browser makers (Microsoft, Mozilla, Apple, etc.) will have to support Unicode in URLs without any encoding, and those URLs should be searchable by Google, etc.

So this issue should resolve itself soon.

Nasser Hadjloo
  • @Nasser: True - we have special characters in german domains now, too - but those are encoded into ASCII characters using [Punycode](http://en.wikipedia.org/wiki/Punycode). While they are sure to work in major browsers, it will be a long time before every HTTP client library and exotic application will be able to deal with unencoded Unicode characters. – Pekka May 03 '10 at 07:33
  • @Pekka, I'm not sure, but from what I've heard, all browsers will have to support Unicode URLs by the 4th quarter of 2010. (I'm not sure.) – Nasser Hadjloo May 03 '10 at 07:44
  • The issue is complicated by the fact that not every user agent is a web browser. The largest example is Google itself: it does not use common web browsers to do its crawling, and neither do many libraries for API interaction, etc. URLs are almost literally everywhere, not just on the WWW; probably even on your file system right now. – Cornelius Jan 23 '14 at 12:41
  • Wow. It's 2022 now and there are still lots of problems with handling URLs that contain non-ASCII symbols. For instance, Ruby still won't support them, pointing to an RFC that nobody obeys any more for purely practical reasons. I just had to write my own function to deal with it. – MDickten Jun 20 '22 at 10:34

For me this is the correct way; this just worked:

    <?php $linker = rawurldecode($link); // decoded copy, used only for the visible link text ?>
    <a href="<?php echo htmlspecialchars($link); ?>" target="_blank"><?php echo htmlspecialchars($linker); ?></a>

This worked, and now links are displayed properly:

http://newspaper.annahar.com/article/121638-معرض--جوزف-حرب-في-غاليري-جانين-ربيز-لوحاته-الجدية-تبحث-وتكتشف-وتفرض-الاحترام

Link found on:

http://www.galeriejaninerubeiz.com/newsite/news

Peter Manoukian
  • "links are displayed properly" - except that the StackOverflow markdown parser doesn't interpret URLs as intended! – MrWhite Oct 30 '15 at 12:18

Use the percent-encoded form. Some (mainly old) computers running Windows XP, for example, do not support Unicode, but rather ISO encodings. That is the reason percent-encoded URLs were invented. Also, if you give a user a URL printed on paper that contains characters which cannot be easily typed, that user may have a hard time typing it (or may just ignore it). The percent-encoded form can even be used on many of the oldest machines that ever existed (although they don't support the Internet, of course).

There is a downside, though: percent-encoded characters are longer than the original ones, possibly resulting in really long URLs. But just try to ignore that, or use a URL shortener (I would recommend goo.gl in this case, which makes a 13-character-long URL). Also, if you don't want to register for a Google account, try bit.ly (bit.ly makes slightly longer URLs, 14 characters long).

EKons