Most browsers, such as Firefox and Chrome, normalize Unicode characters in URLs before requesting them. For example, when Chrome or Firefox opens this link:

http://fa.wikipedia.org/wiki/سید_محمد_خاتمی

which contains Persian Unicode characters, they automatically convert the string into:

http://fa.wikipedia.org/wiki/%D8%B3%DB%8C%D8%AF_%D9%85%D8%AD%D9%85%D8%AF_%D8%AE%D8%A7%D8%AA%D9%85%DB%8C

I want to modify the hyperlinks on my website so that browsers do not normalize the Unicode characters, and so that when a user clicks on a link, its pure (original) URL is requested from the server.

Is there any trick for that? For example, a small piece of JavaScript in the source page that links to such URLs.

UPDATE: When I request the URL from a programming language, e.g. with Java's HttpURLConnection, it requests the original URL and does not apply any normalization (unless I explicitly call UrlNormalizer.normalize(url)). However, most browsers and Linux's GET command do normalize it.

Ali
  • Is that allowed by the HTTP protocol? – Barmar Jan 31 '15 at 11:49
  • I assume that is back-end stuff – paka Jan 31 '15 at 11:53
  • [URLs can only contain a certain set of ASCII characters.](http://stackoverflow.com/a/1547940/53114) Although UTF-8 is supported, they must be encoded on the wire using percent-encoding. That’s exactly what your browser does. – Gumbo Jan 31 '15 at 11:54
  • When I request the URL from a programming language, e.g. with Java's HttpURLConnection, it requests the original URL and does not apply any normalization (unless I explicitly call UrlNormalizer.normalize(url)). However, most browsers and Linux's GET command do normalize it. Obviously, it is not a matter of back-end stuff. – Ali Jan 31 '15 at 12:45

1 Answer

> For example, when Chrome or Firefox opens this link: http://fa.wikipedia.org/wiki/سید_محمد_خاتمی

That's not a valid URI. It's an IRI. Web browsers and other client tools that support IRI will convert it to the ASCII-only URI form (percent-UTF-8-encoded paths and Punycode-encoded hostnames) for you behind the scenes.
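To make that conversion concrete, here is a rough Python sketch of what an IRI-aware client does. The function name `iri_to_uri` is my own, not a standard API, and real browsers follow the WHATWG URL rules, which differ in the details (Python's built-in `idna` codec, for instance, implements the older IDNA 2003). Treat it as an approximation under those assumptions.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Approximate the IRI -> URI mapping (hypothetical helper).

    The hostname is Punycode-encoded; path, query and fragment are
    percent-encoded as UTF-8. Userinfo and IPv6 literals are ignored
    in this sketch.
    """
    parts = urlsplit(iri)

    # Hostname: Unicode labels become ASCII "xn--" labels.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # "%" stays in the safe set so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://fa.wikipedia.org/wiki/سید_محمد_خاتمی"))
```

Run on the link from the question, this should print the same percent-encoded URL the browser ends up requesting.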

> When I request the URL from a programming language, e.g. with Java's HttpURLConnection, it requests the original URL

HttpURLConnection doesn't support IRI. It tries to send the URI as-is anyway, but it should really have rejected it for being invalid.

> I want to modify the hyperlinks on my website so that browsers do not normalize the Unicode characters, and so that when a user clicks on a link, its pure (original) URL is requested from the server.

Sending raw non-ASCII bytes in the request line is not valid under the HTTP standard (RFC 7230 absolute-path -> RFC 3986 segment). Web servers do different, unpredictable things when presented with such invalid requests, so it is always best avoided.
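As an illustration of that constraint, here is a minimal Python sketch of a client that refuses to put a non-ASCII target on the wire. The helper `build_request_line` is hypothetical and only checks for ASCII; it does not validate the full RFC 3986 grammar.

```python
def build_request_line(method: str, target: str) -> bytes:
    """Return an HTTP/1.1 request line as bytes (hypothetical helper).

    Raises ValueError when the request target contains non-ASCII
    characters, i.e. when it is still an IRI rather than a URI.
    """
    try:
        return f"{method} {target} HTTP/1.1\r\n".encode("ascii")
    except UnicodeEncodeError as exc:
        raise ValueError(f"request target is not ASCII: {target!r}") from exc
```

A percent-encoded path such as `/wiki/%D8%B3%DB%8C%D8%AF` passes through unchanged, while the raw Persian path raises `ValueError`. That is the rejection I would have expected from a strict client.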

There is no way to tell IRI-aware browsers to ignore proper behaviour and send non-ASCII request lines, but why would you want to? What are you trying to do here?

bobince
  • Some browsers allow one to [disable Unicode domains](https://news.netcraft.com/archives/2005/02/15/firefox_to_disable_idn_support_as_phishing_defense.html), though. – Cees Timmerman Apr 17 '17 at 17:43