
I use a forum that has a policy against direct commercial links, so what I often do is mangle the URL so that it remains readable but requires a manual copy/paste/edit in order to work. Instead of www.example.com I will use www•example•com . The SO post editor encodes that URI as you'd expect, replacing each • with %E2%80%A2 (so https://www%E2%80%A2example%E2%80%A2com), but when I click the link I'm taken to https://xn--wwwexamplecom-kt6gha . That is also the HREF that the forum sends back after posting.

The xn-- prefix seems to be constant, as does the "gluing" of the first two domain components, but annoyingly the rest varies as a function of the domain name. The -kt6gha bit is domain-specific, and the TLD can be glued to the rest as here, or come after that alphanumeric part.

I'm guessing this conversion is deterministic, but can it be reversed? Preferably in a userscript.js so I can undo my own smart move for myself? ;)
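
For concreteness, the editor's percent-encoding step is just standard URI encoding of the UTF-8 bytes of • (U+2022), reproducible in plain JavaScript:

```javascript
// U+2022 (•) percent-encodes to its three UTF-8 bytes, %E2%80%A2
const mangled = 'https://www•example•com';
console.log(encodeURI(mangled));
// → https://www%E2%80%A2example%E2%80%A2com
```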

RJVB
  • Use `decodeURI` or `decodeURIComponent`. – double-beep Aug 22 '23 at 17:19
  • The decodeURI functions do not handle this encoding! At least not the versions in the browser I was using (an Electron-based application), but it looks like Firefox also doesn't know what to do with a URI like `https://xn--wwwexamplecom-kt6gha`. Of course that server/cname does *not* exist, but you'd expect that they'd show the decoded version in the error message... – RJVB Aug 24 '23 at 16:31
  • `decodeURI` would work with this URL: `https://www%E2%80%A2example%E2%80%A2com`, not with the one that is punycoded. – double-beep Aug 25 '23 at 08:01
  • True, it does, so one really wonders why another, puny code would have been necessary for a particular class of URLs (with UTF-16 characters). I do see that a single character is encoded with 3 hex codes here; would the standard encoding not allow for a unique decoding of every possible code triplet for instance? – RJVB Aug 26 '23 at 10:53
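
As the comments note, `decodeURI`/`decodeURIComponent` only undo percent-encoding, not Punycode — a quick sketch illustrating the difference:

```javascript
// decodeURIComponent undoes the percent-encoding of the UTF-8 bytes…
console.log(decodeURIComponent('https://www%E2%80%A2example%E2%80%A2com'));
// → https://www•example•com

// …but a punycoded host contains no % escapes, so it passes through unchanged
console.log(decodeURIComponent('https://xn--wwwexamplecom-kt6gha'));
// → https://xn--wwwexamplecom-kt6gha
```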

1 Answer


So this turns out to be Punycode, which encodes labels in the Internationalized Domain Names in Applications (IDNA) framework, so that such domain names can be represented in the ASCII character set allowed in the Internet's Domain Name System.

I extracted and adapted the decoder from https://stackoverflow.com/a/301287/1460868 such that it works on full URLs:

    this.ToUnicode = function ( domain ) {
        var protocol = '';
        if (domain.startsWith('https://')) {
            protocol = 'https://';
            domain = domain.substring(8);
        } else if (domain.startsWith('http://')) {
            protocol = 'http://';
            domain = domain.substring(7);
        }
        // split off the path: only the hostname is punycoded
        var ua = domain.split('/');
        domain = ua[0];
        var urlpath = ua.slice(1);
        var domain_array = domain.split(".");
        var out = [];
        for (var i = 0; i < domain_array.length; ++i) {
            // decode only the labels that carry the xn-- ACE prefix
            var s = domain_array[i];
            out.push(
                s.match(/^xn--/) ?
                punycode.decode(s.slice(4)) :
                s
            );
        }
        var result = protocol + out.join(".");
        if (urlpath.length) {
            result += '/' + urlpath.join('/');
        }
        return result;
    }

(That's the modified bit; I also stripped the encoding functions. Note the original `substring(8)` for the `http://` prefix was off by one, and `urlpath` needed a `var` declaration.)

I can now call that from this snippet, which unmangles links rewritten by silly upstream forum filters:

    // also do the same replacements in the URLs
    var links = document.getElementsByTagName('a');
    for (var i = 0; i < links.length; i++) {
        // only run the decoder on hrefs that contain a punycoded label
        var link = /[\/\.]xn--/.test(links[i].href) ?
                punycode.ToUnicode(links[i].href)
                : links[i].href;
        urlRegexs.forEach(function (value, index) {
            var newlink = link.replace(value, urlReplacements[index]);
            if (newlink !== link) {
                links[i].href = newlink;
            }
        });
    }

What I don't get, though, is why browsers don't do this decoding themselves, if the encoding is part of a standard!

RJVB