Let's say I need to write a function that searches a block of text for something that looks like a URL and wraps that portion of the text in an HTML <a href="...">anchor</a> tag. Suppose one of the requirements dictates that the function must be able to detect a standalone domain name, like example.com, which lacks protocol and path components, and convert it into a link to http://example.com.
Throwing a quick mockup together using regexes in JavaScript:
function htmlify(sourceText) {
    var detector = /([^\s]+\.(?:com|net| ...SNIP... |org|biz))/g;
    return sourceText.replace(detector, function(match, p1) {
        return '<a href="http://' + p1 + '">' + p1 + '</a>';
    });
}
That works pretty well, but the detector regex needs a list of all the TLDs that currently exist in the world. A few years ago this list would've remained relatively static, but now that generic TLDs are being registered constantly, this regex would go stale pretty quickly. But no problem, right? Just pull the list from the IANA site, parse and filter it, dynamically build a new regex... package and deploy the app... and... bleh. This is rapidly becoming ugly.
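For reference, the "parse, filter, build a regex" step could look roughly like the sketch below. It assumes the list arrives as text shaped like IANA's tlds-alpha-by-domain.txt (a # comment header followed by one uppercase TLD per line). The names buildDetector, htmlifyWith and sampleList are made up for illustration, and fetching and redeploying the list is left out entirely.

```javascript
// Hypothetical helper: turn TLD list text (one TLD per line, '#' comments)
// into a detector regex like the hand-written one.
function buildDetector(tldListText) {
    var tlds = tldListText
        .split(/\r?\n/)
        .map(function (line) { return line.trim().toLowerCase(); })
        .filter(function (line) {
            // Skip blanks, the comment header, and anything that is not a plain label.
            return line !== '' && line.charAt(0) !== '#' && /^[a-z0-9-]+$/.test(line);
        });
    // The trailing lookahead stops a short TLD from matching the start of a
    // longer word (for example "com" inside "example.community").
    return new RegExp('(\\S+\\.(?:' + tlds.join('|') + '))(?![\\w-])', 'gi');
}

// Same replacement logic as before, but the detector is passed in.
function htmlifyWith(sourceText, detector) {
    return sourceText.replace(detector, function (match, p1) {
        return '<a href="http://' + p1 + '">' + p1 + '</a>';
    });
}

// Abbreviated sample data in the assumed file shape, not the real list.
var sampleList = '# sample header line\nCOFFEE\nCOM\nID\nORG\n';
var linked = htmlifyWith('visit dad.coffee today', buildDetector(sampleList));
```

This only moves the staleness problem: the regex is exactly as current as the last copy of the list that was fed into it.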
And yet, when I type dad.coffee into the Chrome or Firefox address bar and hit Enter, it takes me directly to that domain instead of treating it as a search term. How do they do it? Are they using a constantly-updating database and comparing the input text against it? Are they doing a DNS lookup prefetch, trying to see if it would return an NXDOMAIN? Something more clever?
ALSO: Is the requirement itself fundamentally flawed? Say somebody entered this text, which is clearly not supposed to be a domain name:
SELECT posts.id FROM posts;
.id is a valid TLD, and therefore posts.id will become a link to an unintended site. I don't see a way to prevent that, which leads me to believe the problem might not have a single ideal solution. Or does it?
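To make the false positive concrete, here is the same mockup run against that SQL snippet. The TLD list is a short hypothetical stand-in for the full one, and htmlifyDemo is just an illustrative name.

```javascript
// Abbreviated, hypothetical TLD list; the real regex would contain every TLD.
var demoDetector = /([^\s]+\.(?:com|net|org|id))/g;

function htmlifyDemo(sourceText) {
    return sourceText.replace(demoDetector, function (match, p1) {
        return '<a href="http://' + p1 + '">' + p1 + '</a>';
    });
}

var result = htmlifyDemo('SELECT posts.id FROM posts;');
// result: 'SELECT <a href="http://posts.id">posts.id</a> FROM posts;'
```

Nothing in the text itself distinguishes the column reference from a domain name, so a purely syntactic detector has no basis for rejecting it.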
EDIT: I did some testing with Wireshark and Chrome. It appears that any address bar input that looks like an FQDN gets looked up in DNS. Even single words are checked against every domain suffix in the system's DNS search list. This is mixed with a flurry of HTTPS traffic to Google, which is likely populating the find-as-you-type list. Not sure if Google is "helping" the browser arrive at its final decision, or if that happens entirely client-side.
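A server-side approximation of what the capture suggests might look like the Node.js sketch below. This is a guess at the observed behavior, not Chrome's actual logic: candidateNames and resolvesToHost are hypothetical names, the search list is passed in by hand, and real resolvers apply their own ordering rules (such as the ndots option) that are ignored here.

```javascript
var dns = require('dns');

// Hypothetical helper: the names a resolver might try for one input.
// Dotted input is tried as-is; a single word is combined with each suffix.
function candidateNames(input, searchList) {
    var name = input.trim().toLowerCase();
    if (name.indexOf('.') !== -1) {
        return [name];
    }
    return searchList.map(function (suffix) { return name + '.' + suffix; });
}

// Resolves to true as soon as one candidate has an address record.
// Any lookup failure (NXDOMAIN, timeout, and so on) counts as "not found".
function resolvesToHost(input, searchList, lookup) {
    // The lookup function is injectable so the logic can be exercised offline.
    lookup = lookup || function (name) { return dns.promises.lookup(name); };
    return candidateNames(input, searchList).reduce(function (previous, name) {
        return previous.then(function (found) {
            if (found) { return true; }
            return lookup(name).then(
                function () { return true; },
                function () { return false; }
            );
        });
    }, Promise.resolve(false));
}
```

Something like resolvesToHost('dad.coffee', []) would then decide whether to link the text, at the cost of one or more DNS round trips per candidate, and it would still link posts.id if that name happens to resolve.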