
Let's say I need to write a function that searches a block of text for something that looks like a URL, and wraps that portion of the text in an HTML <a href="...">anchor</a> tag. Suppose one of the requirements dictates that the function must be able to detect a standalone domain name, like example.com, which lacks protocol and path components, and convert it into a link to http://example.com.

Throwing a quick mockup together using regexes in JavaScript:

function htmlify(sourceText) {
    // Match a run of non-whitespace characters ending in a dot
    // followed by a known TLD.
    var detector = /([^\s]+\.(?:com|net| ...SNIP... |org|biz))/g;

    return sourceText.replace(detector, function(match, p1) {
        // Wrap each hit in an anchor tag, assuming the http scheme.
        return '<a href="http://' + p1 + '">' + p1 + '</a>';
    });
}
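
A quick sanity check of the happy path:

htmlify('Check out example.com for more info.');
// => 'Check out <a href="http://example.com">example.com</a> for more info.'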

That works pretty well, but the detector regex needs a list of all the TLDs that currently exist in the world. A few years ago this list would've remained relatively static, but now that generic TLDs are being registered constantly, this regex would get pretty stale pretty quickly. But no problem, right? Just pull the list from the IANA site, parse and filter it, dynamically build a new regex... package and deploy the app... and... bleh. This is rapidly becoming ugly.
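
For the record, here is roughly what that pipeline looks like as a Node.js sketch, assuming a runtime with a global fetch (Node 18+) and the list IANA publishes at https://data.iana.org/TLD/tlds-alpha-by-domain.txt, which is one uppercase TLD per line plus a # comment header (buildDetector is just an illustrative name):

// Build the detector regex dynamically from IANA's TLD list.
async function buildDetector() {
    var response = await fetch('https://data.iana.org/TLD/tlds-alpha-by-domain.txt');
    var body = await response.text();

    var tlds = body.split('\n')
        .map(function(line) { return line.trim().toLowerCase(); })
        .filter(function(line) { return line && line.charAt(0) !== '#'; });

    // TLDs are plain alphanumerics (or xn-- punycode), so they can be
    // joined straight into an alternation without escaping.
    return new RegExp('([^\\s]+\\.(?:' + tlds.join('|') + '))', 'g');
}

Workable, but now the detector is a moving target that has to be re-fetched and rebuilt on a schedule.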

And yet, when I type dad.coffee into the Chrome or Firefox address bar and hit Enter, it takes me directly to that domain instead of treating it as a search term. How do they do it? Are they using a constantly-updating database and comparing the input text against it? Are they doing a DNS lookup prefetch, trying to see if it would return an NXDOMAIN? Something more clever?

ALSO: Is the requirement itself fundamentally flawed? Say somebody entered this text, which clearly isn't meant to contain a domain name:

SELECT posts.id FROM posts;

.id is a valid TLD, and therefore posts.id will become a link to an unintended site. I don't see a way to prevent that, which leads me to believe the problem might not have a single ideal solution. Or does it?
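
For what it's worth, running the mockup above on that line (assuming the SNIPped TLD list includes id) shows the damage:

htmlify('SELECT posts.id FROM posts;');
// => 'SELECT <a href="http://posts.id">posts.id</a> FROM posts;'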

EDIT: I did some testing with Wireshark and Chrome. It appears that any address bar input that looks like a FQDN will get looked up in DNS. Even single words are checked against every domain suffix in the system's DNS search list. This is mixed with a flurry of HTTPS traffic to Google, which is likely populating the find-as-you-type list. Not sure if Google is "helping" the browser arrive at its final decision, or if that happens entirely client-side.

smitelli
  • You could run Wireshark while you type in the address bar to see what it's doing. My guess is they do it with a database, at the same time as they're looking up search completions and prefetching search results. – Barmar May 19 '14 at 19:50
  • @Barmar I would think (hope) that traffic would be encrypted with SSL/TLS, no? – smitelli May 19 '14 at 19:57
  • All you want to know is whether it's asking the Google server or doing DNS pre-fetching. Since DNS is not encrypted, you'd be able to see that. If it's sending to Google, it doesn't matter if it's encrypted or not. – Barmar May 19 '14 at 20:06
  • @Barmar The question has been amended with my findings. – smitelli May 19 '14 at 22:06

2 Answers


First you ask:

How do they do it?

Firefox doesn't. In Firefox, there is no validation of the TLD. If you paste dad.coffeeandmilk into the address bar and hit Enter, Firefox will still try to take you there, and you will get:

Firefox can't find the server at www.dad.coffeeandmilk.

Second you ask:

The problem might not have a single ideal solution. Or does it?

Your hunch is right. There is no way to ensure that you can remove "fake" domain names 100% of the time because TLDs can occur in other contexts, such as VB.NET. However, here are a few hints to help with your quest:

A. People stopped trying to match every single TLD years ago. You may still find a few mega-regexes floating around to match an email address, but those exist mostly for sport.

B. You can try to exclude contexts where you know a URL should not appear. For instance, if you have clear markers around your SQL strings, you can take those spans out of consideration (see the sketch after point C). See Match a Pattern Except in Situations s1, s2, s3

C. To illustrate point A, this is what you find today for the URL pattern in the RegexBuddy library (with the http portion removed):

[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
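
To make point B concrete, here is a rough sketch that treats single-quoted strings as the context to skip (the quoting convention, the illustrative function name, and the truncated TLD list are all assumptions). The trick is to match the unwanted context first, and only build a link when the domain capture group actually participated in the match:

function htmlifySkippingQuotes(sourceText) {
    // The quoted-string alternative wins first, swallowing any
    // would-be domains inside it. TLD list truncated for illustration.
    var detector = /'[^']*'|([^\s]+\.(?:com|net|org|biz))/g;

    return sourceText.replace(detector, function(match, p1) {
        if (p1 === undefined) return match; // inside quotes: leave untouched
        return '<a href="http://' + p1 + '">' + p1 + '</a>';
    });
}

And to actually use the point C pattern in JavaScript, you would reattach a scheme and add the case-insensitive flag, roughly (the \b and the gi flags are my additions):

var urlPattern = /\bhttps?:\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/gi;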
zx81
  • Pasting `dad.coffeeandmilk` into Chrome's address bar takes me to a Google search result page asking "Did you mean: dad.coffee and milk". So it either must've known ahead of time the name wasn't valid, or caught the DNS failure and redirected me to search results in an attempt to be helpful. – smitelli May 19 '14 at 19:55
  • @smitelli Edited `They don't` to `Firefox doesn't`. :) At any rate... Not a big deal for a browser to keep an updated list of current TLDs. Presumably they're not looking it up with every request. Are you able to do that in your application (maintain a list that you update maybe once daily?) – zx81 May 19 '14 at 19:59
  • I certainly could auto-update the TLD list, but it makes the idea of solely using a regex to do the find/replace seem less and less appealing. As of this writing, such a regex would be around 3 KB, and it will only grow as the years go by. – smitelli May 19 '14 at 22:04

You can simply do a DNS lookup of anything that's in the form xxx.yyy. Words connected by dots are not common in text other than as domain names, so this should not cause an excessive number of DNS lookups. You could keep a cache of results to avoid redundant lookups.
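
A sketch of that approach in Node.js (dns.promises.lookup is a real API, available since Node 10.6; the Map cache and the isRealDomain name are my own illustration):

var dns = require('dns').promises;
var seen = new Map(); // hostname -> true/false

// Resolve each candidate hostname once; a nonexistent domain surfaces
// as an ENOTFOUND error, the NXDOMAIN case from the question.
async function isRealDomain(hostname) {
    if (seen.has(hostname)) return seen.get(hostname);
    var exists;
    try {
        await dns.lookup(hostname);
        exists = true;
    } catch (err) {
        exists = false;
    }
    seen.set(hostname, exists);
    return exists;
}

Note that this would still resolve posts.id if someone has registered it, so it complements rather than replaces context-based filtering.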

There's one context where words like this are common, though: programming code. If you have any kind of markup hints that code was posted, don't try to look for URLs in these blocks.
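
If the markup happens to be <code> tags (an assumption; substitute whatever markers you actually have), the exclusion could be as simple as splitting on those blocks and only processing the pieces outside them:

// Split on <code>...</code> blocks, keeping the delimiters, and run
// the question's htmlify() only on the text between them.
function htmlifyOutsideCode(sourceText) {
    return sourceText
        .split(/(<code>[\s\S]*?<\/code>)/)
        .map(function(piece) {
            return piece.indexOf('<code>') === 0 ? piece : htmlify(piece);
        })
        .join('');
}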

Barmar