JavaScript to remove whatever is after the tld and before the whitespace

Question

I have a bunch of functions that filter a page down to the domains attached to email addresses. It's all working great except for one small thing. Some of the links are coming out like this:

EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.

I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.

EDIT

The function I'm using:

function filterByDomain(array) {
    var regex = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", 'gi');
    return array.filter(function(text){
        return regex.test(text);
    });
}

I'd do it like this: `(.+)\.[\w]`. This searches up to the last period that is followed by a word character. It will keep the second-level domain in country-code domains though; for example, with `ac.uk` the `ac` would remain. – chris85 May 30 '15 at 15:45
@mascaliente: I have provided an update to my answer based on your edit. – anubhava May 31 '15 at 14:32

score 2 · Answer 1 · edited May 23 '17 at 12:06

You can probably use this regex to match your TLD for each case:

/^[^.\n]+\.[a-z]{2,63}$/gim

Your validation function can be:

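function filterByDomain(array) {
    // No `g` flag: a global regex keeps `lastIndex` between test() calls,
    // which would make each result depend on the previous array element.
    var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
    return array.filter(function(text){
        return regex.test(text);
    });
}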

PS: Do read [this Q & A](http://stackoverflow.com/questions/9238640/how-long-can-a-tld-possibly-be) to see that up to 63 characters are allowed in a TLD.
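
As a rough check of what this filter keeps, the sketch below runs the pattern over the sample values from the question. The `samples` array is just those values typed out by hand, and the `g` flag is left off on the assumption that `test` should not carry `lastIndex` from one element to the next.

```javascript
// Sample values copied from the question.
var samples = [
    'EXAMPLE.COM', 'EXAMPLE.ORG.', 'EXAMPLE.ORG>.', 'EXAMPLE.COM"',
    'EXAMPLE.COM".', 'EXAMPLE.COM).', 'EXAMPLE.COM(COMMENT)"',
    'DEPT.EXAMPLE.COM', 'EXAMPLE.ORG', 'EXAMPLE.COM.'
];

function filterByDomain(array) {
    var regex = /^[^.\n]+\.[a-z]{2,63}$/i; // no `g`, so test() is stateless
    return array.filter(function (text) {
        return regex.test(text);
    });
}

var kept = filterByDomain(samples);
console.log(kept); // [ 'EXAMPLE.COM', 'EXAMPLE.ORG' ]
```

Only the two entries that are already clean pass. Entries with trailing characters are dropped rather than trimmed, and `DEPT.EXAMPLE.COM` is rejected because `[^.\n]+` cannot span a dot.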

answered May 30 '15 at 15:37

anubhava

For some reason the `\b` is breaking my code. I'm updating my question with the function I'm using – watzon May 30 '15 at 15:44
Oh don't use `RegExp` constructor (that requires `\\b` btw). Just use `var regex = /([^.\n]+\.[a-z]{2,3}\b)/gi` – anubhava May 30 '15 at 15:46
Please note that a dot after the TLD is actually a valid (and strictly speaking, required) part of a FQDN. – Cu3PO42 May 30 '15 at 16:17
`.[a-z]{2,3}` is not a good idea: there is `.info` and even generic TLDs – Jan Turoň May 30 '15 at 18:23
After your edit, you just substituted the `{2,6}` range in the OP question for `{2,5}`. Are you absolutely certain that this answers the question, sir? – Jan Turoň May 31 '15 at 11:48
Are you sure you have read the [original question posted](http://stackoverflow.com/revisions/30547948/2), which my answer was based on? Coming back to TLDs, it seems even 6 is not enough, and [they allow up to 63 characters](http://stackoverflow.com/questions/9238640/how-long-can-a-tld-possibly-be). – anubhava May 31 '15 at 14:28
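
The escaping issue raised in the comments above can be seen in a small sketch. The variable names are made up for illustration, and the flags are reduced to `i` (an assumption, not what the question uses) so that `test` keeps no state between calls.

```javascript
// Inside a string literal, "\b" is the backspace character (U+0008)
// and "\." collapses to ".", so the constructor never sees the intended escapes.
console.log("\b" === "\u0008"); // true
console.log("\." === ".");      // true

var broken  = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", "i");    // single backslashes
var escaped = new RegExp("([^.\\n]+\\.[a-z]{2,6}\\b)", "i"); // doubled backslashes
var literal = /([^.\n]+\.[a-z]{2,6}\b)/i;                    // regex literal

console.log(broken.test("EXAMPLE.COM"));  // false: the pattern demands a backspace after the TLD
console.log(escaped.test("EXAMPLE.COM")); // true
console.log(literal.test("EXAMPLE.COM")); // true
```

The literal form avoids the double escaping entirely, which is probably why it is the easier choice when the pattern is fixed.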

Jan Turoň · Answer 2 · 2015-05-30T18:30:17.970

I'd match all leading [\w.] and omit the last dot, if any:

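var match = url.match(/^[\w.]+/);    // null if url starts with neither a word character nor a dot
var result = match ? match[0] : "";
if (result.slice(-1) == ".") result = result.slice(0, -1);    // drop the trailing dot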

Note that `\w` should be replaced with something more precise:

`_` is part of the `\w` set but should not be in the domain part of the URL
`-` is not part of `\w` but can be in the URL, as long as it is not adjacent to `.` or `-`

To keep the regexp simple and the code readable, I'd do it this way:

replace `_` with `#` in the URL (both `#` and `_` can only appear after the TLD)
replace `-` with `_` (`_` is part of `\w`)
after the regexp test, replace `_` back with `-`

A URL like `www.-example-.com` would still pass; it can be detected by searching for `[.-]{2,}`.
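
The substitution steps above could be sketched roughly like this. The function name `extractDomain` and the choice to return `null` for rejected input are assumptions made for the example, not part of the original approach.

```javascript
// Hypothetical helper: returns the leading domain-like part of `text`,
// or null when nothing usable is found.
function extractDomain(text) {
    var swapped = text
        .replace(/_/g, "#")   // "_" should not match \w, so turn it into "#"
        .replace(/-/g, "_");  // "-" should match, so let it pass as "_"

    var match = swapped.match(/^[\w.]+/);
    if (!match) return null;

    var domain = match[0].replace(/_/g, "-"); // restore the hyphens

    // Reject runs such as ".-", "-.", ".." or "--".
    if (/[.-]{2,}/.test(domain)) return null;

    // Drop a single trailing dot, if any.
    if (domain.slice(-1) === ".") domain = domain.slice(0, -1);
    return domain;
}

console.log(extractDomain('EXAMPLE.COM).'));         // EXAMPLE.COM
console.log(extractDomain('MY-SITE.EXAMPLE.ORG>.')); // MY-SITE.EXAMPLE.ORG
console.log(extractDomain('www.-example-.com'));     // null
```

One caveat with the `[.-]{2,}` check as written: it also rejects names containing `--`, which includes internationalized domains in their `xn--` form.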
