
I have a bunch of functions that filter a page down to the domains attached to email addresses. It's all working great except for one small thing: some of the links are coming out like this:

EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.

I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.

EDIT

The function I'm using:

function filterByDomain(array) {
    var regex = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", 'gi');
    return array.filter(function(text){
        return regex.test(text);
    });
}
Markus
watzon
  • I'd do it like this, `(.+)\.[\w]`. This will search till the last period with a word character after it. This will keep the second level domain in foreign domains though, for example `ac.uk`, the `ac` would remain. – chris85 May 30 '15 at 15:45
  • @mascaliente: I have provided an update to my answer based on your edit. – anubhava May 31 '15 at 14:32

2 Answers


You can probably use this regex to match your TLD for each case:

/^[^.\n]+\.[a-z]{2,63}$/gim


Your validation function can be:

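function filterByDomain(array) {
    // No "g" flag: a global regex keeps lastIndex between test() calls,
    // which makes a reused regex skip matches on later strings.
    var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
    return array.filter(function(text){
        return regex.test(text);
    });
}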

PS: Do read this Q&A (http://stackoverflow.com/questions/9238640/how-long-can-a-tld-possibly-be) to see that up to 63 characters are allowed in a TLD.
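
Since the question asks to remove everything after the TLD rather than to drop the affected entries, here is a minimal sketch of that variant. The helper name `stripAfterTld` and the hostname pattern are assumptions for illustration, not part of the original answer; the pattern only accepts letters, digits and hyphens in labels and an alphabetic TLD of 2 to 63 letters.

```javascript
// Hypothetical helper: keep the leading hostname, drop whatever follows the TLD.
function stripAfterTld(array) {
    // Dot-separated labels ending in an alphabetic TLD (2 to 63 letters).
    // No "g" flag, so the regex carries no state between calls.
    var regex = /^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,63}/i;
    return array
        .map(function (text) {
            var match = text.match(regex);
            // null marks entries that do not start with something domain-like
            return match ? match[0] : null;
        })
        .filter(function (domain) {
            return domain !== null;
        });
}

var cleaned = stripAfterTld([
    'EXAMPLE.COM',
    'EXAMPLE.ORG.',
    'EXAMPLE.ORG>.',
    'EXAMPLE.COM(COMMENT)"',
    'DEPT.EXAMPLE.COM'
]);
```

This cannot tell where a TLD ends if letters follow it directly, so it is only suitable for trailing punctuation like the cases listed in the question.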

anubhava
  • For some reason the `\b` is breaking my code. I'm updating my question with the function I'm using – watzon May 30 '15 at 15:44
  • Oh don't use `RegExp` constructor (that requires `\\b` btw). Just use `var regex = /([^.\n]+\.[a-z]{2,3}\b)/gi` – anubhava May 30 '15 at 15:46
  • Please note that a dot after the TLD is actually a valid (and strictly speaking, required) part of a FQDN. – Cu3PO42 May 30 '15 at 16:17
  • `.[a-z]{2,3}` is not a good idea: there is `.info` and even generic TLDs – Jan Turoň May 30 '15 at 18:23
  • After your edit, you just substituted the `{2,6}` range in the OP question for `{2,5}`. Are you absolutely certain that this answers the question, sir? – Jan Turoň May 31 '15 at 11:48
  • Are you sure you have read the [original question posted](http://stackoverflow.com/revisions/30547948/2) which my answer was based upon. Now coming back to TLDs it seems even 6 is not enough and [they allow up to 63 characters](http://stackoverflow.com/questions/9238640/how-long-can-a-tld-possibly-be). – anubhava May 31 '15 at 14:28
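
To illustrate the `RegExp` constructor point raised in the comments above, here is a minimal sketch. The variable names are made up for this example. Inside a string literal, `\b` is a backspace character and `\.` is just `.`, so the pattern has to be written with doubled backslashes when it is passed as a string.

```javascript
// As posted in the question: the string escapes are consumed before RegExp sees them.
var fromString = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", "i");

// Doubled backslashes reach the regex engine as \n, \. and \b.
var fromEscapedString = new RegExp("([^.\\n]+\\.[a-z]{2,6}\\b)", "i");

// A regex literal needs no doubling.
var fromLiteral = /([^.\n]+\.[a-z]{2,6}\b)/i;

var brokenResult = fromString.test("EXAMPLE.COM");        // needs a literal backspace, so no match
var escapedResult = fromEscapedString.test("EXAMPLE.COM");
var literalResult = fromLiteral.test("EXAMPLE.COM");
```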

I'd match all leading [\w.] and omit the last dot, if any:

var match = url.match(/^[\w.]+/);            // null when url does not start with a word character or dot
var result = match ? match[0] : "";
if(result.slice(-1)==".") result = result.slice(0,-1);

Note that \w should be replaced with something more sophisticated:

  • _ is part of the \w set but should not appear in the domain part of a url
  • - is not part of \w but can appear in a url, as long as it is not adjacent to . or -

To keep the regexp simple and the code readable, I'd do it this way:

  1. replace _ with # in the url (both # and _ can appear only after the TLD)
  2. replace - with _ (_ is part of \w)
  3. after the regexp test, replace _ back with -

A URL like www.-example-.com would still pass; it can be detected by searching for [.-]{2,}
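
The substitution steps above could be sketched roughly as follows. The function names `extractDomain` and `hasAdjacentSeparators` are hypothetical, and returning an empty string when nothing matches is an assumption of this sketch.

```javascript
// Hypothetical helper following the three substitution steps described above.
function extractDomain(text) {
    // Step 1: _ becomes #, so it can no longer match \w.
    // Step 2: - becomes _, so hyphens are now covered by \w.
    var swapped = text.replace(/_/g, "#").replace(/-/g, "_");
    var match = swapped.match(/^[\w.]+/);
    if (!match) return "";
    // Step 3: turn _ back into - in the matched part.
    var result = match[0].replace(/_/g, "-");
    // Omit the last dot, if any.
    if (result.slice(-1) === ".") result = result.slice(0, -1);
    return result;
}

// Detects runs such as ".-" or "-." that the match above lets through.
function hasAdjacentSeparators(domain) {
    return /[.-]{2,}/.test(domain);
}
```

For example, `extractDomain('EXAMPLE.ORG>.')` stops at the `>` and yields the bare domain, while `hasAdjacentSeparators('www.-example-.com')` flags the malformed name.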

Jan Turoň