This should be simple, but it's eluding me. There are many good and bad regex methods to match a URL, with or without the protocol, with or without www. The problem I have is this (in JavaScript): if I use a regex to match URLs in a text string, and set it so that it will match just 'domain.com', it also catches the domain of an e-mail address (the part after '@'), which I don't want. A negative lookbehind solves it - but obviously not in JS.

This is my nearest success so far:

 /^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

but that fails if the match is not at the start of the string. And I'm sure I'm tackling it the wrong way. Is there a simple answer out there anywhere?

EDIT: Revised regex to respond to a few of the comments below (sticks with 'www' rather than allowing sub-domains):

\b(www\.)?([^@])(\w*\.)(\w{2,3})(\.\w{2,3})?(\/\S*)?$

As mentioned in the comments however, this still matches the domain after a @.

Thanks

sideroxylon
  • This [question](http://stackoverflow.com/questions/641407/javascript-negative-lookbehind-equivalent) *may* help. – merlin2011 May 08 '14 at 23:02
  • Side note: Are you aware of the huge amount of new TLDs that are available or soon to be available? – Marty May 08 '14 at 23:02
  • Maybe look through the reference section at http://regexr.com/ – HJ05 May 08 '14 at 23:03
  • Thanks all. I'm aware of the deficiencies in my regex with regard to numbers in domains, new TLDs, etc - and I know how to fix that - but what I'm trying to do is simply find URLs in a text string and convert them to clickable links. Obviously I don't want to capture the domain of an e-mail address. If I replace `^` with `\b`, the presence of the @ no longer stops the email domain matching. – sideroxylon May 08 '14 at 23:21
  • well, why wouldn't you want to have email domains clickable? those are 99% of the time valid domain names that one can visit, even if most of the time it's not that interesting. – zmo May 08 '14 at 23:28
  • @zmo the reason I don't want to match the domain of the e-mail address is because a separate match function turns them into `mailto:` links. I need my URL matching function to leave them alone. Thanks. – sideroxylon May 08 '14 at 23:41

2 Answers

that fails if the match is not at the start of the string

It's because of the ^ at the beginning of the pattern, which anchors the match to the start of the string. Without it:

/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

js> "www.foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["www.foobar.com"]
js> "aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu toto@foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["foobar.com"]

though it still matches the space before the domain, because ([^@]) consumes one character that is not an @. It also makes wrong assumptions about the domain:

  • xyz.example.org is a valid domain not matched by your regexp;
  • www.3x4mpl3.org is a valid domain not matched by your regexp;
  • example.co.uk is a valid domain not matched by your regexp;
  • ουτοπία.δπθ.gr is a valid domain not matched by your regexp.

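What defines a legal domain name? It's just a sequence of labels separated by dots, where each label is made of letters, digits and hyphens (or Unicode characters, for internationalized names). It can't have two dots following each other, and the shortest plausible name looks like \w\.\w\w (as I don't think a one-letter TLD exists).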

Though, the way I'd do it is to simply match everything that looks like a domain, by taking everything that is text with a dot separator using word boundaries (\b):

/\b(\w+\.)+\w+\b/g

js> "aoe toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar", "f00bar.com"]

and then make a second round over the list of domains found, to check whether each one really exists. The downside is that \w and \b in JavaScript regexps are ASCII-only, so neither of them will accept ουτοπία.δπθ.gr as a valid domain name.

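ES6's /u flag does not fix this on its own: \w and \b stay ASCII-only even with it. Unicode property escapes such as \p{L} (added in ES2018, and only valid together with the u flag) do match non-ASCII letters, in engines that support them:

"ουτοπία.δπθ.gr aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/[\p{L}\p{N}_]+(?:\.[\p{L}\p{N}_]+)+/gu)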

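The two-pass idea described above (collect everything that looks like a domain, then validate each candidate) could be sketched roughly as follows. This is only an illustration: the `KNOWN_TLDS` set and the `findLikelyDomains` name are made up for the example, and a real check would use the full TLD list or an actual DNS lookup. It does not deal with e-mail addresses at all.

```javascript
// Hypothetical, deliberately tiny allow-list used only for this sketch.
const KNOWN_TLDS = new Set(["com", "net", "org", "edu", "gr", "uk", "au"]);

function findLikelyDomains(text) {
  // First pass: anything that looks like words separated by dots.
  const candidates = text.match(/\b(\w+\.)+\w+\b/g) || [];
  // Second pass: keep only candidates whose last label is a known TLD.
  return candidates.filter(function (candidate) {
    const tld = candidate.slice(candidate.lastIndexOf(".") + 1).toLowerCase();
    return KNOWN_TLDS.has(tld);
  });
}

console.log(findLikelyDomains("aoe toto.example.org  uaoeu foo.bar $11.00 f00bar.com"));
// → ["toto.example.org", "f00bar.com"]
```

Here "foo.bar" and "11.00" are picked up by the first pass and dropped by the second, since neither "bar" nor "00" is in the sample list.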
edit:

A negative lookbehind solves it - but obviously not in JS.

yes it will, where the engine supports lookbehind: for skipping all e-mail addresses, here's a lookbehind version of the regex, which refuses any match that starts right after an @:

/(?<!@)\b(\w+\.)+\w+\b/g

js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/(?<!@)\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar", "f00bar.com"]

though it's the same story as with Unicode: lookbehind was only standardized in ES2018, so older JS engines will reject it as a syntax error.
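For engines that do have lookbehind, a slightly stricter variant can be sketched like this. It is an illustration under my own assumptions rather than a tested production pattern: the function name is invented, and the extra lookahead is my addition so that the local part of an address such as first.last@example.org is skipped as well.

```javascript
function findDomainsSkippingEmails(text) {
  // (?<![\w.@-]) : the match may not start right after a word character,
  //                dot, hyphen or @, so it cannot begin inside an e-mail address.
  // (?![\w.-]*@) : the match may not be followed by an @ within the same token,
  //                so "first.last" in "first.last@example.org" is rejected too.
  const pattern = /(?<![\w.@-])(?:\w+\.)+\w+\b(?![\w.-]*@)/g;
  return text.match(pattern) || [];
}

console.log(findDomainsSkippingEmails(
  "aoe toto@example.org toto.example.org first.last@mail.example.org foo.bar f00bar.com"
));
// → ["toto.example.org", "foo.bar", "f00bar.com"]
```

Like the other patterns here it relies on \w, so it stays ASCII-only.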

The workaround for engines without lookbehind is to keep an optional @ in the match, and then discard any match that contains an @:

js> "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w+\.)+\w+\b/g).filter(function (x) { return !x.match(/@/) })
["toto.net", "toto.example.org", "foo.bar", "f00bar.com"]

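or, in old Firefox versions only, use array comprehensions (a non-standard SpiderMonkey feature from JS 1.7 that was proposed for ES6 but dropped before the final spec, so it does not run in current engines):

[x for x of "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w+\.)+\w+\b/g) if (!x.match(/@/))];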

one final update:

/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g

> "x.y tot.toc.toc $11.00 11.com 11foo.com toto.11 toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g).filter(function (x) { if (!x.match(/@/)) return x })
[ 'tot.toc.toc',
  '11foo.com',
  'toto.net',
  'toto.example.org',
  'foo.bar',
  'f00bar.com' ]
zmo
  • This fails validation on regex101.com, but seems to pass Firebug. Whatever the case, I updated it to prevent it catching things like '$1.50' - `/(?![^@])?(www\.)?(\w+\.)(\w{2,3})(\.\w{2,3})(\/\S*)?\b/g;`. If I place it before my e-mail matching code, it still matches the domain, and then the e-mail match fails. If I match the URL after the e-mail match, it works, but it seems to be doing a lot of work, as it matches both the text and the mailto href. At least everything works though. So, I'm not sure if this is an answer or not. Thanks anyhow. – sideroxylon May 09 '14 at 00:26
  • well, you should be matching emails and fqdn, and then filter out emails to your email transform code, and the domains to the domain transform code. that would make things simpler. Assuming www starts a domain is wrong, though. But a domain can't be only digits, it needs to have at least one letter within. And anyway, there's only one canonical way to test for domains: it's to actually check them against the DNS registry. – zmo May 09 '14 at 00:35
  • added a regex that weeds out invalid domains based on only digit tld, or only digit domain or one character tld. – zmo May 09 '14 at 00:46

After a lot of messing about, this ended up working (with a definite hat tip to @zmo's final comment):

var rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
var link = txt.match(rx);
if (link !== null) {
  for (var i = 0; i < link.length; i++) {
    if (link[i].indexOf('@') == -1) {
      // create link
    } else {
      // create mailto
    }
  }
}

I'm aware of the limitations with regard to sub-domains, TLDs, etc. (which @zmo has addressed above - and if you need to catch all URLs, I'd suggest you adapt that code), but that was not the main issue in my case. The code in my answer matches URLs in a text string even without 'www.', without also catching the domain of an e-mail address.
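For anyone who wants the "create link" / "create mailto" step spelled out, here is one way it might look using a replace callback instead of a match loop. Treat it as a rough sketch: the `linkify` name, the looser domain pattern and the `http://` prefix are assumptions of mine, not part of the code above, and HTML-escaping of the surrounding text is left out.

```javascript
function linkify(text) {
  // Optional "local@" prefix, then dot-separated labels ending in an
  // alphabetic TLD of two or more letters, then an optional path.
  const pattern = /(?:[\w.+-]+@)?\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b(?:\/\S*)?/gi;
  return text.replace(pattern, function (match) {
    if (match.indexOf("@") !== -1) {
      // The match kept its @, so it is an e-mail address.
      return '<a href="mailto:' + match + '">' + match + "</a>";
    }
    // Otherwise treat it as a bare domain or URL; the scheme is assumed.
    return '<a href="http://' + match + '">' + match + "</a>";
  });
}

console.log(linkify("mail toto@example.org or see foo.net/x today"));
```

Because the e-mail address is consumed as a single match, its domain is never seen again by the URL branch, which is the same trick as keeping the @ in the match and testing for it afterwards.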

sideroxylon
  • 4,338
  • 1
  • 22
  • 40