Google+ seems to use the king of URL regexes to parse URLs out of user posts. It doesn't require a protocol and is good at ignoring punctuation. For example, if I post "I like plus.google.com.", the site turns plus.google.com into a link and leaves the trailing period out of it. So if anyone knows of a regex that can parse URLs both with and without protocols and is good at ignoring surrounding punctuation, please answer with it.

I don't think this question is a dupe, because all the answers I've seen to similar questions seem to require a protocol in the URL.

Thanks

JoshNaro

3 Answers

Here's a more complete (full URL) implementation. Note that it is not fully RFC 3986 compliant: it is missing some TLDs, allows some invalid country-code TLDs, allows the protocol part to be omitted (as requested in the original question), and has some other imperfections. The upside is that it is simple, much shorter than many other implementations, and does >95% of the job.

#!/usr/bin/perl -w
# URL grammar, not 100% RFC 3986 but pretty good considering the simplicity.
# For more complete implementation options see:
#   http://mathiasbynens.be/demo/url-regex
#   https://gist.github.com/dperini/729294
#   https://github.com/garycourt/uri-js (RFC 3986 compliant)
#
my $Protocol = '(?:https?|ftp)://';
# Add more new TLDs for completeness
my $TLD = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $UserAuth = '(?:[^\s:@]+:[^\s@]*@)';
my $HostName = '(?:(?:[-\w]+\.)+?' . ${TLD} . ')';
my $Port = '(?::\d+)';
my $Pathname = '/[^\s?#&]*';
# A key, optionally followed by "=" and a value of any length
my $Arg = '\w+(?:=[^\s&]*)?';
my $ArgList = "${Arg}(?:\&${Arg})*";
my $QueryArgs = '\?' . ${ArgList};
my $URL = qr/
    (?:${Protocol})?    # Optional, not per RFC!
    ${UserAuth}?
    ${HostName}
    ${Port}?
    (?:${Pathname})?
    (?:${QueryArgs})?
/sox;

while (<>) {
    while (/($URL)/g) {
         print "found URL: $&\n";
    }
}
arielf

A reasonable strategy would be to use a regexp to match top level domains (TLD) preceded by a dot, and then run a known-host table lookup or DNS query as a verification step on the suspected hostname string.

For example, here is a Perl session demonstrating the first part of the strategy:

$ cat hostname-detector
#!/usr/bin/perl -w
# Add more country/new TLDs for completeness
my $TLD = '(?:com|net|info|org|gov|edu)';
while (<>) {
    while (/((?:[-\w]+\.)+?$TLD)/g) {
         print "found hostname: $&\n";
    }
}


$ ./hostname-detector
"I like plus.google.com."
found hostname: plus.google.com

a sentence without a hostname.

here's another host: free.org
found hostname: free.org

a longer.host.name.psu.edu should work too.
found hostname: longer.host.name.psu.edu

a host.with-dashes.gov ...
found hostname: host.with-dashes.gov
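
The second part of the strategy, the verification step, could look roughly like the sketch below. The `%known_hosts` table, its entries, and the helper names `is_known_host`, `resolves`, and `verify_host` are hypothetical and exist only for illustration. `gethostbyname` is Perl's built-in resolver call, and what it returns depends on the local DNS setup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table of hosts we already trust; fill in from your own data.
my %known_hosts = map { $_ => 1 } qw(plus.google.com free.org);

# Table lookup, lowercased first because hostnames are case-insensitive.
sub is_known_host {
    my ($host) = @_;
    return exists($known_hosts{ lc $host }) ? 1 : 0;
}

# DNS lookup: true if the resolver returns an address for $host.
sub resolves {
    my ($host) = @_;
    my $addr = gethostbyname($host);
    return defined $addr ? 1 : 0;
}

# Accept a candidate if it is in the table, otherwise fall back to DNS.
sub verify_host {
    my ($host) = @_;
    return is_known_host($host) || resolves($host);
}

# Both candidates are in the table, so this demo never touches the network.
for my $candidate (qw(plus.google.com Free.ORG)) {
    printf "%-16s %s\n", $candidate, verify_host($candidate) ? 'ok' : 'rejected';
}
```

Checking the table first keeps the common case cheap and avoids a DNS round trip for every dotted word that happens to end in something TLD-like.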
arielf
  • The end goal is to hit the site and retrieve metadata, so a target verification step will happen. However, I would want all valid URLs to be detected; including forward slashes, query strings, and all the other goodies that URLs tend to contain. – JoshNaro Feb 05 '13 at 13:49

@arielf

It looks to me like the following line in your URL regex:

my $HostName = '(?:(?:[-\w]+\.)+?' . ${TLD} . ')';

should be fixed this way:

my $HostName = '(?:(?:[-\w]+\.)+' . ${TLD} . ')';

Otherwise, the lazy quantifier +? stops as soon as the two-letter alternative [a-z]{2} in $TLD can match, so the input http://www.google.com gets parsed as

found URL: http://www.go
found URL: ogle.com
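
To see the difference concretely, here is a small self-contained sketch that runs both hostname patterns against the same input. The helper name `find_hosts` and the trimmed-down patterns (hostname only, with no protocol, port, path, or query) are assumptions made for the illustration, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same TLD alternation as in the answer; [a-z]{2} covers country codes.
my $TLD = '(?:com|net|info|org|gov|edu|[a-z]{2})';

# Hostname patterns: the original (lazy +?) and the proposed fix (greedy +).
my $lazy   = qr/(?:[-\w]+\.)+?$TLD/;
my $greedy = qr/(?:[-\w]+\.)+$TLD/;

# Hypothetical helper: return every match of $re in $text, left to right.
sub find_hosts {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /($re)/g) {
        push @found, $1;
    }
    return @found;
}

my $input = 'http://www.google.com';
print "lazy:   @{[ find_hosts($lazy,   $input) ]}\n";
print "greedy: @{[ find_hosts($greedy, $input) ]}\n";
```

With the lazy pattern the hostname splits into www.go and ogle.com, while the greedy pattern returns www.google.com in one piece. The greedy version still handles the question's "I like plus.google.com." case, because the regex engine backtracks off the trailing period until the TLD matches.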
aixtal