
I'm working with some code that tries to find all the website URLs within a block of text. We already have checks that work fine for URLs formatted like http://www.google.com or www.google.com, but we're trying to find a regex that can also locate a URL in a bare format such as just google.com

Right now our regex searches for every TLD we could find registered, around 1,400 in total, so it looks like this:

/(\S+\.(COM|NET|ORG|CA|EDU|UK|AU|FR|PR)\S+)/i

Except with ALL 1,400 TLDs in the alternation group (the full pattern is around 8,400 characters long). Naturally it's running quite slowly. We've already considered simply checking for the 10 or so most commonly used TLDs, but I wanted to ask here first whether there's a more efficient way to match this specific format of website URL rather than singling every TLD out.

Machavity
Jsilver
  • Can you post sample text? Because just looking at this, it seems `\S+[a-z]\.[a-z]\S+` with the `i` flag might work. – ctwheels Jan 11 '18 at 21:03
  • Appears to be a duplicate of https://stackoverflow.com/questions/1141848/regex-to-match-url – Paul Sturm Jan 11 '18 at 21:03
  • Why have you added 1400 domains to a regex? If you hardcode domains like that, you can/should use strpos, which is much faster. Regex is for patterns; domains are not patterns. – Andreas Jan 11 '18 at 21:14
  • Note that many of the new TLDs are common words, so if somebody simply leaves out a space after a dot, there's a decent chance you'll erroneously pick it out. E.g., "our store is the best.name your own price!" Note that best.name is a valid domain but should not be a link in this context. – Alex Howansky Jan 11 '18 at 21:18
  • Possible duplicate of [Regex to match URL](https://stackoverflow.com/questions/1141848/regex-to-match-url) – ctwheels Jan 11 '18 at 21:24
  • Use RegexFormat to create a [ternary tree](http://www.regexformat.com/version7_files/Rx5_ScrnSht01.jpg) regex. It's faster than lightning, way faster than any other way. Give me the file and I'll make it for you. –  Jan 11 '18 at 21:37
  • I've tested this [100,000 domain names regex](http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/_100_000_Domain_Names.txt); it finds 400,000 names per second. It's about 1.5 MB. Of course, there are about 300 million domain names out there. –  Jan 12 '18 at 00:30

2 Answers


You could use a two-pass search.

First, search for every URL-like string, e.g.:

((http|https):\/\/)?([\w-]+\.)+[\S]{2,5}

Then, on every match, run cheap non-regex checks: is the length reasonable, is the text after the last dot in your TLD list, and so on.

function isUrl($urlMatch) {
    // Accept a candidate only if its last dot-separated part is a known TLD.
    $tldList = ['com', 'net'];
    $urlParts = explode(".", $urlMatch);
    // Lowercase so the check agrees with the case-insensitive regex pass.
    $lastPart = strtolower(end($urlParts));
    return in_array($lastPart, $tldList);
}
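The two passes can be wired together with preg_match_all. A minimal sketch, assuming a short illustrative TLD list and a findUrls helper name (neither is from the answer):

```php
// Pass 2 helper: keep a candidate only if its last dot-part is a known TLD
// (same idea as isUrl above; the TLD list here is illustrative).
function isUrl($urlMatch) {
    $tldList = ['com', 'net', 'org'];
    $parts = explode('.', $urlMatch);
    return in_array(strtolower(end($parts)), $tldList);
}

// Pass 1: grab every URL-like token with the cheap generic pattern,
// then filter the matches with the TLD check.
function findUrls($text) {
    preg_match_all('~((http|https)://)?([\w-]+\.)+\S{2,5}~i', $text, $m);
    return array_values(array_filter($m[0], 'isUrl'));
}

print_r(findUrls('Visit google.com or http://example.net for info'));
// finds google.com and http://example.net
```

This keeps the expensive part of the work out of the regex: the pattern stays tiny, and the TLD lookup is a plain array check that is easy to swap for a hash set if the list grows.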
Máthé Endre-Botond

Example

function get_host($url) {
    // Extract the host part, e.g. "www.google.com".
    $host = parse_url($url, PHP_URL_HOST);
    $names = explode(".", $host);

    // Single-label hosts such as "localhost" have no TLD.
    if (count($names) == 1) {
        return $names[0];
    }

    // Keep only the last two labels: registrable domain + TLD.
    $names = array_reverse($names);
    return $names[1] . '.' . $names[0];
}

Usage

echo get_host('https://google.com'); // google.com
echo "\n";
echo get_host('https://www.google.com'); // google.com
echo "\n";
echo get_host('https://sub1.sub2.google.com'); // google.com
echo "\n";
echo get_host('http://localhost'); // localhost
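One caveat for the question's bare google.com inputs: parse_url() reports no host for scheme-less strings, so a scheme must be prepended first. A minimal guard (normalize_url is an assumed helper, not part of the answer):

```php
// parse_url() only sees a host when a scheme is present, so prepend one
// to bare domains before calling get_host() on them.
function normalize_url($url) {
    return parse_url($url, PHP_URL_SCHEME) === null ? 'http://' . $url : $url;
}

var_dump(parse_url('google.com', PHP_URL_HOST));           // NULL
echo parse_url(normalize_url('google.com'), PHP_URL_HOST); // google.com
```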


odan