3

I'm using this function:

function get_domain($url)
{
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }
  return false;
}

$referer = get_domain($_SERVER['HTTP_REFERER']);

What I need is a different regex for it, if someone would be so kind as to help: one that returns the whole domain, including subdomains.

Here is the actual problem I have now. When bloggers link from, for example, myblog.blogger.com, the referer this function returns is just blogger.com, which is not ideal.

If someone could help me with a regex for the function above that keeps the subdomain, I'd appreciate it a lot!

Thanks!

Andreas

4 Answers

12

This regex should match a domain in a string, including any subdomains:

/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/

Translated into rough English, it works like this: "match the first part of the string that looks like 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' parts that might precede it."

See it in action here: http://regexr.com?2vppk

Note that the multiline and global flags in that link are only there so the regex can match across the entire blob of test text; you don't need them if you're passing only one line to the regex.
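As a rough sketch of how this could slot into the question's function: `parse_url()` already isolates the host, so the regex only has to confirm that the host looks like a domain. The function name `get_full_domain` is made up for illustration. The pattern here is a variant of the one above, not the same one: it is anchored with `^` and `$`, uses the `i` flag, and drops the `|` from the character classes, because inside `[...]` a `|` is a literal pipe rather than an alternation.

```php
<?php
// Hypothetical variant of get_domain() that keeps subdomains.
function get_full_domain($url)
{
    // parse_url() returns null when there is no host and false for badly malformed URLs.
    $host = parse_url((string) $url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    // Anchored, case-insensitive variant of the pattern above, without the "|".
    if (preg_match('/^([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+$/i', $host)) {
        return strtolower($host);
    }
    return false;
}
```

Under these assumptions, `get_full_domain('http://myblog.blogger.com/2012/01/post.html')` gives `myblog.blogger.com`, while a string without a host gives `false`.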

Jarmund
1

Good luck with the above: domain names can now contain non-Roman characters. These would have to be converted into an equivalent but unique ASCII form before a regex could work reliably. See RFC 3490, Internationalizing Domain Names in Applications (IDNA), at https://www.rfc-editor.org/rfc/rfc3490, which says:

Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
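In PHP, one possible way to get that ASCII form before applying any of the regexes in this thread is `idn_to_ascii()`, which is only available when the intl extension is loaded. The sketch below is an illustration, not a complete IDNA treatment; the helper name `to_ascii_host` is hypothetical, and the fallback branch for installations without intl is an assumption that simply refuses non-ASCII input.

```php
<?php
// Hypothetical helper: convert a host name to its ASCII (Punycode) form.
function to_ascii_host($host)
{
    if (function_exists('idn_to_ascii')) {
        // Requires the intl extension; returns false if the conversion fails.
        $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        return $ascii === false ? false : strtolower($ascii);
    }
    // Assumed fallback without intl: pass through hosts that are already printable ASCII.
    return preg_match('/^[\x21-\x7e]+$/', $host) ? strtolower($host) : false;
}
```

A host that is already ASCII, such as `myblog.blogger.com`, comes back unchanged, so the result can be fed to an ASCII-only pattern either way.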

Fred Gannett
-1

I think of this as a refinement of the first suggestion. The main improvements:

  • does not match invalid patterns such as sub..domain.xyz
  • captures more than one subdomain as a group
  • captures the port if given
https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|\:\d{1,5})

Test it: https://regex101.com/r/njFIil/1

This regex does not handle any Unicode characters, which could be a problem, as mentioned above.
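To make the three groups concrete, here is a minimal PHP sketch around the pattern exactly as written above. The function name `split_https_host` and the returned array keys are made up for illustration, and `~` is used as the delimiter so the slashes in `https://` need no escaping.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function split_https_host($text)
{
    $pattern = '~https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|\:\d{1,5})~';
    if (!preg_match($pattern, $text, $m)) {
        return false;
    }
    return [
        'subdomains' => rtrim($m[1], '.'),               // e.g. "a.b" for a.b.example.org
        'domain'     => $m[2],                           // e.g. "example.org"
        'port'       => ltrim($m[3] ?? '', ": \t\r\n"),  // empty string when no port is given
    ];
}
```

With this wrapper, `https://myblog.blogger.com:8080` splits into `myblog`, `blogger.com` and `8080`, and `https://sub..domain.xyz` returns `false`. Because the third group requires the end of the string, whitespace, or a port directly after the host, a URL such as `https://example.com/path` does not match either.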

chris
  • Your expression does not work without a port number, as shown here: https://regex101.com/r/RZSKc1/1 - you should make it optional. Also, adding a | inside the brackets allows it to be used. I created many examples for you here: https://regex101.com/r/W93YdL/1 – Alexandre Salomé Mar 17 '21 at 19:08
  • Hey, thanks for your feedback! I improved the regex; it now reacts correctly to all of your tests except your first 'invalid' domain, which doesn't seem wrong to me. – chris Mar 17 '21 at 20:45
-2

Better solution:

/^([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/

Regex sample: https://regexr.com/4k71a

And for email address:

/^[a-z0-9|.|-]+[a-z0-9]{1,}@([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/
  • Why and how is this better? It matches `|||||||a.zzzzzz` for example. Please, have a look at these sites: [TLD list](https://www.iana.org/domains/root/db); [valid/invalid addresses](https://en.wikipedia.org/wiki/Email_address#Examples); [regex for RFC822 email address](http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html) – Toto Sep 01 '19 at 14:21
  • This would only be valid for two-character top-level domain names. What about "vienna", "berlin" or "com"? – SeriousM Feb 24 '20 at 21:34
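
The objection in the first comment can be checked directly. The sketch below compares the answer's domain pattern with a variant that only removes the `|` from the character classes; the variable and function names are hypothetical, and the variant is an illustration rather than a complete domain validator.

```php
<?php
// Pattern as posted in the answer: "|" inside [...] is a literal pipe.
$original = '/^([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/';
// Same pattern with the pipes removed from the character classes.
$without_pipe = '/^([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/';

// Hypothetical helper: true when $host matches $pattern.
function host_matches($pattern, $host)
{
    return preg_match($pattern, $host) === 1;
}
```

`host_matches($original, '|||||||a.zzzzzz')` is true, while the same call with `$without_pipe` is false. Both patterns accept `example.vienna`, since `{2,}` means two or more letters, and both reject single-character labels such as `a.com`, because each label needs at least two characters.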