I am attempting to split domain names into three parts (subdomain, domain, TLD) and am having trouble.

I can't figure out a way to match any number of subdomains without the subdomain group swallowing the part that should match as the domain or TLD. I am using PCRE regex.

Current regex:

\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s

Data set:

 apple.orange.banana.clevername.co.uk 
 strawberry.apple.orange.banana.clevername.co.uk 
 tangerine.com.au
 simple.com

Note: There are spaces before and after the domains and they will always be lower case.

An example of how this data would match:

apple.orange.banana.clevername.co.uk

subdomain: apple.orange.banana
domain: clevername
tld: co.uk

If I add another fruit to the subdomain (strawberry.apple.orange.banana.clevername.co.uk), the match fails. If I raise the {0,3} quantifier on the subdomain group to a higher number, or make it unbounded, it becomes too greedy and I no longer get a correct match for the domain and TLD. For example:

Modified regex:

\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s

Resulting match with new regex:

strawberry.apple.orange.banana.clevername.co.uk

subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk

I'm sure the regex isn't the most efficient either, so any help or suggestions would be greatly appreciated. Thanks!

Stefan

3 Answers

I believe this should do it for you:

\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s

Tested this in Splunk and it works with your test data set.

Note that this won't work for very short domains like bit.ly, because the pattern requires at least three characters for both the domain and the TLD, and there is no way to tell the domain from the subdomain without looking up the TLD.

For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.
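To make the behaviour above easier to check, here is a minimal sketch that runs the same pattern through Python's `re` module. Two things here are assumptions rather than part of the answer: Python is not PCRE, so the named groups are written as `(?P<name>...)`, and the helper name `split_host` is made up for illustration. The helper pads the input with spaces because the pattern expects whitespace on both sides.

```python
import re

# Same pattern as above, with Python-style named groups (?P<name>...).
PATTERN = re.compile(
    r"\s((?P<subdomain>[a-z0-9.\-]*)\.)?"
    r"(?P<domain>[a-z0-9\-]{3,}(?=\.[a-z.]{3,6}))"
    r"\.(?P<tld>[a-z.]{3,6})\s"
)


def split_host(host):
    """Hypothetical helper: return (subdomain, domain, tld) or None."""
    # The pattern requires whitespace before and after the host name.
    match = PATTERN.search(f" {host.strip()} ")
    if match is None:
        return None
    # The subdomain group is optional, so it may be None when absent.
    return (match.group("subdomain") or "", match.group("domain"), match.group("tld"))


if __name__ == "__main__":
    for host in (
        "apple.orange.banana.clevername.co.uk",
        "strawberry.apple.orange.banana.clevername.co.uk",
        "tangerine.com.au",
        "simple.com",
        "something.bit.ly",  # the short-domain case described above
    ):
        print(host, "->", split_host(host))
```

On the last input the pattern still matches, but it reports `something` as the domain and `bit.ly` as the TLD, which is the ambiguity described above.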

Syon

I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:

\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s

It works on the whole test data set (I trimmed the spaces, though), as well as on short domains like bit.ly. It also works for new top-level domains like .cancerresearch. See the result: https://regex101.com/r/nX6yQ7/4

Note: The regex explicitly forbids com as the domain. This exclusion needs to be extended if other three-character second-level labels (the {3 characters}.xyz pattern) need to be supported.

zapdev

You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.

Note that the list also treats the domains of some web hosting providers as public suffixes. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. The list also has to be parsed carefully, as it contains comments and exceptions.
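The longest-suffix idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `PUBLIC_SUFFIXES` is a tiny hand-picked subset rather than the real list, the function name `split_by_public_suffix` is hypothetical, and the sketch ignores the wildcard and exception rules as well as the comments that a real parser of the list would have to handle.

```python
# Hand-picked subset for illustration only; the real Public Suffix List is
# much larger and contains wildcard and exception rules not handled here.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk", "au", "com.au", "blogspot.com"}


def split_by_public_suffix(hostname, suffixes=PUBLIC_SUFFIXES):
    """Hypothetical helper: split a host name at its longest known suffix."""
    labels = hostname.strip().lower().split(".")
    # Start at index 1 so at least one label is left over for the domain,
    # and walk from the longest candidate suffix to the shortest.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return {
                "subdomain": ".".join(labels[: i - 1]),
                "domain": labels[i - 1],
                "tld": candidate,
            }
    return None  # no known suffix, or the name is itself a bare suffix


if __name__ == "__main__":
    for host in (
        "strawberry.apple.orange.banana.clevername.co.uk",
        "tangerine.com.au",
        "simple.com",
        "example.blogspot.com",
    ):
        print(host, "->", split_by_public_suffix(host))
```

Because the longest candidate is tried first, `co.uk` wins over `uk` and `blogspot.com` wins over `com`, with no limit on the number of subdomain labels.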

Martin