
I'm trying to write a Java RegEx that will extract the domain name from a list of domains, sub-domains, and multi-level sub-domains.

There are too many domain suffixes to maintain in the RegEx I have written, and there are many more out there: https://publicsuffix.org/list/effective_tld_names.dat

What is a better way to capture the domain name? The goal is to strip the subdomain and extract the domain name so I can resolve or ping it.

This is the RegEx I have come up with:

(\w*.(?:\.co|\.org|\.net|\.int|\.edu|\.gov|\.mil|\.arpa|\.tv|\.aero|\.asia).*)

Here is a sample list I am testing against.

comnettest.google.com
doubleclick.net
googleapis.com
imrworldwide.com
bom.gov.au
www.bom.gov.au
googleapis.com
www.google.com
www.twiiter.com
dynamic.t2.tiles.virtualearth.net 
domain.com
1-A.domain.com
1-A.2-B.domain.com
1-A.2-B.3-C.domain.com
mt0.google.com
twitch.tv
stream.twitch.tv
streamcom.com.au
network.google.com
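
For comparison, here is a minimal sketch of the lookup-based alternative that the suffix list linked above points toward: keep the known suffixes in a set, find the longest one the host ends with, and keep one more label to its left. The class name `RegistrableDomain`, the method `of`, and the tiny hard-coded suffix set are all assumptions made for illustration. A real version would load the rules from effective_tld_names.dat, and this sketch ignores that list's wildcard and exception rules.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class RegistrableDomain {
    // Illustrative subset only (assumption); the real list has thousands of rules.
    private static final Set<String> SUFFIXES = new HashSet<>(Arrays.asList(
            "com", "net", "org", "tv", "au", "com.au", "gov.au"));

    /** Returns the longest known suffix plus one label to its left, or null if none applies. */
    static String of(String host) {
        String h = host.trim().toLowerCase(Locale.ROOT);
        if (SUFFIXES.contains(h)) {
            return null; // the host is itself a bare suffix, e.g. "gov.au"
        }
        String[] labels = h.split("\\.");
        // Try the longest candidate suffix first so "gov.au" wins over "au".
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return null; // no known suffix matched
    }

    public static void main(String[] args) {
        String[] samples = {"www.bom.gov.au", "1-A.2-B.3-C.domain.com",
                "stream.twitch.tv", "streamcom.com.au"};
        for (String s : samples) {
            System.out.println(s + " -> " + of(s));
        }
    }
}
```

With this set, www.bom.gov.au and bom.gov.au both come out as bom.gov.au, which a fixed "last two labels" rule cannot do. Guava's `InternetDomainName.topPrivateDomain()` offers the same kind of lookup against a bundled copy of the list, if adding a dependency is acceptable.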
jordanhill123
  • If you really want to use RegEx, you should locate the last dot that matters and take the string between that dot and the previous one. However, the URL may contain more dots than that, since a URL might look like "sasse.com/google.com", where sasse is the domain and "google.com" is a page or folder on the domain. – Sasse Oct 03 '14 at 06:55
  • Hi Sasse, in my case I won't receive URLs like the one in your example. They will just be sub-domains like the ones listed in my sample above. Thanks. – xsploit Oct 03 '14 at 07:03
  • What about a URL like "com.com.com"? And you're not even getting URLs like "google.com?search=fishy"? Because then you could just do something like: url = url.substring(url.indexOf('.') + 1); // and repeat that as long as the string contains more than one '.'. Otherwise a RegEx pattern like ".*?([^.]+\\.[^.]+)" might work. – Sasse Oct 03 '14 at 07:13
  • I have just found a good RegEx pattern that is suitable in most cases. ([0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3}\.[0-9A-Za-z]{2,3}|[0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3})$ - From http://stackoverflow.com/questions/12772423/regex-match-main-domain-name – xsploit Oct 03 '14 at 07:26
  • Which one? The one I wrote or the one that's in the question? – Sasse Oct 03 '14 at 07:27
  • Okay, that's not a very nice or generic RegEx. What was wrong with ".*?([^.]+\\.[^.]+)" ? – Sasse Oct 03 '14 at 07:29
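
The two suggestions from Sasse's comments can be sketched in Java roughly as follows. The class name `LastTwoLabels` and the method names are made up for illustration, and the loop assumes the intent was to drop the leftmost label (hence the `+ 1` after `indexOf`). Both variants keep only the last two labels, so they cannot tell that "gov.au" is a suffix rather than a domain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class LastTwoLabels {
    // Pattern from the comments: lazily skip a prefix, then capture the last two labels.
    private static final Pattern LAST_TWO = Pattern.compile(".*?([^.]+\\.[^.]+)");

    static String byRegex(String host) {
        String h = host.trim();
        Matcher m = LAST_TWO.matcher(h);
        // matches() anchors at both ends, so group 1 must run to the end of the input.
        return m.matches() ? m.group(1) : h;
    }

    static String byStripping(String host) {
        String s = host.trim();
        // Drop the leftmost label while more than one dot remains.
        while (s.indexOf('.') != s.lastIndexOf('.')) {
            s = s.substring(s.indexOf('.') + 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String[] samples = {"1-A.2-B.3-C.domain.com", "stream.twitch.tv", "www.bom.gov.au"};
        for (String s : samples) {
            System.out.println(s + " -> " + byRegex(s) + " / " + byStripping(s));
        }
    }
}
```

On the sample list this works for the .com, .net and .tv entries, but www.bom.gov.au is reduced to gov.au, which is the case that a suffix lookup is needed for.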
