I'm trying to extract the domain + subdomain from any URL (without the full URL suffix or the http and www prefixes).
I have the following list of inputs, each with the result I expect:
p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com
I'm using the following regex to extract domain + subdomain:
[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?
The issue is that it splits several domains into two matches, such as d.amazon.ca -> d.ama + zon.ca, and matches some non-domain text, such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions, as seen in the image below:
How can I force the regex to be greedy in the sense that it matches the full domain as a single match?
I'm using Java.
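For illustration, here is a minimal Java sketch of one possible direction. It is not a complete solution. It assumes each URL is handled as its own string (rather than scanning a block of text), so the pattern can be anchored at the start. It consumes the optional http(s):// and www. prefixes, then lets a greedy repeated group take every dot-terminated label before a final label of 1 to 3 letters. The class name `DomainExtractor` and the method `extract` are hypothetical names chosen for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; assumes one URL per input string.
final class DomainExtractor {

    // ^                       anchor at the start of the input
    // (?:https?://)?          optional scheme, not captured
    // (?:www\.)?              optional www. prefix, not captured
    // (?:[a-zA-Z0-9-]+\.)+    one or more labels, each followed by a dot (greedy)
    // [a-zA-Z]{1,3}           final label limited to 1-3 letters, as in the original regex
    private static final Pattern DOMAIN = Pattern.compile(
            "^(?:https?://)?(?:www\\.)?((?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{1,3})");

    static Optional<String> extract(String url) {
        Matcher m = DOMAIN.matcher(url);
        // find() with the ^ anchor yields at most one match per input,
        // so the domain cannot be split into several matches.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "p.io", "d.amazon.ca", "domain.amazon.co.uk",
            "https://regex101.com/", "www.regex101.comdddd", "www.wix.com.co"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + extract(input).orElse("(no match)"));
        }
    }
}
```

Because the match is anchored and the prefixes are consumed outside the capturing group, path segments such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions are never reached. The 1 to 3 letter limit on the last label is carried over from the original regex and would truncate longer TLDs such as .info.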