I'm trying to extract the domain and subdomain from any URL, without the path that follows the host and without the `http(s)://` or `www.` prefix.

I have the following inputs and expected outputs:

p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com

I'm using the following regex to extract domain + subdomain:

[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?

The issue is that it splits several domains into two matches (for example, d.amazon.ca -> d.ama + zon.ca) and also matches non-domain text such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions, as seen in the image below:

enter image description here

How can I force the regex to be greedy in the sense that it matches the full domain as a single match?

I'm using Java.

  • Do you absolutely need to use a single regex? If your input is always a URL, this is easier without regex. – Ry- Mar 27 '22 at 04:30
  • You can use [java.net.URL](https://docs.oracle.com/javase/7/docs/api/java/net/URL.html); it has a `getHost()` method that returns the host you want. Remember to wrap the call in a try-catch. – Bagus Tesa Mar 27 '22 at 04:50
  • @Ry- _Do you absolutely need to use a single regex_ Yes, since the text may not be formatted properly and may contain gibberish at the beginning or end, so we would need a regex to extract it. – Agustin Netto Mar 27 '22 at 05:00
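
For input that really is free text with junk around the host, the regex approach can be sketched as below. This is a minimal sketch, not a definitive pattern: the class name `HostFromText`, the method `extractHost`, and the choice to accept any alphabetic final label of 2 to 63 characters are assumptions made for illustration. Each label is matched as a whole, so a host such as d.amazon.ca is not split, and text without a dot (such as the hyphenated path segment) is not matched at all.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostFromText {
    // One DNS label: 1 to 63 letters, digits, or hyphens,
    // neither starting nor ending with a hyphen.
    private static final String LABEL =
        "[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?";

    // Optional scheme and "www." prefix (both outside the capture group),
    // then one or more "label." parts followed by an alphabetic final label.
    private static final Pattern HOST = Pattern.compile(
        "(?:https?://)?(?:www\\.)?((?:" + LABEL + "\\.)+[a-zA-Z]{2,63})");

    // Returns the first host-like token found in the text, if any.
    static Optional<String> extractHost(String text) {
        Matcher m = HOST.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "### d.amazon.ca ###",
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean",
            "what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions"
        };
        for (String input : inputs) {
            System.out.println(extractHost(input).orElse("hostname not found"));
        }
    }
}
```

Note that this pattern does not check whether the final label is a registered TLD, so `www.regex101.comdddd` yields `regex101.comdddd` rather than `regex101.com`.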

1 Answer

I'd use the standard URI class instead of a regular expression to parse out the domain:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class Demo {
    private static Optional<String> getHostname(String domain) {
        try {
            // Add a scheme if missing
            if (domain.indexOf("://") == -1) {
                domain = "https://" + domain;
            }
            URI uri = new URI(domain);
            return Optional.ofNullable(uri.getHost()).map(s -> s.startsWith("www.") ? s.substring(4) : s);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] domains = new String[] {
            "p.io",
            "amazon.com",
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.regex101.comdddd", // .comdddd is (potentially) a valid TLD; not sure why your output removes the d's
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions",
            "smile.amazon.com"
        };
        for (String domain : domains) {
            System.out.println(getHostname(domain).orElse("hostname not found"));
        }
    }
}

outputs

p.io
amazon.com
d.amazon.ca
domain.amazon.co.uk
regex101.com
regex101.comdddd
wix.com.co
stackoverflow.com
smile.amazon.com
Shawn
  • @Shawn Thanks for your response. _.comdddd is (potentially) a valid TLD; not sure why your output removes the d's_ -- is **regex101.comdddd** also valid, or just **regex101.dddd** (without the com part)? – Agustin Netto Mar 27 '22 at 05:21
  • @AgustinNetto .dddd is also a potential TLD. (There are over 1,500 of them registered and counting.) – Shawn Mar 27 '22 at 05:27
  • But when I enter `www.regex101.comdddddddddddddd` it extracts `www.regex101.comdddddddddddddd`. Are all of these valid TLDs? It doesn't make sense. It essentially extracts **any** value after the `.` – Agustin Netto Mar 27 '22 at 05:47
  • Well, yes. That's how domain names work. – Shawn Mar 27 '22 at 05:50
  • Hmm, interesting. I would be curious to see the source for this. When I go to [Wikipedia](https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#D) I don't see `dddddddddddddd`, `comdddddddddddddd` or even `dddd` as one of the options. Even if it were somehow "legitimate", I don't think it would be used in practice, so I would need to filter it out somehow. `co.uk` or `com.uk` is fine, but not `.comddddddddd`. Maybe this is why I need regex. – Agustin Netto Mar 27 '22 at 05:55
  • A / separates the name from the start of the path. – Shawn Mar 27 '22 at 05:56
  • @AgustinNetto, that would be moving the goalposts from extracting the domain from a URL (using regex) to validating TLDs. Validating a TLD requires your app to actually ask a domain name resolver whether that particular domain exists. Even regex won't solve your `.comdddd` problem: we have `.info` too, and your 3-character limit won't capture it properly. RFC 1034 says a label can be up to 63 characters long, so `.comdddddddddd` is a valid one. Though, if you insist, you can just use `commons-validator`, specifically its domain validator. – Bagus Tesa Mar 27 '22 at 11:26
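
The filtering idea discussed in these comments can be sketched as an allow-list check on the last label of an already extracted host. This is only an illustration under stated assumptions: the class `TldFilter`, the method `hasKnownTld`, and the tiny hard-coded `KNOWN_TLDS` set are hypothetical; a real application would presumably load the full published TLD list or rely on a library such as the `commons-validator` domain validator mentioned above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TldFilter {
    // Illustrative subset only (assumption): not the real list of registered TLDs.
    private static final Set<String> KNOWN_TLDS =
        new HashSet<>(Arrays.asList("com", "io", "ca", "uk", "co", "info"));

    // True if the text after the last dot is in the allow-list.
    static boolean hasKnownTld(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return false; // no dot at all, or nothing after the last dot
        }
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_TLDS.contains(tld);
    }

    public static void main(String[] args) {
        String[] hosts = {"regex101.com", "regex101.comdddd", "wix.com.co", "example.info"};
        for (String host : hosts) {
            System.out.println(host + " -> " + hasKnownTld(host));
        }
    }
}
```

With this subset, `regex101.comdddd` is rejected while `example.info` is accepted, which a fixed 1-to-3 character limit in the regex would not achieve. The check only says the suffix is on the list; it does not confirm that the domain itself exists.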