Regex to match a fully qualified hostname or URL with optional https

Question

There are two possible strings in a log file:

1) "some text then https://myhost.ab.us2.myDomain.com and then some more text"

OR:

2) "some text then myhost.ab.us2.myDomain.com and then some more text"

The "myDomain.com" is constant, so we can look for that hard-coded in the regex.

In both cases, they are not at the start of the line, but in the middle.

Need to extract "myhost" out of the line, if it matches.

I've tried positive look behind using "https://" OR "\\s{1}". The https:// by itself works:

Matcher m = Pattern.compile("https://(.+?)\\.(.+?)\\.(.+?)\\.myDomain\\.com\\s").matcher(input);

I want to add an "or" in there so it matches with "https://" or "<space>" ("https://|\\s{1}"), but it always grabs the entire string up to the start of the first space.

For now, I've settled on splitting the string into String[] and checking if it contains "myDomain". I worked so long on this I wanted to learn what the best answer is.

When you say _"Need to extract "myhost" out of the line"_ what are you trying to get as a result? A `String` that contains "myhost" (or whatever else the hostname might be)? e.g. `String name = extractNameFrom(logLine);` ? — Stephen P, May 21 '20 at 00:06

ggorlen · Answer 1 · 2020-05-21T15:05:27.017

I'd use something like

\b(?:https?:\/\/)?(\w+)\.(?:\w+\.)*myDomain\.com

This matches an optional https:// prefix followed by your host which is captured, followed by some other subdomains (you could specify how many with {2} or hardcode them in, if you know it's always ab.us2), then myDomain.com.

In Java 10 (`var` needs Java 10, `Matcher.results()` needs Java 9):

import java.util.Arrays;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        var text = "some text then https://myhost.ab.us2.myDomain.com " + 
                   "and then some more text some text then " +
                   "myhost.ab.us2.myDomain.com and then some more text";
        var pat = "\\b(?:https?://)?(\\w+)\\.(?:\\w+\\.)*myDomain\\.com";
        var matches = Pattern.compile(pat)
            .matcher(text)
            .results()
            .map((m) -> m.group(1))
            .toArray(String[]::new);
        System.out.println(Arrays.toString(matches)); // => [myhost, myhost]
    }
}

In Java 8:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "some text then https://myhost.ab.us2.myDomain.com " + 
                      "and then some more text some text then " +
                      "myhost.ab.us2.myDomain.com and then some more text";
        String pat = "\\b(?:https?://)?(\\w+)\\.(?:\\w+\\.)*myDomain\\.com";
        Matcher matcher = Pattern.compile(pat).matcher(text);

        while (matcher.find()) {
            System.out.println(matcher.group(1)); // prints "myhost" once per match (two lines here)
        }
    }
}
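
The stricter variant mentioned above (exactly two labels between the host and myDomain.com, using {2}) could be sketched roughly as follows. This is only an illustration that works on Java 8; the class name `StrictHostDemo` and the helper `extractHosts` are made up for the example, and it assumes the middle part always has two labels such as ab.us2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class StrictHostDemo {
    // Same pattern as above, but (?:\w+\.){2} requires exactly two labels
    // (e.g. "ab." and "us2.") between the captured host and myDomain.com.
    private static final Pattern HOST =
        Pattern.compile("\\b(?:https?://)?(\\w+)\\.(?:\\w+\\.){2}myDomain\\.com");

    // Hypothetical helper: collects group 1 (the host) of every match in the line.
    static List<String> extractHosts(String line) {
        List<String> hosts = new ArrayList<>();
        Matcher m = HOST.matcher(line);
        while (m.find()) {
            hosts.add(m.group(1));
        }
        return hosts;
    }

    public static void main(String[] args) {
        // With the https:// prefix
        System.out.println(extractHosts("some text then https://myhost.ab.us2.myDomain.com and more")); // [myhost]
        // Without the prefix
        System.out.println(extractHosts("some text then myhost.ab.us2.myDomain.com and more")); // [myhost]
        // No labels between host and domain, so the strict pattern does not match
        System.out.println(extractHosts("some text then myhost.myDomain.com and more")); // []
    }
}
```

If the number of intermediate labels can vary, the original `*` quantifier is probably the safer choice.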

Could not get this to compile as-is. We're using JDK 8, and the Matcher class doesn't have a "results" method. Not sure why you put in "var" instead of the actual classes? — kmantic, May 21 '20 at 12:40
I ran this on Java 10 [here](https://repl.it/repls/OutrageousForkedNumerators) but I also added a Java 8 version to the post. You can read about the [`var` keyword](https://stackoverflow.com/questions/3443858/what-is-the-equivalent-of-the-c-sharp-var-keyword-in-java). — ggorlen, May 21 '20 at 15:10

maio290 · Answer 2 · 2020-05-21T00:27:09.347

I just put in a non-regex approach:

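public static String extractHost(String logEntry, String domain)
{
    // Matching is case sensitive. If that is a problem, lower-case BOTH
    // logEntry and domain before searching; lower-casing only one of them
    // makes indexOf(domain) miss a mixed-case domain such as "myDomain.com".

    int protocolIndex = logEntry.indexOf("https://");

    if(protocolIndex != -1)
    {
        // contains protocol, must be variant one
        int hostStart = protocolIndex + 8; // 8 == "https://".length()

        // look for the dot after the protocol, not the first dot in the whole line
        int hostEnd = logEntry.indexOf(".", hostStart);
        return hostEnd == -1 ? null : logEntry.substring(hostStart, hostEnd);
    }

    //  has to be variant two
    int domainIndex = logEntry.indexOf(domain);

    if(domainIndex == -1) return null;

    int previousDotIndex = -1;

    // walk backwards from the domain to the preceding space, remembering the
    // left-most dot seen so far (the one right after the host)
    for(int i = domainIndex; i >= 0; i--)
    {
        if(logEntry.charAt(i) == '.') previousDotIndex = i;
        if(logEntry.charAt(i) == ' ')
        {
            // no dot between the space and the domain means there is no host part
            return previousDotIndex == -1 ? null : logEntry.substring(i + 1, previousDotIndex);
        }
    }

    return null;
}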

Variant #2 is actually the more difficult one: you iterate backwards from the domain's index to the first whitespace found, storing the position of the most recent dot seen along the way. Then it's just a simple substring.
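
The same backwards search can also be sketched with `String.lastIndexOf` instead of a hand-written loop. Note that this swaps the loop for library calls and handles both variants in one path, so it is an alternative take rather than the code above; the class name `HostScanDemo` is made up for the example, and it assumes the hostname is delimited by a space (or the start of the line) on its left.

```java
// Hypothetical demo class, not part of the original answer.
class HostScanDemo {
    // Returns the first label of the hostname ending in the given domain,
    // or null when the line contains no such hostname.
    static String extractHost(String logEntry, String domain) {
        // Require a dot in front of the domain so that there is a host part at all.
        int domainIndex = logEntry.indexOf("." + domain);
        if (domainIndex == -1) {
            return null;
        }
        // Token start: one past the last space before the domain (0 if there is none).
        int start = logEntry.lastIndexOf(' ', domainIndex) + 1;
        // Skip an optional scheme prefix (variant one).
        if (logEntry.startsWith("https://", start)) {
            start += "https://".length();
        }
        // The host ends at the first dot after the token start.
        int firstDot = logEntry.indexOf('.', start);
        return firstDot > start ? logEntry.substring(start, firstDot) : null;
    }

    public static void main(String[] args) {
        String domain = "myDomain.com";
        System.out.println(extractHost("some text then https://myhost.ab.us2.myDomain.com and then some more text", domain)); // myhost
        System.out.println(extractHost("some text then myhost.ab.us2.myDomain.com and then some more text", domain)); // myhost
        System.out.println(extractHost("some text without the domain", domain)); // null
    }
}
```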
