Java regex expression for matching domain

Question

I am trying to work out a regex pattern which checks for the presence of a domain followed by / followed by any characters. For example, the string https://example.com/ is fine for me, but I want to invalidate the string https://example.com/xyz, as it has the domain followed by a path.

Currently I have come up with a pattern for checking that a string starts with https and is followed by any characters: https://(.*). But I have been unable to work out a pattern for the aforementioned scenario.

Thanks in advance for your inputs :)

why use regex? length of substring after last `/`? Create URL object and then getQuery? — Scary Wombat, Oct 17 '19 at 06:34
Well, think about what a domain looks like and what it can contain: characters, dashes, digits and dots - I might have forgotten something, but at least they don't contain slashes. So try `http(s)?://[^/]*/?` — Thomas, Oct 17 '19 at 06:37
@PavelSmirnov yeah I put that in the 2nd para in my question — Trooper, Oct 17 '19 at 06:42
Possible duplicate: https://stackoverflow.com/questions/4093806/regexp-to-match-domain-and-subdomains-in-java — i.bondarenko, Oct 17 '19 at 07:06
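
As a non-regex alternative along the lines of the first comment, a minimal sketch using `java.net.URI` might look like the following. It inspects the path rather than the query. The class and method names (`UriPathCheck`, `isHttpsDomainOnly`) are made up for illustration, and it assumes that an empty path and a lone `/` both count as "no path".

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriPathCheck {
    // Hypothetical helper: true for an https URL whose path is empty or just "/".
    static boolean isHttpsDomainOnly(String s) {
        try {
            URI uri = new URI(s);
            String path = uri.getPath(); // null for opaque URIs, "" when no path is present
            return "https".equalsIgnoreCase(uri.getScheme())
                    && uri.getHost() != null
                    && (path == null || path.isEmpty() || path.equals("/"));
        } catch (URISyntaxException e) {
            // Strings that are not syntactically valid URIs are rejected outright.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpsDomainOnly("https://example.com/"));    // true
        System.out.println(isHttpsDomainOnly("https://example.com/xyz")); // false
    }
}
```

This sketch ignores any query string or fragment; whether those should also invalidate the string is not stated in the question.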

score 0 · Accepted Answer · answered Oct 17 '19 at 06:45

You should anchor the pattern so that the string starts with http(s):// and may end with a /, with no other / in between:

^http(s)?://[^/]*/?$
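
A minimal sketch of how this pattern could be used from Java, assuming `Matcher.matches()` is an acceptable way to apply it; the class and method names are hypothetical. Note that `[^/]*` does not check that the host is a well-formed domain, only that no further `/` appears.

```java
import java.util.regex.Pattern;

class DomainOnlyCheck {
    // Pattern from the answer: scheme, a run of non-slash characters, optional final slash.
    private static final Pattern DOMAIN_ONLY = Pattern.compile("^http(s)?://[^/]*/?$");

    // Hypothetical helper: true when nothing follows the optional trailing slash.
    static boolean isDomainOnly(String url) {
        return DOMAIN_ONLY.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDomainOnly("https://example.com/"));    // true
        System.out.println(isDomainOnly("https://example.com"));     // true
        System.out.println(isDomainOnly("https://example.com/xyz")); // false
    }
}
```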

Vengleab SO

Cool, this looks good. I actually want to validate strings starting with https, so would this be fine: `https://[^/]*/?$`? – Trooper Oct 17 '19 at 06:51
Ran this through a matcher and works fine for `https://example.com/` but not for `https://example.com/xyz` – Ambro-r Oct 17 '19 at 07:16

score 0 · Answer 2 · answered Oct 17 '19 at 06:55

See RFC 3986 Appendix B (https://www.ietf.org/rfc/rfc3986.txt):

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.
 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
  12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
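
To apply this to the question, one possible sketch (not part of the RFC) is to read group 5, the path, and accept only an empty path or a lone `/`. The class and method names are hypothetical, and the extra checks on scheme and authority are assumptions about what the asker wants. Query and fragment are ignored here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Rfc3986PathCheck {
    // The Appendix B expression, with backslashes escaped for a Java string literal.
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Hypothetical helper: https scheme, non-empty authority, and no path beyond "/".
    static boolean isHttpsDomainOnly(String uri) {
        Matcher m = URI_REFERENCE.matcher(uri);
        if (!m.matches()) {
            return false;
        }
        String scheme = m.group(2);    // null when no scheme is present
        String authority = m.group(4); // null when no "//" part is present
        String path = m.group(5);      // always set on a match, possibly empty
        return "https".equals(scheme)
                && authority != null && !authority.isEmpty()
                && (path.isEmpty() || path.equals("/"));
    }

    public static void main(String[] args) {
        System.out.println(isHttpsDomainOnly("https://example.com/"));    // true
        System.out.println(isHttpsDomainOnly("https://example.com/xyz")); // false
    }
}
```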

Ambro-r · Answer 3 · 2019-10-17T07:40:22.917

0

I would approach this in two steps. First, I would match the domain with the following regex pattern:

http(s)?://(?:[\w0-9](?:[\w0-9-]{0,61}[\w0-9])?\.)+[\w0-9][\w0-9-]{0,61}[\w0-9](/)?

Once you have the domain, I would take the substring that follows it; if anything more than a "/" follows (e.g. "/xyz"), invalidate the string as per your requirement.

For example:

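    String urlString = "https://example.com/";
    // Matches the scheme, the dot-separated domain labels, and an optional trailing slash.
    String regex = "http(s)?://(?:[\\w0-9](?:[\\w0-9-]{0,61}[\\w0-9])?\\.)+[\\w0-9][\\w0-9-]{0,61}[\\w0-9](/)?";
    // Splitting on the matched prefix leaves whatever follows the domain, if anything.
    String[] url = urlString.split(regex);
    // split() drops trailing empty strings, so a bare domain yields an empty array.
    if (url.length > 1) {
        System.out.println(urlString + " has a path.");
    } else {
        System.out.println(urlString + " does not have a path.");
    }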

edited Oct 17 '19 at 07:40

answered Oct 17 '19 at 06:59

Ambro-r

Do you realize that `http[s]` will match "https" _only_ and is the same as just `https`? – Thomas Oct 17 '19 at 07:28
Finger trouble, it should have been `(s)?` to make it a capturing group that matches zero or one time. Updated. – Ambro-r Oct 17 '19 at 07:41
@Thomas, though if you are really being pedantic, neither this expression nor any of the others proposed will work if the `http` contains any uppercase characters, so ideally `.toLowerCase()` should be applied to the `urlString` first. – Ambro-r Oct 17 '19 at 07:45
Well, that's true :) - One could make the expression case-insensitive though, e.g. by prepending `(?i)` ;) – Thomas Oct 17 '19 at 07:52
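
A small sketch of the `(?i)` suggestion, applied for brevity to the shorter `^http(s)?://[^/]*/?$` pattern from the accepted answer rather than to the longer pattern in this answer; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CaseInsensitiveCheck {
    // The embedded flag (?i) makes the whole expression case-insensitive,
    // so no toLowerCase() call is needed on the input.
    private static final Pattern DOMAIN_ONLY = Pattern.compile("(?i)^http(s)?://[^/]*/?$");

    // Hypothetical helper mirroring the domain-only check, ignoring case.
    static boolean isDomainOnly(String url) {
        return DOMAIN_ONLY.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDomainOnly("HTTPS://Example.COM/"));    // true
        System.out.println(isDomainOnly("HTTPS://Example.COM/xyz")); // false
    }
}
```

Passing `Pattern.CASE_INSENSITIVE` as the second argument to `Pattern.compile` should behave the same way for ASCII input.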

score 0 · Answer 4 · answered Oct 17 '19 at 06:59

Here is a regex that matches only the URLs you want to accept, so anything it rejects is a URL you need to invalidate:

^https?:\/\/(www\.)?([^:\/\n?]+)\/?$

Hope this helps!

Lahiru Udana

score -1 · Answer 5 · answered Oct 17 '19 at 07:09

Please try the regex below; it might solve your issue:

http(s?)://[[a-zA-z]+\\.*\\/

pks

`[A-z]` [matches more than just ASCII letters](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret/29771926#29771926). – Wiktor Stribiżew Oct 17 '19 at 08:55
