I want to extract the domain name from a URI.

For example, the input to the regular expression may be any of the following:

  1. test.net
  2. https://www.test.net
  3. https://test.net
  4. http://www.test.net
  5. http://test.net

In all of these cases, the result should be test.net.

Below is the code I implemented for this purpose:

    val re = "([http[s]?://[w{3}\\.]?]+)(.*)".r

But I didn't get the expected result. Below is my output:

    val re(prefix, domain) = "https://www.test.net"
    prefix: String = https://www.t
    domain: String = est.net

What is the problem with my regular expression, and how can I fix it?

user51
  • The dot after 'www' should be escaped. Also, you have square brackets around the whole thing before the plus sign – user Dec 13 '19 at 22:17
  • okay I've updated it still the same error – user51 Dec 13 '19 at 22:21
  • And you're still using square brackets where you should use parentheses. The square brackets only match 1 of those chars, while parens match the entire group. I don't understand your regex but this should at least get you a bit further: "(http(s)?://(w{3}\\.)+?)([^.]*)" – user Dec 13 '19 at 22:32
  • still same error for your regular expression above ```val re(prefix, domain) = "https://www.test.net" prefix: String = https://www.t domain: String = est.net``` – user51 Dec 13 '19 at 22:36
  • So your domain name is just everything after "www." right? yes – user51 Dec 13 '19 at 22:40
  • You don't need regex for this. See [this question](https://stackoverflow.com/questions/17736681/how-to-parse-or-split-url-address-in-java) and [this code sample](https://rosettacode.org/wiki/URL_parser#Scala) to see how it can be done using java's URL / URI parser – jrook Dec 13 '19 at 22:41
  • Does this answer your question? [Parse a URI String into Name-Value Collection](https://stackoverflow.com/questions/13592236/parse-a-uri-string-into-name-value-collection) – jrook Dec 13 '19 at 23:15
  • @jrook - No. I'm not looking at that solution. I got solution in answers that I'm expecting. – user51 Dec 14 '19 at 00:03

1 Answer

what is problem with my regular expression and how can I fix it?

You are using a character class:

    [http.?://(www.)?]

This means:

  • either an h
  • or a t
  • or a t
  • or a p
  • or a .
  • or a ?
  • or a :
  • or a /
  • or a /
  • or a (
  • or a w
  • or a w
  • or a w
  • or a .
  • or a )
  • or a ?

It does not include an s, so it will not match https://.

It is not clear to me why you are using a character class here, nor why you are using duplicate characters in the class.
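
To make the difference concrete, here is a minimal sketch contrasting a character class with groups. The two patterns and the value names are illustrative assumptions, not taken from the question; the class is a simplified stand-in that contains the same kind of characters.

```scala
val input = "https://www.test.net"

// Character class: everything inside [...] is an alternative for ONE position.
// With +, this matches any run of the characters h, t, p, s, :, /, w and '.'.
val asClass = "[htps:/w.]+".r

// Groups: (...) matches its contents as a sequence; ? makes the preceding item optional.
val asGroup = "https?://(www\\.)?".r

// The class keeps consuming characters until it hits one outside the set ('e'),
// so it swallows the leading 't' of "test.net" as well.
val classPrefix = asClass.findPrefixOf(input) // Some("https://www.t")

// The group version stops exactly after the optional "www." part.
val groupPrefix = asGroup.findPrefixOf(input) // Some("https://www.")
```

This reproduces the `https://www.t` / `est.net` split reported in the question: the trailing `t` is consumed only because `t` happens to be one of the characters in the class.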

Ideally, you shouldn't try to parse URIs yourself; someone else has already done the hard work. You could, for example, use the java.net.URI class:

    import java.net.URI

    val u1 = new URI("test.net")
    u1.getHost
    // res: String = null

    val u2 = new URI("https://www.test.net")
    u2.getHost
    // res: String = www.test.net

    val u3 = new URI("https://test.net")
    u3.getHost
    // res: String = test.net

    val u4 = new URI("http://www.test.net")
    u4.getHost
    // res: String = www.test.net

    val u5 = new URI("http://test.net")
    u5.getHost
    // res: String = test.net

Unfortunately, as you can see, what you want to achieve does not actually comply with the official URI syntax.
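
One possible way to bridge that gap is to normalize the input before handing it to `java.net.URI`. The sketch below assumes that a missing scheme may be replaced by `http://` and that a leading `www.` should always be dropped; `domainOf` is a hypothetical helper name, and neither assumption comes from any specification.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper: prepend a scheme when none is present so that
// java.net.URI parses the text as a host rather than as a path,
// then drop a leading "www." from the host.
def domainOf(input: String): Option[String] = {
  val withScheme = if (input.contains("://")) input else s"http://$input"
  Try(new URI(withScheme)).toOption   // None if the text is not a valid URI
    .flatMap(u => Option(u.getHost))  // getHost returns null when there is no host
    .map(_.stripPrefix("www."))
}

val inputs = List(
  "test.net",
  "https://www.test.net",
  "https://test.net",
  "http://www.test.net",
  "http://test.net"
)

// Under the assumptions above, each input yields Some("test.net").
val results = inputs.map(domainOf)
```

Unlike a permissive regex, this rejects text such as `"hello. Good morning"`, because the URI parser does not accept the space.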

If you can fix that, then you can use java.net.URI. Otherwise, you will need to go back to your old solution and parse the URI yourself:

    // The dot after "www" is escaped so that it matches only a literal '.'
    val re = "(?>https?://)?(?>www\\.)?([^/?#]*)".r

    val re(domain1) = "test.net"
    //=> domain1: String = test.net

    val re(domain2) = "https://www.test.net"
    //=> domain2: String = test.net

    val re(domain3) = "https://test.net"
    //=> domain3: String = test.net

    val re(domain4) = "http://www.test.net"
    //=> domain4: String = test.net

    val re(domain5) = "http://test.net"
    //=> domain5: String = test.net
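
Note that a `val re(domain) = ...` definition throws a `scala.MatchError` when the whole input does not match the pattern, for example when the URI has a path. A minimal sketch of a safer variant follows; `extractDomain` is a hypothetical helper name, and the pattern is the one from this answer with the dot after `www` escaped.

```scala
// Pattern from the answer, with the dot after "www" escaped.
val re = "(?>https?://)?(?>www\\.)?([^/?#]*)".r

// Hypothetical helper: returns None instead of throwing when the input
// does not match the pattern as a whole.
def extractDomain(input: String): Option[String] = input match {
  case re(domain) => Some(domain)
  case _          => None
}

val ok      = extractDomain("https://www.test.net")        // Some("test.net")
val noMatch = extractDomain("https://test.net/index.html") // None: the path is not covered
```

Whether inputs with a path, query, or fragment should be rejected or have that part ignored depends on the parsing rules you actually want.
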
Jörg W Mittag
  • Except the first case (which is just two strings with a `.` between them) all others can be acquired using `URI` + a check to remove the beginning `www.`. this regex will match "hello. Good morning" while URI will not allow that. – jrook Dec 14 '19 at 00:54
  • The problem is that the OP expects in all cases the domain part of the host part of the URI to be `test.net`. However, that is actually only true for cases #2 and #4, where the host is `www` and the domain is indeed `test.net`. In cases #3 and #5, the host is `test` and the domain is just `net`, and in case #1, the URI doesn't even have a host part at all, it only has a path. So, trying to parse this with a URI parser doesn't work, because the OP's parsing rules are *different* from RFC 2396. – Jörg W Mittag Dec 14 '19 at 04:00
  • Since the OP's parsing rules do not follow any official specification, and the OP hasn't given their parsing rules, who is to say that "hello. Good morning" *isn't* a valid URI according to their rules? – Jörg W Mittag Dec 14 '19 at 04:04