I have tried the regex from this question: how to get domain name from URL

But it does not find the domain name. Here is my implementation:

    val Names = """.*([^\.]+)(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$""".r
    val s = Names.findFirstIn("www.google.com")
    s match {
    case Some(name) =>
        println(name)
    case None =>
        println("No name value")
    }

"No name value" is consistently printed to stdout. Is there an issue with the regex or with my Scala implementation?

blue-sky

3 Answers

I fixed the regex by adding an escaped dot (\.) before the extension. Also, since you need the capture group that interests you (group 1), you should use findFirstMatchIn instead of findFirstIn, which only returns the whole matched string.

val Names = """([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$""".r
val s = Names.findFirstMatchIn("www.google.com")
s match {
  case Some(name) =>
    println(name)
    println(name.group(1))
  case None =>
    println("No name value")
}

Prints:

google.com
google
Names: scala.util.matching.Regex = ([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
s: Option[scala.util.matching.Regex.Match] = Some(google.com)

EDITED: sorry I misread your question. I rewrote the answer.
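
As a minimal sketch of the difference between the two methods: findFirstIn yields only the whole matched text, while findFirstMatchIn yields a Match whose groups can be read. The object name DomainDemo, the helper domainOf, and the shortened suffix list are illustrative assumptions, not part of the original code.

```scala
import scala.util.matching.Regex

object DomainDemo {
  // Shortened suffix list, for illustration only.
  val Names: Regex = """([^.]+)\.(com|net|org|co\.uk|uk)$""".r

  // Hypothetical helper: the label directly before the suffix, if any.
  def domainOf(host: String): Option[String] =
    Names.findFirstMatchIn(host).map(_.group(1))

  def main(args: Array[String]): Unit = {
    println(Names.findFirstIn("www.google.com")) // Some(google.com): whole match only
    println(domainOf("www.google.com"))          // Some(google): capture group 1
    println(domainOf("localhost"))               // None: no known suffix
  }
}
```

Returning an Option keeps the "no match" case explicit for the caller instead of printing inside the match.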

Stephane Godbillon

I would use Scala 2.10's string interpolation feature:

implicit class Regex(sc: StringContext) {
  def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
}

scala> "www.google.co.uk" match {
      case  r"(.*?)$sld([^.]+)$domain\.(com|net|org|co\.uk)$tld" => (sld,domain,tld)
      case _ => ???
    }
res61: (String, String, String) = (www.,google,co.uk)

The drawback of this approach is that every capturing group must be bound to a variable. To avoid binding a group, explicitly make it non-capturing (a group that starts with ?:):

r".*?([^.]+)$domain\.(?:com|net|org|co\.uk)"

The parentheses around the first group can simply be dropped, as with the leading .*? above.

You can also drop the fallback case of the pattern match if you are sure that the input strings always match (otherwise a MatchError is thrown):

scala> val r".*?([^.]+)$domain\.(?:com|net|org|co\.uk)" = "www.google.com"
domain: String = google
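
If the input is not guaranteed to match, a small wrapper can return an Option instead of throwing. This is only a sketch: it uses a plain Regex rather than the r interpolator so that it is self-contained, and the names SafeExtract and domainOf are hypothetical.

```scala
import scala.util.matching.Regex

object SafeExtract {
  // Same shape as the pattern above, written as a plain Regex.
  val Domain: Regex = """.*?([^.]+)\.(?:com|net|org|co\.uk)""".r

  // Hypothetical helper: None for non-matching input instead of a MatchError.
  def domainOf(host: String): Option[String] = host match {
    case Domain(name) => Some(name)
    case _            => None
  }

  def main(args: Array[String]): Unit = {
    println(domainOf("www.google.com")) // Some(google)
    println(domainOf("localhost"))      // None
  }
}
```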
kiritsuku
scala> val Names = """.*?([^\.]+)\.(?:com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)""".r
Names: scala.util.matching.Regex = .*?([^\.]+)\.(?:com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)

scala> val Names( primary ) = "www.google.com"
primary: String = google

Changes:

  • Note the ? after the initial .*: a greedy .* swallows everything up to the last character before .com, leaving only e for the group, so make it lazy.
  • Add \. between the group you want and the (com|net...) section, since you expect a dot to be the boundary there.
  • You don't want the (com|net...) section to define a capturing group, so use (?:...) rather than just (...).
  • I removed the $ at the end. It is redundant here, because a regex used as an extractor pattern must match the whole input anyway.
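
The first and last points can be checked with a short sketch. The object name GreedyVsLazy, the helper capture, and the shortened suffix list are assumptions made for illustration.

```scala
import scala.util.matching.Regex

object GreedyVsLazy {
  // Shortened suffix list, for illustration only.
  val Greedy: Regex = """.*([^.]+)\.(?:com|net|org)""".r
  val Lazy: Regex   = """.*?([^.]+)\.(?:com|net|org)""".r

  // Hypothetical helper: applies a regex as an extractor, which requires
  // the whole input to match, and returns the single capture group.
  def capture(re: Regex, host: String): Option[String] = host match {
    case re(name) => Some(name)
    case _        => None
  }

  def main(args: Array[String]): Unit = {
    println(capture(Greedy, "www.google.com"))     // Some(e)
    println(capture(Lazy, "www.google.com"))       // Some(google)
    // No trailing $ needed: text after the suffix already prevents a match.
    println(capture(Lazy, "www.google.community")) // None
  }
}
```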

Good luck!

Steve Waldman