I am attempting to retrieve the sizes of various websites whose URLs are passed to my script, but I'm not getting an exception when I pass an invalid URL; instead I simply get a very small page. I'm using Source.fromURL, and I get the following results:

thisIsClearlyABoggusURLThatCantPossiblyLeadAnyway 1052
www.bbc.co.uk 113871

The first one, as it says, shouldn't have anything in it, but it does. My script is as follows:

import scala.actors.Futures.{awaitAll, future}
import scala.io.Source

def main( args:Array[String] ){
    val tasks = for(arg <- args) yield future {
        try {
            println(arg + " " + Source.fromURL( attachPrefix(arg) ).length)
        } catch {
            case e : java.net.UnknownHostException => println(arg + " *")
        }
    }

    awaitAll(20000L, tasks: _*)
}

def attachPrefix( url:String ) = url.slice(0, 4) match {
    case "http" => url
    case "www." => "http://" + url
    case _ => "http://www." + url
}

Each argument is passed to the function attachPrefix to ensure it has the necessary prefix before being used. This problem has only come about since I started passing the URL in as a parameter instead of mapping attachPrefix over the args, which is what I was doing earlier with

args map attachPrefix

What's the difference between the two, and why is my current one giving such behaviour?
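
To make the effect of the prefixing concrete, here is a minimal sketch that repeats the attachPrefix logic from the script above and prints the URL that would actually be requested for each input. The wrapper object `PrefixDemo` is a hypothetical name added only so the snippet compiles on its own.

```scala
object PrefixDemo {
  // Same logic as attachPrefix in the script above.
  def attachPrefix(url: String): String = url.slice(0, 4) match {
    case "http" => url
    case "www." => "http://" + url
    case _      => "http://www." + url
  }

  def main(args: Array[String]): Unit = {
    val inputs = Seq(
      "thisIsClearlyABoggusURLThatCantPossiblyLeadAnyway",
      "www.bbc.co.uk",
      "https://example.com" // hypothetical extra input, not from the question
    )
    // Print each argument next to the URL that would be fetched.
    inputs.foreach(in => println(in + " -> " + attachPrefix(in)))
  }
}
```

Note that the bogus name falls through to the last case, so the request goes to http://www.thisIsClearlyABoggusURLThatCantPossiblyLeadAnyway.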

  • You can use [java's approach](http://stackoverflow.com/q/2230676/298389) – om-nom-nom Feb 27 '13 at 20:54
  • Thanks a lot for the suggestion. I didn't know about that one :) In this case however, I need it to be in pure Scala. –  Feb 27 '13 at 20:56
  • `scala.io.Source.fromURL("http://www.thisIsClearlyABoggusURLThatCantPossiblyLeadAnyway")` throws `java.net.UnknownHostException`. I am wondering what your code that retrieves the size is doing exactly? Do you have a `toString` in your code maybe – and are actually retrieving the length of the exception text, maybe...? – Hbf Feb 27 '13 at 21:48
  • @Hbf, after reading your comment I played around with the code and found out where the problem comes from, but I don't have any idea what's causing it. I've updated my question with the full code. –  Feb 28 '13 at 21:00

1 Answer

You can build a java.net.URI from the string first and then pass it on to Source.fromURL (via uri.toURL, since fromURL accepts a String or a java.net.URL). Creating a URI will effectively validate the URL's syntax, as documented here. However, in this case, the URL http://www.thisIsClearlyABoggusURLThatCantPossiblyLeadAnyway is valid as far as the URI is concerned. On the other hand, the UrlValidator suggested by om-nom-nom considers it invalid, because the top-level domain segment has more than 4 characters, a rule that is already out of date. I don't know of any pure-Scala validation libraries, or why this would be a requirement, but you could try using a regular expression for validation. For example, this will catch your example, because the top-level domain exceeds 6 letters:

val re = """^(https?://)?(([\w!~*'().&=+$%-]+: )?[\w!~*'().&=+$%-]+@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([\w!~*'()-]+\.)*([\w^-][\w-]{0,61})?[\w]\.[a-z]{2,6})(:[0-9]{1,4})?((/*)|(/+[\w!~*'().;?:@&=+$,%#-]+)+/*)$""".r
re.pattern.matcher("http://google.com").matches // true
re.pattern.matcher("http://www.thisIsClearlyABoggusURLThatCantPossiblyLeadAnyway").matches // false
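
For comparison, the URI-based check mentioned at the start could look roughly like the sketch below. The names `UriCheck`, `parseUri`, and `pageSize` are hypothetical, and the sketch assumes that a syntax check is all you want from the URI step: it only rejects malformed strings, so the bogus host from the question still passes and would fail later, at fetch time.

```scala
import java.net.{URI, URISyntaxException}
import scala.io.Source

object UriCheck {
  // Hypothetical helper: returns the parsed URI, or None when the
  // string violates URI syntax (for example, it contains a space).
  def parseUri(s: String): Option[URI] =
    try Some(new URI(s))
    catch { case _: URISyntaxException => None }

  // Hypothetical helper: fetches the page and counts its characters.
  // This performs a network request and can still throw, for example
  // java.net.UnknownHostException when the host does not resolve.
  def pageSize(uri: URI): Int = {
    val src = Source.fromURL(uri.toURL)
    try src.length finally src.close()
  }
}
```

A caller might chain the two as `UriCheck.parseUri(attachPrefix(arg)).map(UriCheck.pageSize)`, still wrapping the fetch in the existing try/catch for UnknownHostException.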
yakshaver