I want to extract the links from a web page to build a tree of links that share the same host. I found this regular expression, but it doesn't work very well. Do you have any ideas?

val r = "<a href =\"([^\"]*)\"".r

I also looked at this page, but it's in Java, and the "translation" to Scala is pretty hard to work out.

Extract links from a web page

Tetra

1 Answer

I used the following version. Keep in mind that without full DOM parsing, results won't be very reliable.

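import scala.collection.mutable.ArrayBuffer
import java.util.regex.Pattern

// Returns every absolute URL (http, https, ftp or file) found anywhere in the text,
// not only the ones inside <a href="..."> attributes.
def extractLinks(htmlContent: String): List[String] = {
  val result = ArrayBuffer[String]()
  // Scheme, "://", then URL characters; the last character class excludes
  // trailing punctuation such as '.' or ',' so it is not captured.
  val regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
  val p = Pattern.compile(regex)
  val m = p.matcher(htmlContent)
  // Collect each successive match in document order.
  while (m.find()) {
    result += m.group()
  }
  result.toList
}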
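
If you only need the links from anchor tags that stay on the same host, as the question asks, a rough sketch along these lines might help. The names `hrefPattern`, `extractHrefs` and `sameHostLinks` are made up for illustration, and the page URL is assumed to be known to the caller. It is still regex-based, so the reliability caveat above applies here too.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper pattern: captures the href value of each <a ...> tag,
// allowing other attributes before href and whitespace around '='.
val hrefPattern = """(?i)<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']""".r

// Returns the raw href values in document order.
def extractHrefs(html: String): List[String] =
  hrefPattern.findAllMatchIn(html).map(_.group(1)).toList

// Resolves each href against the page URL and keeps only links on the same host.
def sameHostLinks(pageUrl: String, html: String): List[String] = {
  val base = new URI(pageUrl)
  extractHrefs(html)
    .flatMap(href => Try(base.resolve(href.trim)).toOption) // skip malformed hrefs
    .filter(uri => uri.getHost != null && uri.getHost.equalsIgnoreCase(base.getHost))
    .map(_.toString)
    .distinct
}
```

Relative links are resolved against the page URL before the host comparison, so `/about` and `page.html` count as same-host links, while `mailto:` links and links to other hosts are dropped. Calling `sameHostLinks` again on each returned URL (after downloading that page, which is not shown here) would give the children for the next level of the tree.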
R A