Extract links from web page in Scala

Question

I want to extract the links from a web page to build a tree of links that share the same host. I found this regular expression, but it doesn't work very well. Do you have any ideas?

val r = "<a href =\"([^\"]*)\"".r

I looked at this page too, but it's in Java, and the "translation" to Scala is pretty hard to get right:

Extract links from a web page

OK thanks, installed it. I'll try to do something with that and will come back if needed. Still wondering what a good regex for my problem would be! — Tetra, Jun 06 '18 at 15:51

score 0 · Answer 1 · answered Jun 22 '23 at 09:08

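I used the following version. It matches any absolute http, https, ftp, or file URL anywhere in the text, not only the ones inside href attributes, and it does not pick up relative links. Keep in mind that without full DOM parsing, results won't be very reliable.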

def extractLinks(htmlContent: String): List[String] = {
  // Absolute URLs with an http(s), ftp, or file scheme. The final character
  // class excludes trailing punctuation such as '.', ',', ';', ':', '!' and '?'.
  val urlPattern =
    "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r

  // findAllIn yields each full match as a String, in document order.
  urlPattern.findAllIn(htmlContent).toList
}
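For the "same host" part of the question, here is a rough sketch rather than a definitive solution. It reads the href value of each a tag (double or single quotes, optional spaces around the equals sign), resolves it against the page URL with java.net.URI, and keeps only links whose host matches. The names LinkFilter, hrefs, and sameHostLinks are made up for this example, and the base URL is assumed to be the address the HTML was fetched from. It is still regex-based, so unusual markup (unquoted attributes, links built by scripts) will be missed.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper for illustration; not part of any library.
object LinkFilter {
  // href="..." or href='...' inside an <a ...> tag, case-insensitive,
  // tolerating whitespace around '='. Unquoted values are not handled.
  private val Href =
    """(?i)<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)')""".r

  // Raw href values, in document order.
  def hrefs(html: String): List[String] =
    Href.findAllMatchIn(html).map { m =>
      // Group 1 is the double-quoted form, group 2 the single-quoted one.
      Option(m.group(1)).getOrElse(m.group(2))
    }.toList

  // Absolute links that point to the same host as `base`.
  def sameHostLinks(html: String, base: String): List[String] = {
    val baseUri = new URI(base)
    hrefs(html)
      // Resolve relative links; skip values that are not valid URIs.
      .flatMap(h => Try(baseUri.resolve(h.trim)).toOption)
      // Opaque URIs such as mailto: have no host and are dropped here.
      .filter(u => u.getHost != null && u.getHost.equalsIgnoreCase(baseUri.getHost))
      .map(_.toString)
      .distinct
  }
}
```

Calling LinkFilter.sameHostLinks(html, pageUrl) on each fetched page gives the children of that page in the link tree; keeping a set of already visited URLs avoids cycles.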
