Matching hrefs within String using Scala

Question

Using this regex (<a[^>]+>.+?<\\/a>) I'm attempting to print the matching links.

I expect t1, t2 and t3 to be printed, but nothing is printed:

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val re = "(<a[^>]+>.+?<\\/a>)".r
for (p <- re findAllIn str) p match {
  case re(b) => print(b)
}

Is the regex itself incorrect, or the way I am applying it?

Update:

Based on the accepted answer, the following code downloads a page (in this case https://news.ycombinator.com/) and collects every link whose href begins with 'http':

import scala.io.Source
import org.jsoup.Jsoup
// JavaConverters replaces the implicit JavaConversions, which was removed in Scala 2.13
import scala.collection.JavaConverters._

object Main extends App {

  val hrefs = getHrefsFromPage("https://news.ycombinator.com/")

  hrefs.foreach(println)

  // Returns (href, link text) pairs for every <a> tag whose href starts with "http".
  def getHrefsFromPage(url: String): List[(String, String)] = {
    // Download the page and let jsoup build a DOM from the raw HTML.
    val doc = Jsoup.parse(Source.fromURL(url).mkString)
    val aTags = doc.select("a").asScala.toList
    val pairs = for (t <- aTags) yield (t.attr("href"), t.text())
    pairs.filter { case (href, _) => href.trim.startsWith("http") }
  }

}

Mandatory SO link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 . Read my answer to see an alternative way of parsing HTML. And please don't parse HTML with regex. HTML is not a regular language, so regex parsing is not reliable (English as actually used is finite, hence regular, so we use regex there). — faizan, Aug 18 '16 at 00:54

score 2 · Accepted Answer · edited May 23 '17 at 12:22

Read this SO Answer first please.

Now coming back.

You need to use a reliable HTML parser library to parse HTML strings; regex won't be enough in most non-trivial cases.

Regex won't get the job done because:

It is error prone. We make mistakes writing regexes all the time, and you are the only verifier and maintainer (a maintenance nightmare).
It is hard to maintain and document.
It is hard to test. You have to think of every possible input string for your regex and then write a test case for each.
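To make the first point concrete, here is a small sketch using only the standard library, run against the string from the question. The object name RegexDemo and the second pattern, hrefOnly, are my own illustrations and not part of the original post. The second pattern assumes double-quoted href values and is not meant as a general solution.

```scala
object RegexDemo {
  // Same input as in the question: three opening <a> tags and no closing </a>.
  val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"

  // The question's pattern requires a literal </a>, which never occurs in str.
  val original = "(<a[^>]+>.+?<\\/a>)".r

  // Hypothetical narrower pattern: captures the double-quoted href value of an opening <a> tag.
  val hrefOnly = "<a[^>]*\\shref=\"([^\"]*)\"".r

  def originalMatches: List[String] = original.findAllIn(str).toList

  def hrefs: List[String] = hrefOnly.findAllMatchIn(str).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    println(originalMatches) // List()
    println(hrefs)           // List(t1, t2, t3)
  }
}
```

The original pattern finds nothing because the input contains no closing tag at all. The narrower pattern happens to work on this input, but it silently misses single-quoted or unquoted attribute values, which is the kind of maintenance problem listed above.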

Why an HTML parser is better:

It is less error prone. It has been verified by many contributors and users, unlike your regex, which only you use and verify.
It is documented on its own site and in its javadoc.
HTML parsing is already tested in the library itself, so you can focus on testing your app's functionality or business use case.
It gives you CSS selectors and a DOM structure to select and manipulate the HTML. (This is the biggest benefit; you will need CSS selector support for any serious HTML work.)
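As a rough illustration of the CSS selector point, the sketch below uses jsoup's attribute-prefix selector a[href^=http] to keep only links whose href starts with "http". The sample markup and the object name SelectorDemo are made up for this example. It assumes jsoup is on the classpath and a Scala version where scala.collection.JavaConverters is available (on 2.13 and later, scala.jdk.CollectionConverters is the preferred import).

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

object SelectorDemo {
  // Hypothetical sample markup, not taken from the question.
  val html =
    """<p><a href="http://example.com/a">A</a>
      |<a href="/relative">B</a>
      |<a href="https://example.com/c">C</a></p>""".stripMargin

  // a[href^=http] selects <a> elements whose href attribute value starts with "http".
  def httpLinks: List[(String, String)] =
    Jsoup.parse(html).select("a[href^=http]").asScala.toList
      .map(a => (a.attr("href"), a.text()))

  def main(args: Array[String]): Unit = httpLinks.foreach(println)
}
```

The selector does the filtering inside the parser, so no separate startsWith check on the extracted strings is needed.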

For these reasons, I suggest using the jsoup HTML parser. Below I describe its usage for your case.

First add the dependency, or just download the jar. The Maven dependency is:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
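If you build with sbt rather than Maven (an assumption on my part, since the question does not name a build tool), the equivalent line would look roughly like this. jsoup is a plain Java library, so a single % is used between the group and the artifact name.

```scala
// build.sbt: same coordinates as the Maven dependency above
libraryDependencies += "org.jsoup" % "jsoup" % "1.9.2"
```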

Next, the imports:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

Now parse your HTML string:

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val doc = Jsoup.parse(str)

This gives:

doc: org.jsoup.nodes.Document =
<html>
 <head></head>
 <body>
  tester
  <a href="t1">this is just test text</a>
  <a href="t2">\r\t\s</a>
  <a href="t3"></a>
 </body>
</html>

Notice that jsoup generated a full document structure from your string and closed the unclosed tags.

Getting all <a> tags

val aTags = doc.select("a")

Result:

aTags: org.jsoup.select.Elements =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>

Getting the string representation of all <a> tags

val aTagsString = aTags.toString

Result:

aTagsString: String =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>

Getting the first (0th) <a> tag

val firstATag = doc.select("a").get(0)

Result:

firstATag: org.jsoup.nodes.Element = <a href="t1">this is just test text</a>

Getting the string representation of the first <a> tag

val firstATagString = firstATag.toString

Result:

firstATagString: String = <a href="t1">this is just test text</a>

Getting inner text of firstATag (0th <a> tag)

val firstATagInnerText = firstATag.text

Result:

firstATagInnerText: String = this is just test text

Notice: even though your tags were not closed, this parser worked fine. Your regex failed on this edge case because the pattern requires a closing </a>, which never appears in the string.
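Since the question asked for t1, t2 and t3 specifically, here is a minimal sketch of pulling out just the href values with jsoup. The object name HrefDemo is hypothetical. It assumes jsoup is on the classpath and uses scala.collection.JavaConverters to turn the Java collection returned by select into a Scala list.

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

object HrefDemo {
  // The string from the question, with its three unclosed <a> tags.
  val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"

  // a[href] selects every <a> element that has an href attribute;
  // attr("href") then reads the attribute value of each one.
  def hrefs: List[String] =
    Jsoup.parse(str).select("a[href]").asScala.toList.map(_.attr("href"))

  def main(args: Array[String]): Unit = println(hrefs.mkString(","))
}
```

Going by the parsed document shown earlier, which contains three separate <a> elements, this should print t1,t2,t3.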

Thanks for the detail. I've posted my solution in the question, based on your code. — blue-sky, Aug 18 '16 at 20:58
