
Using this regex (<a[^>]+>.+?<\\/a>) I'm attempting to print the matching links.

I expect t1, t2 and t3 to be printed, but nothing is printed:

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val re = "(<a[^>]+>.+?<\\/a>)".r
for (p <- re findAllIn str) p match {
  case re(b) => print(b)
}

Is the regex itself wrong, or is the way I am applying it incorrect?

Update :

Based on the accepted answer, the following code fetches all hrefs that begin with 'http' from a URL, in this case https://news.ycombinator.com/:

import scala.io.Source
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

object Main extends App {

  // Print every (href, link text) pair found on the page.
  val hrefs = getHrefsFromPage("https://news.ycombinator.com/")

  hrefs.foreach(println)

  def getHrefsFromPage(url: String): List[(String, String)] = {
    // Download the page and let Jsoup build a DOM from it.
    val doc = Jsoup.parse(Source.fromURL(url).mkString)
    // Elements is a java.util.List, so convert it to a Scala List explicitly.
    val aTags = doc.select("a").asScala.toList
    // Pair each link's href with its visible text.
    val ts = for (t <- aTags) yield (t.attr("href"), t.text)
    // Keep only the links whose href starts with "http".
    ts.filter(f => f._1.trim.startsWith("http"))
  }

}
blue-sky
Mandatory SO link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 . Read my answer to see an alternative way of parsing HTML. And please don't parse HTML with regex. HTML is not a regular language, so regex parsing is not reliable (the English language as used is finite, hence regular, so we use regex there). – faizan Aug 18 '16 at 00:54

1 Answer


Please read this SO answer first.

Now, coming back to your question.

You need a reliable HTML parser library to parse HTML strings; regex won't be enough in most non-trivial cases.

Regex won't get the job done because:

  • It is error-prone: we make mistakes writing regexes all the time, and you are the only verifier and maintainer (a maintenance nightmare).
  • It is hard to maintain and document.
  • It is hard to test: you have to think of every possible input string for your regex and then write a test case for each one.

Why an HTML parser is better:

  • It is less error-prone: it has been verified by multiple contributors and users, unlike your regex, which only you use and verify.

  • It is documented on its own site and in its Javadoc.

  • HTML parsing is already tested in the library itself, so you can focus on testing your app's functionality or business use case.

  • It gives you CSS selectors and a DOM structure to select and manipulate the HTML. (This is the biggest benefit; you will need CSS selector support for any serious HTML work.)

For these reasons, I suggest using the Jsoup HTML parser. Below I describe its usage for your case.

First get the dependency or just download the jar. Maven dependency as below:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
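
If the project is built with sbt rather than Maven (an assumption on my part, since the question is in Scala), the equivalent line in build.sbt would be the following, using the same coordinates and version as the Maven snippet above:

```scala
// build.sbt: same group, artifact and version as the Maven dependency above
libraryDependencies += "org.jsoup" % "jsoup" % "1.9.2"
```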

Next, the imports:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

Now parse your HTML string:

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val doc = Jsoup.parse(str)

What this gives:

doc: org.jsoup.nodes.Document =
<html>
 <head></head>
 <body>
  tester
  <a href="t1">this is just test text</a>
  <a href="t2">\r\t\s</a>
  <a href="t3"></a>
 </body>
</html>

Notice that Jsoup generated the full document structure and closed the open tags from your string.

Getting all <a> tags

val aTags = doc.select("a")

Result:

aTags: org.jsoup.select.Elements =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>

Getting the string representation of all <a> tags

val aTagsString = aTags.toString

Result:

aTagsString: String =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>

Getting the first (0th) <a> tag

val firstATag = doc.select("a").get(0)

Result:

firstATag: org.jsoup.nodes.Element = <a href="t1">this is just test text</a>

Getting string representation of first <a> tag

val firstATagString = firstATag.toString

Result:

firstATagString: String = <a href="t1">this is just test text</a>

Getting inner text of firstATag (0th <a> tag)

val firstATagInnerText = firstATag.text

Result:

firstATagInnerText: String = this is just test text
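
Since the question asks for t1, t2 and t3 to be printed, the href attributes can be read in the same way. This is a minimal sketch, assuming the same input string as above and a Scala version where `scala.collection.JavaConverters` is available (2.11 to 2.13); the name `hrefs` is my own and is not part of Jsoup:

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val doc = Jsoup.parse(str)

// Elements is a java.util.List[Element]; convert it to a Scala collection,
// then read the href attribute of each <a> element.
val hrefs = doc.select("a").asScala.map(_.attr("href")).toList

hrefs.foreach(println)
```

Given the three <a> elements shown in the parsed document above, this should print t1, t2 and t3, one per line.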

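Notice: even though your tags were not closed, this parser worked fine, while your regex failed on this edge case. Your pattern requires a closing </a>, and your test string does not contain one, so findAllIn has nothing to return.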
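
The failure of the regex on this input can be checked directly. This is a small sketch using only the standard library; the `closed` string is a made-up variant of the question's input with every tag closed, and the value names are mine:

```scala
// Same pattern as in the question.
val re = "(<a[^>]+>.+?<\\/a>)".r

// The question's input: there is no </a> anywhere, so the pattern cannot match.
val unclosed = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"

// Hypothetical variant in which every <a> tag is closed.
val closed = "tester<a href=\"t1\">this is just test text</a><a href=\"t2\">x</a>"

val unclosedMatches = re.findAllIn(unclosed).toList
val closedMatches = re.findAllIn(closed).toList

println(unclosedMatches.isEmpty) // true
println(closedMatches.size)      // 2
```

So the regex itself is applied correctly; it simply depends on well-formed input, which is exactly what an HTML parser does not require.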

faizan