Please read this SO answer first, then come back.
You need a reliable HTML parser library to parse HTML strings; regex won't be enough in most non-trivial cases.
Regex won't get the job done because:
- It is error-prone. We make mistakes writing regex all the time, and you are the only verifier and maintainer (a maintenance nightmare).
- It is hard to maintain and document.
- It is hard to test. You have to think of every possible input string for your regex and then write test cases for each.
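To make the points above concrete, here is a minimal sketch of how a plausible-looking pattern breaks. The pattern is hypothetical (it is not the regex from the question); it simply assumes every `<a>` has a matching `</a>`.

```scala
// Hypothetical pattern for illustration only: it assumes each <a> is closed by </a>.
val anchorPattern = """<a[^>]*>(.*?)</a>""".r

// Well-formed input: both anchors are closed.
val closed = "<a href=\"t1\">one</a><a href=\"t2\">two</a>"
// Malformed input: the anchors are never closed.
val unclosed = "tester<a href=\"t1\">this is just test text<a href=\"t2\">"

// Capture group 1 is the text between the opening and closing tags.
val fromClosed = anchorPattern.findAllMatchIn(closed).map(_.group(1)).toList
val fromUnclosed = anchorPattern.findAllMatchIn(unclosed).map(_.group(1)).toList
```

With the well-formed string the pattern finds both anchors, but with the unclosed one it finds nothing at all, and nothing tells you it failed.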
Why an HTML parser is better:
- It is far less error-prone: it has been verified by many contributors and users, unlike your regex, which only you use and verify.
- It is documented on its own site and in its Javadoc.
- HTML parsing is already tested in the library itself, so you can focus on testing your app's functionality or business use case.
- It gives you CSS selectors and a DOM structure to select and manipulate the HTML. (This is the biggest benefit; you will need CSS selector support for any serious HTML work.)
For these reasons, I suggest using the jsoup HTML parser. Below I describe its usage for your case.
First, get the dependency or just download the jar. The Maven dependency is:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
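Since the examples here are in Scala, you may be building with sbt rather than Maven. Assuming that is the case, the equivalent line for `build.sbt` would look roughly like this (same coordinates and version as the Maven snippet):

```scala
// build.sbt: assumes an sbt project; jsoup is a plain Java library, so use a single %
libraryDependencies += "org.jsoup" % "jsoup" % "1.9.2"
```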
Next, the imports:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
Now parse your HTML string:
val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val doc = Jsoup.parse(str)
This gives:
doc: org.jsoup.nodes.Document =
<html>
<head></head>
<body>
tester
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>
</body>
</html>
Notice that jsoup generated a full document structure from your string, with the tags cleaned up and closed.
Getting all <a> tags:
val aTags = doc.select("a")
Result:
aTags: org.jsoup.select.Elements =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>
Getting the string representation of all <a> tags:
val aTagsString = aTags.toString
Result:
aTagsString: String =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>
Getting the first (0th) <a> tag:
val firstATag = doc.select("a").get(0)
Result:
firstATag: org.jsoup.nodes.Element = <a href="t1">this is just test text</a>
Getting the string representation of the first <a> tag:
val firstATagString = firstATag.toString
Result:
firstATagString: String = <a href="t1">this is just test text</a>
Getting the inner text of firstATag (the 0th <a> tag):
val firstATagInnerText = firstATag.text
Result:
firstATagInnerText: String = this is just test text
Notice that even though your tags were not closed, the parser worked fine, while your regex implementation failed on this edge case.
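Since CSS selectors are the main benefit, here is a short sketch of reading the href attribute from every link in the same string. It assumes jsoup is on the classpath; the names `html`, `links`, and `hrefs` are just illustrative.

```scala
import org.jsoup.Jsoup

// Same input as above: three unclosed <a> tags.
val html = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"

// The CSS selector "a[href]" matches only <a> elements that carry an href attribute.
val links = Jsoup.parse(html).select("a[href]")

// Elements is a java.util.List, so index-based access works without collection converters.
val hrefs = (0 until links.size).map(i => links.get(i).attr("href"))
```

Index-based access is used only to keep the sketch independent of the Scala version; with the Java/Scala collection converters you could map over `links` directly.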