Regex is not the best tool for parsing HTML. Easily made mistakes (such as using greedy .* instead of reluctant .*?) are only a small part of the reason why we should avoid mixing regex with XML/HTML.
- What if attributes are surrounded with ' instead of "?
- What if your HTML contains JavaScript code such as
document.write("<a attr1='foo' attr2='bar'>text</a><br/>")
- What if the element is inside a comment?
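To make these pitfalls concrete, here is a minimal sketch using only java.util.regex. The class name RegexPitfalls, the helper findAll, the two patterns and the sample strings are all made up for illustration; they are not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; every name here is an assumption for illustration.
public class RegexPitfalls {

    // Naive pattern: assumes double-quoted attributes in a fixed order.
    static final Pattern RELUCTANT =
            Pattern.compile("<a attr1=\"(.*?)\" attr2=\"(.*?)\">(.*?)</a>");

    // Same pattern, but with greedy quantifiers.
    static final Pattern GREEDY =
            Pattern.compile("<a attr1=\"(.*)\" attr2=\"(.*)\">(.*)</a>");

    static final String TWO_LINKS =
            "<a attr1=\"x\" attr2=\"y\">one</a> <a attr1=\"x\" attr2=\"y\">two</a>";
    static final String SINGLE_QUOTED = "<a attr1='x' attr2='y'>one</a>";
    static final String COMMENTED = "<!-- <a attr1=\"x\" attr2=\"y\">old</a> -->";

    // Collects every full match of the pattern in the input.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Greedy .* swallows both links as a single match.
        System.out.println(findAll(GREEDY, TWO_LINKS).size());        // 1
        // Reluctant .*? finds the two links separately.
        System.out.println(findAll(RELUCTANT, TWO_LINKS).size());     // 2
        // Single-quoted attributes are missed entirely.
        System.out.println(findAll(RELUCTANT, SINGLE_QUOTED).size()); // 0
        // A link inside an HTML comment is still reported.
        System.out.println(findAll(RELUCTANT, COMMENTED).size());     // 1
    }
}
```

Each case can be patched with a more complicated pattern, but every patch tends to open another hole, which is the point of the list above.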
There are many more, and better, reasons not to use regex on HTML. Instead of regex, you should consider using a proper parser such as jsoup.
This way, your code selecting all a tags that have the attributes attr1, attr2 and attr3 can look like:
Elements elements = doc.select("a[attr1][attr2][attr3]");
Demo:
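import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Demo {
    public static void main(String[] args) {
        String html = "First <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
                + " Second <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
                + " Third <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";
        Document doc = Jsoup.parse(html);
        // select every <a> element that has all three attributes
        Elements elements = doc.select("a[attr1][attr2][attr3]");
        for (Element el : elements) {
            System.out.println(el);                // the whole element
            System.out.println(el.attr("attr1"));  // attribute values
            System.out.println(el.attr("attr2"));
            System.out.println(el.attr("attr3"));
            System.out.println(el.text());         // text inside the element
            System.out.println("--------------");
        }
    }
}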
Output:
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------