
I am trying to find all occurrences (there could be zero or more) of anchor (<a>) HTML tags with specific attributes and text, which should be captured as groups.

Regex: <a\s+.*attr1="myattr".*attr2="(.+)".*attr3="(.+)".*>(.+)</a>

Input String:

First <a attr1="myattr" attr2="value12" attr3="value13">text1</a> Second <a attr1="myattr" attr2="value12" attr3="value13">text1</a> Third <a attr1="myattr" attr2="value12" attr3="value13">text1</a>

Outcome: Only one match is found. The match starts at the first occurrence of "<a", which is correct, but it ends at the last occurrence of "</a>". Expected outcome: 3 matches, each with 3 groups (the parenthesized parts of the regex).

  • Are there any special reasons why you want to use regex instead of a proper HTML parser like jsoup? – Pshemo Mar 28 '15 at 11:04

3 Answers


Regex quantifiers are greedy by default, so .* matches as many characters as possible. Use .*? instead to get non-greedy (reluctant) matching.
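As a minimal sketch of that fix, the pattern from the question can be rewritten with every .* and .+ made reluctant. The class name ReluctantAnchorRegex and the helper findAnchors are hypothetical names used only for illustration; the pattern and the input string come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantAnchorRegex {

    // Same pattern as in the question, with every .* and .+ made reluctant (.*? and .+?).
    static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+.*?attr1=\"myattr\".*?attr2=\"(.+?)\".*?attr3=\"(.+?)\".*?>(.+?)</a>");

    // Hypothetical helper: returns one {attr2, attr3, text} triple per match.
    static List<String[]> findAnchors(String input) {
        List<String[]> results = new ArrayList<>();
        Matcher m = ANCHOR.matcher(input);
        while (m.find()) {
            results.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "First <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
                + " Second <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
                + " Third <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";

        // Prints one line per anchor found in the input.
        for (String[] groups : findAnchors(input)) {
            System.out.println(String.join(", ", groups));
        }
    }
}
```

Reluctant quantifiers can still run across tag boundaries when one anchor lacks an attribute that a later anchor has, so narrower classes such as [^"]* for attribute values and [^>]* inside the tag are usually a safer choice.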

user4098326

As stated by user4098326, the problem is greediness. Since you used plenty of .* and .+ quantifiers, each of them consumes as many characters as possible, up to the end of the string, and only gives characters back when the rest of the pattern would otherwise fail.

The <a\s+.* part (more specifically, the .*) consumes all characters up to the last appearance of attr1="myattr". The remainder of the string then satisfies the remainder of the expression.
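The behaviour described above can be observed with a small sketch that runs the question's pattern unchanged. The class name GreedyAnchorRegex and the helper summarize are hypothetical names introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyAnchorRegex {

    // The pattern from the question, unchanged: every .* and .+ is greedy.
    static final Pattern GREEDY = Pattern.compile(
            "<a\\s+.*attr1=\"myattr\".*attr2=\"(.+)\".*attr3=\"(.+)\".*>(.+)</a>");

    // Hypothetical helper: returns {matchCount, startOfFirstMatch, endOfFirstMatch};
    // start and end stay -1 when nothing matches.
    static int[] summarize(String input) {
        Matcher m = GREEDY.matcher(input);
        int count = 0;
        int start = -1;
        int end = -1;
        while (m.find()) {
            if (count == 0) {
                start = m.start();
                end = m.end();
            }
            count++;
        }
        return new int[] {count, start, end};
    }

    public static void main(String[] args) {
        String input = "First <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
                + " Second <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
                + " Third <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";

        int[] s = summarize(input);
        System.out.println("matches: " + s[0]);
        // The single match runs from the first "<a" to the end of the last "</a>".
        System.out.println("matched text: " + input.substring(s[1], s[2]));
    }
}
```

On the question's input this reports a single match whose captured groups all come from the third anchor, because the first .* has already skipped past the first two.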

thst

Regex is not the best tool for parsing HTML. Easily made mistakes, such as using greedy .* instead of reluctant .*?, are only a small part of the reason to avoid mixing regex with XML/HTML. Consider also:

  • What if attributes are surrounded with ' instead of "?
  • What if your HTML contains JavaScript code such as document.write("<a attr1='foo' attr2='bar'>text</a><br/>")?
  • What if the element is inside a comment?
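Two of these cases can be reproduced with a short sketch. It uses the reluctant form of the question's pattern; the class name RegexHtmlPitfalls, the helper countMatches and the sample inputs are all made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexHtmlPitfalls {

    // Reluctant variant of the question's pattern; it only knows about double quotes.
    static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+.*?attr1=\"myattr\".*?attr2=\"(.+?)\".*?attr3=\"(.+?)\".*?>(.+?)</a>");

    // Hypothetical helper: counts how many times the pattern matches in the given HTML.
    static int countMatches(String html) {
        Matcher m = ANCHOR.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Valid HTML with single-quoted attributes: the regex misses the tag entirely.
        String singleQuoted = "<a attr1='myattr' attr2='value12' attr3='value13'>text1</a>";
        System.out.println("single-quoted anchors found: " + countMatches(singleQuoted));

        // A commented-out anchor is not part of the rendered page, yet the regex reports it.
        String commented = "<!-- <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a> -->";
        System.out.println("commented-out anchors found: " + countMatches(commented));
    }
}
```

The pattern could be extended to handle each case, but every extension makes it harder to read and still leaves other cases uncovered.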

There are many more, and better, reasons to avoid parsing HTML with regex. Instead of regex, consider using a proper parser like jsoup.

With jsoup, the code that selects all a tags having the attributes attr1, attr2 and attr3 can look like this:

Elements elements = doc.select("a[attr1][attr2][attr3]");

Demo:

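// Imports needed by this snippet (jsoup must be on the classpath).
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "First <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
        + " Second <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
        + " Third <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";

// Parse the fragment, then select every <a> that has all three attributes.
Document doc = Jsoup.parse(html);
Elements elements = doc.select("a[attr1][attr2][attr3]");

// Print each element, its attribute values and its text.
for (Element el : elements) {
    System.out.println(el);
    System.out.println(el.attr("attr1"));
    System.out.println(el.attr("attr2"));
    System.out.println(el.attr("attr3"));
    System.out.println(el.text());
    System.out.println("--------------");
}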

Output:

<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
Pshemo