Find multiple matches using Regex

Question

I am trying to find all occurrences (there could be zero or more) of anchor (<a>) HTML tags with specific attributes and text, which should be captured as groups.

Regex: <a\s+.*attr1="myattr".*attr2="(.+)".*attr3="(.+)".*>(.+)</a>

Input String:

First <a attr1="myattr" attr2="value12" attr3="value13">text1</a> Second <a attr1="myattr" attr2="value12" attr3="value13">text1</a> Third <a attr1="myattr" attr2="value12" attr3="value13">text1</a>

Outcome: Only one match is found. Its start index is the first occurrence of "<a", which is correct, but its end index is at the last occurrence of "</a>". Expected outcome: 3 matches, each with 3 groups (specified by the parentheses in the regex).

Is there any special reason why you want to use regex instead of a proper HTML parser like jsoup? – Pshemo, Mar 28 '15 at 11:04

score 3 · Accepted Answer · answered Mar 28 '15 at 10:59

Regex quantifiers are greedy by default, so .* will match as many characters as possible. Use .*? (and .+? inside the capturing groups) for non-greedy (reluctant) matching instead.
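
A minimal sketch of that fix, assuming Java (suggested by the jsoup recommendation elsewhere in this thread) and java.util.regex. The class name ReluctantMatchDemo and the method findAll are illustrative, not from the original post. Every quantifier in the question's pattern is made reluctant, and Matcher.find() is called in a loop to collect each match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantMatchDemo {
    // The question's pattern with every quantifier made reluctant (.*? and .+?).
    static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+.*?attr1=\"myattr\".*?attr2=\"(.+?)\".*?attr3=\"(.+?)\".*?>(.+?)</a>");

    // Returns one String[3] per match: {attr2 value, attr3 value, link text}.
    static List<String[]> findAll(String input) {
        List<String[]> results = new ArrayList<>();
        Matcher m = ANCHOR.matcher(input);
        while (m.find()) { // each call resumes after the previous match
            results.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return results;
    }

    public static void main(String[] args) {
        String tag = "<a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";
        String input = "First " + tag + " Second " + tag + " Third " + tag;
        for (String[] groups : findAll(input)) {
            System.out.println(String.join(" | ", groups));
        }
    }
}
```

Reluctant quantifiers can still run past a tag boundary when a tag lacks one of the attributes. Negated character classes such as [^"]* for attribute values and [^>]* between attributes are a stricter alternative.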

user4098326

Thanks for your help. Non-greedy mode helped fix the issue. – user2568887 Mar 28 '15 at 14:06

score 2 · Answer 2 · answered Mar 28 '15 at 11:04

As stated by user4098326, the problem is greediness. Since you used several greedy .* and .+ quantifiers, each one consumes as many characters as possible, up to the end of the string, and backtracks only as far as needed for the rest of the pattern to match.

In <a\s+.*, the .* consumes all characters up to the last occurrence of attr1="myattr". The remainder of the string then satisfies the remainder of the expression.
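
To see this behaviour directly, here is a small sketch, assuming Java's java.util.regex; the class GreedyMatchDemo and its methods are hypothetical names chosen for illustration. It runs the question's original greedy pattern and reports how many matches there are and where the first one starts and ends:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyMatchDemo {
    // The pattern exactly as posted in the question (all quantifiers greedy).
    static final Pattern GREEDY = Pattern.compile(
            "<a\\s+.*attr1=\"myattr\".*attr2=\"(.+)\".*attr3=\"(.+)\".*>(.+)</a>");

    // Counts how many times the greedy pattern matches in the input.
    static int countMatches(String input) {
        Matcher m = GREEDY.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns {start, end} of the first match, or null if there is none.
    static int[] firstMatchSpan(String input) {
        Matcher m = GREEDY.matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        String tag = "<a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";
        String input = "First " + tag + " Second " + tag + " Third " + tag;
        int[] span = firstMatchSpan(input);
        System.out.println("matches: " + countMatches(input));
        System.out.println("span: " + span[0] + ".." + span[1] + " of " + input.length());
    }
}
```

On the question's input this reports a single match that starts at the first <a and ends at the end of the string, that is, after the last </a>.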

score 1 · Answer 3 · edited May 23 '17 at 11:56

Regex is not the best tool for parsing HTML. Easily made mistakes (such as using greedy .* instead of reluctant .*?) are only a small part of the reason to avoid mixing regex with XML/HTML.

What if attributes are surrounded with ' instead of "?
What if your HTML contains JavaScript code such as document.write("<a attr1='foo' attr2='bar'>text</a><br/>")?
What if the element is inside a comment?
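
The first of these pitfalls is easy to reproduce. The sketch below assumes Java's java.util.regex and uses the reluctant variant of the question's pattern; the class name QuoteStylePitfallDemo and the method found are made up for this illustration. The same tag is matched when its attributes use double quotes and missed when they use single quotes:

```java
import java.util.regex.Pattern;

class QuoteStylePitfallDemo {
    // Reluctant variant of the question's pattern; it hard-codes double quotes.
    static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+.*?attr1=\"myattr\".*?attr2=\"(.+?)\".*?attr3=\"(.+?)\".*?>(.+?)</a>");

    // True if the pattern matches anywhere in the given HTML fragment.
    static boolean found(String html) {
        return ANCHOR.matcher(html).find();
    }

    public static void main(String[] args) {
        String doubleQuoted = "<a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";
        String singleQuoted = "<a attr1='myattr' attr2='value12' attr3='value13'>text1</a>";
        System.out.println("double-quoted matched: " + found(doubleQuoted));
        System.out.println("single-quoted matched: " + found(singleQuoted));
    }
}
```

Widening the pattern to accept both quote styles is possible, but each such case makes the expression harder to read and maintain.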

There are many more, and better, reasons to avoid parsing HTML with regex. So instead of regex, you should consider using a proper parser like jsoup.

This way, code that selects all a tags with the attributes attr1, attr2 and attr3 can look like this:

Elements elements = doc.select("a[attr1][attr2][attr3]");

Demo:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "First <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
        + " Second <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>"
        + " Third <a attr1=\"myattr\" attr2=\"value12\" attr3=\"value13\">text1</a>";

// Parse the fragment, then select every <a> that has all three attributes.
Document doc = Jsoup.parse(html);
Elements elements = doc.select("a[attr1][attr2][attr3]");

for (Element el : elements) {
    System.out.println(el);                // the element's outer HTML
    System.out.println(el.attr("attr1"));
    System.out.println(el.attr("attr2"));
    System.out.println(el.attr("attr3"));
    System.out.println(el.text());         // the text inside the tag
    System.out.println("--------------");
}

Output:

<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
<a attr1="myattr" attr2="value12" attr3="value13">text1</a>
myattr
value12
value13
text1
--------------
