I am experimenting with regular expressions in Java, in particular with capturing groups. I am trying to strip empty tags from a string containing XML. Without groups everything works, but as soon as I define the regex with groups it behaves in a way I don't understand. I expect the result shown in the last assertion in the code below:
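// Imports assumed for this snippet (JUnit 4 style; adjust for your JUnit version)
import static org.junit.Assert.assertEquals;
import java.util.regex.Pattern;
import org.junit.Test;

@Test
public void testRegexpGroups() {
    String xml =
        "<root>\n" +
        " <yyy></yyy>\n" +
        " <yyy>456</yyy>\n" +
        " <aaa> \n\n" +
        " </aaa>\n" +
        "</root>";

    // A: groups, with backreferences \1 (whitespace) and \2 (tag name)
    Pattern patternA = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");
    // B: groups, with backreference \2 only
    Pattern patternB = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>\\s*</(\\2)>");
    // C: no groups at all
    Pattern patternC = Pattern.compile("\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>");

    // patternA removes only "<yyy></yyy>" and leaves the surrounding whitespace and <aaa>
    assertEquals(
        "<root>\n" +
        " \n" +
        " <yyy>456</yyy>\n" +
        " <aaa> \n" +
        "\n" +
        " </aaa>\n" +
        "</root>",
        patternA.matcher(xml).replaceAll("")
    );
    // patternB removes both empty elements together with their leading whitespace
    assertEquals(
        "<root>\n" +
        " <yyy>456</yyy>\n" +
        "</root>",
        patternB.matcher(xml).replaceAll("")
    );
    // patternC gives the same (desired) result as patternB
    assertEquals(
        "<root>\n" +
        " <yyy>456</yyy>\n" +
        "</root>",
        patternC.matcher(xml).replaceAll("")
    );
}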
I get the result I want with this regex (patternC): "\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>", but I don't understand why I can't get the same result with patternA: "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>"
Please explain the difference in behavior between these regular expressions.
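A reduced case that isolates the behavior, in case it helps. The class name `BackrefDemo`, the helper `matchesFully`, and the sample strings are made up for illustration. It is only a sketch, based on the `Pattern` documentation describing `\n` as a back-reference to "whatever the nth capturing group matched", and it compares `\\1` against a second, independent `\\s*`:

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    // \1 has to repeat the exact text that group 1 captured
    static final Pattern BACKREF = Pattern.compile("(\\s*)<a>\\1</a>");
    // here the second \s* is matched independently of the first one
    static final Pattern REPEATED = Pattern.compile("\\s*<a>\\s*</a>");

    // hypothetical helper: true if the whole input matches the pattern
    static boolean matchesFully(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String same = " <a> </a>";       // one space before, one space inside
        String different = " <a>\n</a>"; // space before, newline inside

        System.out.println(matchesFully(BACKREF, same));       // true
        System.out.println(matchesFully(BACKREF, different));  // false
        System.out.println(matchesFully(REPEATED, same));      // true
        System.out.println(matchesFully(REPEATED, different)); // true
    }
}
```

If that reading is right, it would fit the patternA result in the test: the whitespace before `<aaa>` is not identical to the whitespace inside it, so `(\\1)` cannot match there, while `<yyy></yyy>` only matches with an empty group 1.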