
I am experimenting with regular expressions in Java, in particular with capturing groups. I am trying to strip empty tags from an XML string. Without groups everything works, but as soon as I define the regex with groups and back references, I get behavior I don't understand. I expect all three patterns to behave like the last assertion in the code below:

    @Test
    public void testRegexpGroups() {
        String xml =
            "<root>\n" +
                "    <yyy></yyy>\n" +
                "    <yyy>456</yyy>\n" +
                "    <aaa>  \n\n" +
                "    </aaa>\n" +
                "</root>";
        Pattern patternA = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");
        Pattern patternB = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>\\s*</(\\2)>");
        Pattern patternC = Pattern.compile("\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>");


        assertEquals(
            "<root>\n" +
            "    \n" +
            "    <yyy>456</yyy>\n" +
            "    <aaa>  \n" +
            "\n" +
            "    </aaa>\n" +
            "</root>",
            patternA.matcher(xml).replaceAll("")
        );

        assertEquals(
            "<root>\n" +
                "    <yyy>456</yyy>\n" +
                "</root>",
            patternB.matcher(xml).replaceAll("")
        );

        assertEquals(
            "<root>\n" +
                "    <yyy>456</yyy>\n" +
                "</root>",
            patternC.matcher(xml).replaceAll("")
        );
    }

I get the expected result with this regex: "\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>", but I don't understand why I can't get the same result with "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>". Please explain the difference in behavior between these regular expressions.

  • I think in general it is not a good idea to parse xml or html with regex. Maybe that is also the source of your problem? See: https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg and https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – team17 Aug 19 '20 at 10:54
  • Can you explain what output you see and what you find confusing about it? – Joni Aug 19 '20 at 10:57
  • It is good that you included something we can run ourselves, but still: don't expect people to download and run your code. Tell us what you expect to happen and what actually happens. Please understand: running your code and trying to figure out what you are asking A) costs plenty of time and B) leaves room for misunderstanding. We really don't know what exactly you want to do, or why. So be precise about that instead of just writing "please explain the differences". – GhostCat Aug 19 '20 at 11:07
  • @Joni, I expect behavior like the last assertion. I can get it if I use this regex: ``"\\s*<\\s*\\w+\\s*>\\s*</(\\s*\\w+\\s*)>"``, but I don't understand why I can't do the same with ``"(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>"`` – Beardless Monk Aug 19 '20 at 11:08
  • First occurrence: `\\1` must repeat exactly what the first `(\\s*)` captured, so the pattern can only match there when `\\s*` captured the empty string (`\1` is empty). Likewise for the second occurrence. Use a back reference only for the tag name: `<\\s*(\\w+)[^>]*>...</\\s*\\1\\s*>`. – Joop Eggen Aug 19 '20 at 11:11
  • Please NEVER put such information into comments. Always update your question instead. – GhostCat Aug 19 '20 at 11:25
  • @JoopEggen thanks a lot. One more question: why should ``[^>]*`` come before the first ``>``? – Beardless Monk Aug 19 '20 at 11:26
  • You just did not consider attributes like in ``. Was automatically written. – Joop Eggen Aug 19 '20 at 11:30

1 Answer


In regular expressions, \1 and \2 are called back references. They look for the same text that was matched previously by a capturing group. They enable you to write regular expressions that for example detect duplicated letters and words.

For example, (\w+)\1 matches a run of word characters immediately followed by the same text again. With String.matches the whole string has to fit the pattern:

    "banana".matches("(\\w+)\\1")   // ==> false

    "banabana".matches("(\\w+)\\1") // ==> true: bana is repeated
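To illustrate the duplicated-word use case mentioned above, here is a minimal sketch. The class name `DuplicateWordDemo`, the helper `firstDoubledWord`, and the pattern itself are made up for this illustration; they are not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWordDemo {
    // Illustrative pattern: a whole word, some whitespace, then the same word again.
    // \1 refers back to whatever text group 1 captured.
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first word that is immediately repeated, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is is a test")); // is
        System.out.println(firstDoubledWord("no repeats here"));   // null
    }
}
```

Unlike `String.matches`, `Matcher.find` searches inside the text, which is why the word boundaries `\b` are needed to avoid matching the "is" inside "this".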

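In your regexp "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>", the back reference \1 requires the text between the opening and closing tags to be exactly the same whitespace that group 1 captured before the opening tag. For <yyy></yyy> the content is empty, so the pattern can only match when group 1 is empty too, and the indentation before the tag is left behind. For <aaa> the content "  \n\n    " never equals the whitespace in front of the tag, so nothing matches at all. If any whitespace should be allowed between the tags, use \s* there instead of \1, as your patternB does.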
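The effect can be checked on small inputs. The sketch below uses your pattern A unchanged, plus a variant that keeps a back reference only for the tag name, along the lines suggested in the comments. The class name `BackReferenceDemo`, the helper methods, and the `NAME_ONLY` pattern are assumptions made for this example, and the variant ignores attributes.

```java
import java.util.regex.Pattern;

public class BackReferenceDemo {
    // Pattern A from the question: \1 must repeat the whitespace captured by group 1.
    static final Pattern PATTERN_A =
        Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");

    // Illustrative variant: any whitespace around the tags, back reference on the name only.
    static final Pattern NAME_ONLY =
        Pattern.compile("\\s*<\\s*(\\w+)\\s*>\\s*</\\s*\\1\\s*>");

    static String stripA(String xml) {
        return PATTERN_A.matcher(xml).replaceAll("");
    }

    static String stripNameOnly(String xml) {
        return NAME_ONLY.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // Content is empty, so group 1 must be empty: the indentation survives.
        System.out.println("[" + stripA("\n    <yyy></yyy>") + "]");
        // One space before the tag and one space inside: \1 matches, all removed.
        System.out.println("[" + stripA(" <a> </a>") + "]");
        // One space before, two inside: no value of group 1 fits, nothing removed.
        System.out.println("[" + stripA(" <a>  </a>") + "]");

        String xml = "<root>\n    <yyy></yyy>\n    <yyy>456</yyy>\n"
                   + "    <aaa>  \n\n    </aaa>\n</root>";
        System.out.println(stripNameOnly(xml));
        // Mismatched names are left alone, unlike with the pattern that has no groups.
        System.out.println(stripNameOnly("<a></b>"));
    }
}
```

Keeping `\1` on the tag name is the one place where a back reference does useful work here: it makes sure the closing tag belongs to the opening tag, while plain `\s*` handles the whitespace.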

Joni