regex that matches for two different places on input

Question

The use case is reformatting XML. I currently have a snippet that looks like this:

<dependencies>
    <dependency>
        <groupId>
             com.googlecode.java-diff-utils
        </groupId>
        <artifactId>
             diffutils
        </artifactId>
        <version>
             1.3.0
        </version>
    </dependency>
</dependencies>

I want it to look like this:

<dependencies>
    <dependency>
       <groupId>com.googlecode.java-diff-utils</groupId>
       <artifactId>diffutils</artifactId>
       <version>1.3.0</version>
    </dependency>
</dependencies>

So the case is that I want to match <tag></tag> pairs that do not have additional pairs within them, something like this:

output.replaceAll("<{TAG}>\\s+([^<>])\\s+</{TAG}>",  
                  "<{TAG}>($1)</{TAG}>")

where {TAG} can be matched.

You should absolutely use an XPath parser here, not regex. Search around on SO for this and you'll find what you need. – Tim Biegeleisen, Jul 20 '18 at 05:14
Have you considered... I don’t know ... XSLT maybe? See also [this answer](https://stackoverflow.com/a/1732454/3690024) – AJNeufeld, Jul 20 '18 at 05:17
I think it's too overblown for this use case. I just want to match on two patterns. It's a subset of XML, without attributes. – zcaudate, Jul 20 '18 at 05:38
I would definitely use a parser if it contained attributes... however, I'm the one generating the XML and so I can have certain guarantees over what it looks like. – zcaudate, Jul 20 '18 at 05:48
Try `\\b>\\R++\\h*+((?>\\s*[^\\s<].*)+)\\s*` and replace with `>\\1`. See live demo here https://regex101.com/r/P2UXM1/1 – revo, Jul 20 '18 at 06:00

Jake Harmon · Accepted Answer · 2018-07-20T06:13:07.000


As others have stated, you shouldn't regex XML. It's far easier and more robust to use XML parsers.
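For comparison, here is a minimal sketch of the parser route, assuming the JDK's built-in DOM parser and an identity `Transformer` are acceptable. The class and method names (`LeafTextTrimmer`, `trimLeafText`, `reformat`) are made up for illustration and do not come from the thread. It trims the text of every element whose only child is a text node and leaves the rest of the document untouched:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
class LeafTextTrimmer {

    // Trims the text of every element whose only child is a single text node.
    static void trimLeafText(Node node) {
        NodeList children = node.getChildNodes();
        if (node.getNodeType() == Node.ELEMENT_NODE
                && children.getLength() == 1
                && children.item(0).getNodeType() == Node.TEXT_NODE) {
            Node text = children.item(0);
            text.setNodeValue(text.getNodeValue().trim());
            return;
        }
        for (int i = 0; i < children.getLength(); i++) {
            trimLeafText(children.item(i));
        }
    }

    // Parses the XML, trims leaf text, and serializes it back without a declaration.
    static String reformat(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            trimLeafText(doc.getDocumentElement());

            // A Transformer created without a stylesheet copies the tree as is.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked parser/transformer exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("Could not reformat XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = String.join("\n",
                "<dependencies>",
                "    <dependency>",
                "        <groupId>",
                "             com.googlecode.java-diff-utils",
                "        </groupId>",
                "    </dependency>",
                "</dependencies>");
        System.out.println(reformat(xml));
    }
}
```

Whitespace between elements is kept because it lives in separate text nodes that the sketch never touches, so the existing indentation survives.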

However, since late-night regex is so fun, here's a simple one that would work here:

String output = oldStr.replaceAll("(?m)<(\\w+)>\\s+([^<>]*)$\\s+</\\1>", "<$1>$2</$1>");

Again, don't use anything like that in prod code. There are plenty of edge cases that would break almost any regex on XML.
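To try the one-liner against the snippet from the question, a small self-contained sketch like the following may help. The pattern and replacement are the ones from the answer; the class name `TagCollapser`, the method `collapse`, and the precompiled `Pattern` with `Pattern.MULTILINE` (the equivalent of the inline `(?m)`) are assumptions added for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from the answer.
class TagCollapser {

    // <tag>, whitespace, text without '<' or '>', end of line, whitespace, </tag>.
    // The backreference \1 forces the closing tag to match the opening one.
    private static final Pattern LEAF_TAG =
            Pattern.compile("<(\\w+)>\\s+([^<>]*)$\\s+</\\1>", Pattern.MULTILINE);

    static String collapse(String xml) {
        return LEAF_TAG.matcher(xml).replaceAll("<$1>$2</$1>");
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "<dependencies>",
                "    <dependency>",
                "        <groupId>",
                "             com.googlecode.java-diff-utils",
                "        </groupId>",
                "        <artifactId>",
                "             diffutils",
                "        </artifactId>",
                "        <version>",
                "             1.3.0",
                "        </version>",
                "    </dependency>",
                "</dependencies>");
        System.out.println(collapse(input));
    }
}
```

Because `\w+` only covers letters, digits and underscores, tag names containing `-`, `.` or `:` are left alone, and trailing spaces after the text would end up inside the collapsed element. Both are fine for the snippet above but are examples of the edge cases mentioned.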

edited Jul 20 '18 at 06:13

answered Jul 20 '18 at 05:41


How would you know if the first `(\\w+)` is the same as the second? – zcaudate Jul 20 '18 at 05:42
Oops, fixed it. Though, since neither '<' nor '>' can come between the tags, I'm not sure it's necessary (since this means there can't be another tag between the opening and closing tag)... I'd have to think about that some more. – Jake Harmon Jul 20 '18 at 05:49

The `(?m)` is the multiline flag. It's what lets me use the end anchor (the '$') to match at the end of a line, just before the newline, instead of only at the end of the input. – Jake Harmon Jul 20 '18 at 06:08
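The effect of the flag can be seen in isolation with a tiny sketch. The class and method names here are hypothetical; only `String.replaceAll` and the `(?m)` flag come from the answer. It inserts a marker wherever `$` matches:

```java
// Hypothetical demo of what (?m) changes about the '$' anchor.
class MultilineAnchorDemo {

    // Inserts '!' at every position where '$' matches.
    static String markLineEnds(String text, boolean multiline) {
        String regex = multiline ? "(?m)$" : "$";
        return text.replaceAll(regex, "!");
    }

    public static void main(String[] args) {
        String text = "a\nb";
        // Without (?m): '$' matches only at the end of the input.
        System.out.println(markLineEnds(text, false).replace("\n", "\\n")); // a\nb!
        // With (?m): '$' also matches just before each line terminator.
        System.out.println(markLineEnds(text, true).replace("\n", "\\n"));  // a!\nb!
    }
}
```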
