
The use case is reformatting XML. I currently have a snippet that looks like this:

<dependencies>
    <dependency>
        <groupId>
             com.googlecode.java-diff-utils
        </groupId>
        <artifactId>
             diffutils
        </artifactId>
        <version>
             1.3.0
        </version>
    </dependency>
</dependencies>

I want it to look like this:

<dependencies>
    <dependency>
        <groupId>com.googlecode.java-diff-utils</groupId>
        <artifactId>diffutils</artifactId>
        <version>1.3.0</version>
    </dependency>
</dependencies>

So I want to match <tag></tag> pairs that contain no nested pairs, with something like this:

output.replaceAll("<{TAG}>\\s+([^<>]+)\\s+</{TAG}>",
                  "<{TAG}>$1</{TAG}>")

where {TAG} stands for whatever tag name was matched.

zcaudate
  • You should absolutely use an XPath parser here, not regex. Search around on SO for this and you'll find what you need. – Tim Biegeleisen Jul 20 '18 at 05:14
  • Have you considered... I don’t know ... xslt maybe? See also [this answer](https://stackoverflow.com/a/1732454/3690024) – AJNeufeld Jul 20 '18 at 05:17
  • I think it's too overblown for this use case. I just want to match on two patterns. It's a subset of XML, without attributes. – zcaudate Jul 20 '18 at 05:38
  • I would definitely use a parser if it contained attributes... however, I'm the one generating the XML and so I can have certain guarantees over what it looks like. – zcaudate Jul 20 '18 at 05:48
  • Try `\\b>\\R++\\h*+((?>\\s*[^\\s<].*)+)\\s*` and replace with `>\\1`. See live demo here https://regex101.com/r/P2UXM1/1 – revo Jul 20 '18 at 06:00
  • @revo, does it work with Java? – zcaudate Jul 20 '18 at 06:12
  • Yes, just give the same strings to the `replaceAll` method. – revo Jul 20 '18 at 06:13

1 Answer


As others have stated, you shouldn't regex XML. It's far easier and more robust to use XML parsers.
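
For reference, here is a rough sketch of what the parser route could look like with the JDK's built-in DOM and Transformer APIs. The class name `XmlTextTrimmer` and its methods are made up for illustration, and it assumes the simple attribute-free XML from the question; treat it as a starting point rather than a finished tool.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not from the original answer.
final class XmlTextTrimmer {

    // Walks the tree: whitespace-only text nodes are removed,
    // all other text nodes are trimmed in place.
    static void trimText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else {
                trimText(child);
            }
            child = next;
        }
    }

    // Parses the XML, trims the text, and serializes it again with indentation.
    static String reformat(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            trimText(doc.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrap checked parser/transformer exceptions for easier calling.
            throw new IllegalArgumentException("Could not reformat XML", e);
        }
    }
}
```

Calling `XmlTextTrimmer.reformat(oldStr)` should give leaf elements such as `<groupId>com.googlecode.java-diff-utils</groupId>` on a single line. The exact indentation width is decided by the JDK's serializer, so it may differ between Java versions.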

However, since late-night regex is so fun, here's a simple one that would work here:

String output = oldStr.replaceAll("(?m)<(\\w+)>\\s+([^<>]*)$\\s+</\\1>", "<$1>$2</$1>");

Again, don't use anything like that in prod code. There are plenty of edge cases that would break almost any regex on XML.
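
To see the regex in action, here is a small self-contained sketch that runs the same `replaceAll` call on a shortened version of the snippet from the question. The wrapper class `TagCollapser` and the method `collapse` are names invented for this example.

```java
// Hypothetical wrapper class, for illustration only.
final class TagCollapser {

    // Same pattern and replacement as the replaceAll call in this answer:
    // (?m) makes $ match at the end of each line,
    // (\w+) captures the tag name and \1 requires the same name in the closing tag,
    // ([^<>]*) captures the text, which therefore cannot contain a nested tag.
    static String collapse(String xml) {
        return xml.replaceAll("(?m)<(\\w+)>\\s+([^<>]*)$\\s+</\\1>", "<$1>$2</$1>");
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "<dependencies>",
                "    <dependency>",
                "        <groupId>",
                "             com.googlecode.java-diff-utils",
                "        </groupId>",
                "        <version>",
                "             1.3.0",
                "        </version>",
                "    </dependency>",
                "</dependencies>");
        System.out.println(collapse(input));
    }
}
```

Running `main` should print `<groupId>com.googlecode.java-diff-utils</groupId>` and `<version>1.3.0</version>` each on one line, with the outer `<dependencies>` and `<dependency>` elements left untouched. Note that `\w+` only covers tag names made of letters, digits, and underscores, and that the pattern requires at least one whitespace character on each side of the text.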

  • How would you know if the first `(\\w+)` is the same as the second? – zcaudate Jul 20 '18 at 05:42
  • Oops, fixed it. Though, since neither '<' nor '>' can come between the tags, I'm not sure it's necessary (since this means there can't be another tag between the opening and closing tag)... I'd have to think about that some more. – Jake Harmon Jul 20 '18 at 05:49
  • The `(?m)` is the multiline flag. It's what lets the end anchor (the '$') match at the end of each line, just before the newline. – Jake Harmon Jul 20 '18 at 06:08