I am using Maven and have a lot of dependencies I want to remove. To automate the process for future use, I am using PowerShell to replace each dependency with an empty string via a regex. The dependencies are scattered throughout my pom file like so:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>${spring.version}</version>
</dependency>


<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>${spring.version}</version>
    <exclusions>
        <exclusion>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>

The current iteration of the regex is as follows:

(<dependency>)(\s*?\S*?\w*?\W*?.*?\X*?\R*?\v*?)(spring-jdbc)(\s*?\S*?\w*?\W*?.*?\X*?\R*?\v*?)(<\/dependency>)

Using the preceding regex with "spring-jdbc", I can successfully find the dependency, provided it is the first one encountered. If I switch "spring-jdbc" to "spring-core", the entire text is selected. I tried inserting negative lookaheads/lookbehinds to exclude dependency tags within the pattern, like so:

(<dependency>)((?!<dependency>)\s*?\S*?\w*?\W*?.*?\X*?\R*?\v*?)(spring-core)(\s*?\S*?\w*?\W*?.*?\X*?\R*?\v*?(?<!<dependency>))(<\/dependency>)

But this only stops the tags appearing immediately after the start tag and immediately before the end tag. I want the entire gap between the start dependency tag and dependency name to not include an extra start dependency tag, and the same for the gap between the dependency name and end dependency tag but this time excluding an extra end dependency tag.

A link to regex101 example.

As it stands, I am getting the impression that PowerShell and regexes were not intended for this kind of task. I would probably be better off writing a Java program or something similar to read the XML, but for the sake of learning PowerShell, I would like to know if it's possible. There are similar examples already, but few (if any) seem to require a known constant in the center of the regex as well as excluding words between the endpoints of the tags (most XML/HTML examples I have seen just want all the characters in the tag bodies).

Thanks for any assistance.

Cian

2 Answers


Stop using RegEx to Parse XML

Regex is generally not well suited to it. In PowerShell you can cast a string to [xml] and treat it like an object: select elements with XPath, remove them, then re-serialize the document to a string.

I can't really demonstrate it without the full XML, though, because that snippet is not a valid document on its own.
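As a rough sketch only, here is how that approach could look against an assumed minimal POM. The element layout below is a stand-in for the full file, and the variable names ($pomText, $targets) are illustrative. A real pom.xml normally declares the Maven default namespace, so the XPath needs a prefix bound to it.

```powershell
# Hypothetical minimal POM; a real file could be read with Get-Content -Raw.
$pomText = @'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-jdbc</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
    </dependency>
  </dependencies>
</project>
'@

$pom = [xml]$pomText

# Bind the prefix "m" to the POM default namespace so XPath can address the elements.
$ns = New-Object System.Xml.XmlNamespaceManager($pom.NameTable)
$ns.AddNamespace('m', 'http://maven.apache.org/POM/4.0.0')

# artifactIds to remove (illustrative list)
$targets = @('spring-jdbc')

foreach ($name in $targets) {
    $xpath = "//m:dependency[m:artifactId='$name']"
    # @(...) snapshots the node list before any node is removed from the tree.
    foreach ($node in @($pom.SelectNodes($xpath, $ns))) {
        [void]$node.ParentNode.RemoveChild($node)
    }
}

# Re-serialize the modified document to a string.
$result = $pom.OuterXml
$result
```

Because the XPath tests only the direct artifactId child of each dependency, an artifactId nested inside an exclusion does not trigger a removal.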

briantist

I can't fault @briantist's excellent answer, but ... "use regex where it's not appropriate" is a fun challenge, so I offer:

$x=@'
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>${spring.version}</version>
</dependency>


<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>${spring.version}</version>
    <exclusions>
        <exclusion>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>
'@

# \r?\n accepts both Windows (CRLF) and Unix (LF) line endings
Write-Host "spring-jdbc" -ForegroundColor Cyan
[regex]::Matches($x, '(?m)<dependency>\r?\n(^ .*\r?\n)+(^ .*spring-jdbc.*\r?\n)(^ .*\r?\n)+</dependency>').Value

Write-Host "spring-core" -ForegroundColor Cyan
[regex]::Matches($x, '(?m)<dependency>\r?\n(^ .*\r?\n)+(^ .*spring-core.*\r?\n)(^ .*\r?\n)+</dependency>').Value

The regex is:

  • (?m) - Multiline mode, so ^ matches at the start of every line ($x is a single string, not an array of lines)
  • <dependency>\r?\n - opening tag, followed by a line break
  • (^ .*\r?\n)+ - one or more lines beginning with a space
  • (^ .*spring-core.*\r?\n) - a line, beginning with a space, that includes the search text
  • (^ .*\r?\n)+ - one or more lines beginning with a space
  • </dependency> - closing tag

So this will only work if your indentation matches your snippet. That is terrible: XML is structured, and parsing it should not depend on its presentation.
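If you want to stay with regex for the learning exercise, one way to stop a match from running across a dependency boundary is to guard every consumed character with a negative lookahead. This is a sketch, not a recommendation: the pattern and the variable names ($name, $pattern, $out) are my own, and it would still pick the wrong block if the search text appeared as an artifactId inside another dependency's exclusions.

```powershell
$x = @'
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>${spring.version}</version>
</dependency>

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>${spring.version}</version>
    <exclusions>
        <exclusion>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>
'@

$name = 'spring-core'

# (?s) lets . match line breaks.
# (?:(?!</?dependency>).)*? consumes one character at a time, but only while
# the upcoming text is not an opening or closing dependency tag.
$gap = '(?:(?!</?dependency>).)*?'
$pattern = '(?s)<dependency>' + $gap +
           '<artifactId>' + [regex]::Escape($name) + '</artifactId>' +
           $gap + '</dependency>'

# Replace the matched dependency block with an empty string.
$out = [regex]::Replace($x, $pattern, '')
$out
```

This does not rely on indentation or line endings, but it is still text matching rather than XML parsing.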

Instead, you should process it as an XML document. For example, after adding a fake root node to your snippet, I can do this:

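# Wrap the snippet in a fake root so it parses as a single XML document
$y = [xml]"<root>$x</root>"
# Pick the dependency whose artifactId matches
$badDep = $y.root.dependency | Where-Object artifactId -eq 'spring-jdbc'
# RemoveChild returns the removed node; [void] keeps it out of the output
[void]$y.root.RemoveChild($badDep)
$y.InnerXml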

Presumably your entire document is valid XML, so you wouldn't need to do that. I'm not sure about good XML processing and serializing out to text.
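For the serializing part, one option that I believe works is to save the document through an XmlWriter with indentation turned on. This is a minimal sketch with a made-up one-dependency document. The settings shown are choices for the example, not requirements.

```powershell
# Made-up document standing in for the result of the removal step.
$doc = [xml]'<root><dependency><artifactId>spring-core</artifactId></dependency></root>'

# Ask the writer to indent nested elements and skip the <?xml ...?> declaration.
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Indent = $true
$settings.OmitXmlDeclaration = $true

# Write into an in-memory string instead of a file.
$sw = New-Object System.IO.StringWriter
$writer = [System.Xml.XmlWriter]::Create($sw, $settings)
$doc.Save($writer)
$writer.Close()

$text = $sw.ToString()
$text
```

To write straight to disk instead, the XmlDocument Save method also accepts a file path.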

TessellatingHeckler
  • For extra fun, why not use [balancing regex groups](http://www.regular-expressions.info/balancing.html) ;) The .Net RegEx engine is one of the few that support them. Excellent answer! – briantist Jul 01 '16 at 18:06
  • Oh wow, that's cool. Thanks! Yeah, the document doesn't necessarily have that indentation, I will switch to processing it as XML. Thanks so much. – Cian Jul 04 '16 at 09:38