
In essence, I am trying to modify every token on a line that matches a criterion. I have a file with many lines, and each line can contain many instances; a given line may or may not match at all. What I want to replace are XML values, for example:

<ns0:house>indifferent token</ns0:house> --> <ns0:house>xxx</ns0:house>
Here the token "indifferent token" is replaced with xxx.

The XML is not guaranteed to be complete (it could be a snippet) ...

Here is what I have

 $output =~ s/(<.+house>)(.*)(\/.+house>)/$1xxx$3/g

I would read this as: substitute, globally, all characters between <house> and </house>. I simplified the XML element here, but the .+ should account for any arbitrary namespace.

The resulting string has only some occurrences replaced. Logically, I know it has to do with the regex being greedy, but I cannot figure out how to fix it. I have pulled all my hair out trying to work around this.

I believe I have an alternative (more code) using split, but that is ugly.

Thoughts or suggestions welcome.

dwfa
  • Can you give specific examples of input strings that worked as expected, vs ones that should have matched but didn't? If you're just wondering about how to make your regex quantifiers lazy instead of greedy, that's done by adding a `?`, i.e. `<.+?house>`. – CAustin Nov 27 '19 at 20:45
  • Try [this](https://regex101.com/r/dqsKoB/1): `s/(<([^>]+)house>).*?(<\/\2house>)/$1xxx$3/g` if you must use regex; preferably, use a parser. – ctwheels Nov 27 '19 at 20:45
  • The actual sample is too large to display here (and of course I had to change the name of the tags to protect the innocent). I do know that this (unable to provide sample) is a limitation I need to deal with; but thx for your support. – dwfa Nov 29 '19 at 15:24
  • I should have added that the XML is not always a complete XML document, it could be a snippet and is not guaranteed to be well-formatted (i.e. have all tags balanced). – dwfa Nov 29 '19 at 15:44
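To make the greedy-versus-lazy point from the comments concrete, here is a minimal sketch. The sample line is made up for illustration, and the lazy pattern is only the original one with `?` added to each quantifier (plus the `<` pulled into the third group), not a recommended final form.

```perl
use strict;
use warnings;

# Illustrative input: two <ns0:house> elements on one line, as in the question.
my $line = '<ns0:house>one</ns0:house><ns0:house>two</ns0:house>';

# Greedy: .+ and .* grab as much as possible, so the first group swallows
# everything up to the last opening tag and only one replacement happens.
(my $greedy = $line) =~ s/(<.+house>)(.*)(\/.+house>)/$1xxx$3/g;

# Lazy: .+? and .*? stop at the first place the rest of the pattern can match,
# so each element is matched and replaced on its own.
(my $lazy = $line) =~ s/(<.+?house>)(.*?)(<\/.+?house>)/$1xxx$3/g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Note that a lazy `.+?` can still run across unrelated tags that precede a house element on the same line, which is presumably why the suggested pattern uses `[^>]+` to stay inside a single tag.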

3 Answers

3

Don't use regular expressions; all your problems come from that approach. There are several XML parsers to choose from. Here's how you could do it with Mojo::DOM:

use strict;
use warnings;
use Mojo::DOM;

my $data = q{<ns0:house>indifferent token</ns0:house>};
my $dom = Mojo::DOM->new->xml(1)->parse($data);
foreach my $tag ($dom->find('house')->each) {
  $tag->content('xxx'); # this should already be XML-escaped if needed
}
print $dom;
Grinnz
1

A good regex to find the house open/close tags and replace the content between them with xxx:

$data =~ s/(?s)<((?>[\w:]+:)?house)(?>\s+(?:".*?"|'.*?'|[^>]*?)+)?\s*>(?<!\/>)\K(?:(?!<\/\1\s*>).)*?(?=<\/\1\s*>)/xxx/g;

or something quicker

$data =~ s/(?s)<((?>[\w:]+:)?house)(?>\s+(?:".*?"|'.*?'|[^>]*?)+)?\s*>(?<!\/>)\K.*?(?=<\/\1\s*>)/xxx/g;
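For anyone unpicking the patterns above, the core idea can be shown in a stripped-down form. This is a simplified sketch, not an equivalent of either regex: it ignores attributes entirely, assumes namespace prefixes consist of `[\w.-]` characters, and the helper name `mask_house` is made up for illustration. It needs Perl 5.10 or later for `\K`.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer above: masks the text of every
# attribute-free house element, with or without a namespace prefix.
sub mask_house {
    my ($text) = @_;
    $text =~ s{
        <((?:[\w.-]+:)?house)>   # opening tag; \1 holds the name with optional prefix
        \K                       # keep the opening tag out of the replaced text
        .*?                      # element content, as little as possible
        (?=</\1>)                # stop right before the matching closing tag
    }{xxx}gsx;
    return $text;
}

print mask_house('<ns0:house>a</ns0:house> <b:house>c d</b:house>'), "\n";
```

Because the closing tag is required by the lookahead, an unbalanced snippet such as an opening tag with no matching close is left untouched rather than partially rewritten.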

0

Thanks to all who provided suggestions and support. I found a solution that seems to be working; here is the regex:

s/(<([^>]+)(house)>)(.*?)(<\/\2\3>)/$1xxx$5/g

yeah !!! - thx again

To make it even more extensible, I used Perl's variable interpolation, with \Q...\E so the token is matched literally:

s/(<([^>]+)(\Q$token\E)>)(.*?)(<\/\2\3>)/$1xxx$5/g
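As a quick check, the substitution can be wrapped in a small sub and run on a line with several elements. This is only a sketch: the sub name `mask_token`, its arguments, and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above; name and arguments
# are illustrative only.
sub mask_token {
    my ($text, $token, $mask) = @_;
    # \2 is whatever precedes the token inside the tag (e.g. "ns0:"),
    # \3 is the token itself; the closing tag must repeat both.
    $text =~ s/(<([^>]+)(\Q$token\E)>)(.*?)(<\/\2\3>)/$1$mask$5/g;
    return $text;
}

my $line = '<ns0:house>one</ns0:house><ns1:car>red</ns1:car><ns0:house>two</ns0:house>';
print mask_token($line, 'house', 'xxx'), "\n";
```

Two caveats follow from the pattern as written: `[^>]+` needs at least one character before the token, so an unprefixed `<house>` is not matched, and any tag name that merely ends in the token (say `ns0:warehouse`) is masked as well.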
dwfa