Perl / Regex String Manipulation for multiple matches

Question

I have the following string:

<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />

I need to strip everything in this string apart from:

Feed="EUREX-EMDI"
State="CLOSED"
IsTainted="0"

I have managed to get "Feed="EUREX-EMDI"" with the following code:

s/^[^Feed]*(?=Feed)//;

So it now looks like:

Feed="EUREX-EMDI" IPPort="224.0.50.0:59098" State="CLOSED" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="2191840" />

However, I don't know how to find the next part, "State="CLOSED"", in the string while skipping past the "Feed="EUREX-EMDI"" match I have already found.

You need an [XML parser](http://stackoverflow.com/search?q=%5Bperl%5D+parse+xml), not a regular expression. — Matt Jacob, Mar 31 '16 at 16:55
However, in this case, if your data really is as simple as you suggest, you might be able to get away with `%hash = $str =~ /(\w+)="([^"]+)"/g;` — Matt Jacob, Mar 31 '16 at 17:08
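
A minimal sketch of the hash approach from the comment above, assuming every attribute is double-quoted with a non-empty value and no attribute name repeats. The names `$str`, `%attr`, `@wanted`, and `$kept` are illustrative and do not come from the original post.

```perl
use strict;
use warnings;

# Sample line from the question.
my $str = '<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />';

# In list context, /g returns every (name, value) capture pair,
# and that flat list assigns straight into a hash.
my %attr = $str =~ /(\w+)="([^"]+)"/g;

# Rebuild only the attributes of interest, in a fixed order (illustrative list).
my @wanted = qw(Feed State IsTainted);
my $kept   = join ' ', map { sprintf '%s="%s"', $_, $attr{$_} } @wanted;

print "$kept\n";
```

Because the pattern uses `[^"]+`, an attribute with an empty value (`Name=""`) would be skipped entirely; switching to `[^"]*` would capture it as an empty string.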

score 1 · Accepted Answer · answered Mar 31 '16 at 19:04

The Perl idiom for this type of thing is a list assignment from regex capture groups. This assumes the items of interest always appear in the same order and use the same quoting:

($feed, $state, $istainted) = /.*(Feed="[^"]*").*(State="[^"]*").*(IsTainted="[^"]*")/;

Or if you only want to capture the (unquoted) values themselves, change the parentheses (capture groups):

($feed, $state, $istainted) = /.*Feed="([^"]*)".*State="([^"]*)".*IsTainted="([^"]*)"/;
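
To show the idiom end to end, here is a small sketch that applies the value-capturing pattern to the line from the question and checks whether the match succeeded. The helper name `extract_fields` is hypothetical, and the `\b` anchors are an addition (not part of the answer) so that a longer name ending in `Feed` or `State` would not match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (feed, state, istainted) on success,
# or an empty list when the line does not match.
sub extract_fields {
    my ($line) = @_;
    return $line =~ /\bFeed="([^"]*)".*\bState="([^"]*)".*\bIsTainted="([^"]*)"/;
}

my $line = '<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />';

# A list assignment evaluated in boolean context yields the number of
# values returned, so an empty list (no match) triggers the die.
my ( $feed, $state, $istainted ) = extract_fields($line)
    or die "line did not match the expected attribute order\n";

print "Feed=$feed State=$state IsTainted=$istainted\n";
```

If the attributes can appear in a different order, this pattern fails outright rather than returning partial results, which is usually the safer behaviour.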

score 1 · Answer 2 · edited May 23 '17 at 12:23

Please, don't try to parse XML with a regex. It's brittle. XML is contextual, and regular expressions aren't. So at best it's a dirty hack, and one that may one day break without warning for the most inane reasons.

See: RegEx match open tags except XHTML self-contained tags for more.

However, XML is structured, and it's actually quite easy to work with - provided you use something well suited to the job: A parser.

I like XML::Twig. XML::LibXML is also excellent, but has a steeper learning curve. (You also get XPath, which is like regular expressions but much better suited to XML.)

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;
#create a list of what we want to keep. This map just turns it
#into a hash. 
my %keep = map { $_ => 1 } qw ( IsTainted State Feed );

#parse the XML. If it's a file, you may want "parsefile" instead. 
my $twig = XML::Twig->parse( \*DATA );

#iterate the attributes. 
foreach my $att ( keys %{ $twig->root->atts } ) {
   #delete the attribute unless it's in our 'keep' list. 
   $twig->root->del_att($att) unless $keep{$att};
}
#print it. You may find set_pretty_print useful for formatting XML. 
$twig->print;

__DATA__
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />

Outputs:

<Multicast Feed="EUREX-EMDI" IsTainted="0" State="CHECK"/>

That preserves the attributes, and gives you valid XML. But if you just want the values:

foreach my $att ( qw ( Feed State IsTainted ) ) {
   print $att, "=", $twig->root->att($att),"\n";
}
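
For comparison, the XML::LibXML and XPath route mentioned earlier might look roughly like the sketch below. It assumes XML::LibXML is installed and that the element is the document root named `Multicast`, as in the sample; the variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = '<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />';

# Parse the XML from a string.
my $doc = XML::LibXML->load_xml( string => $xml );

# In XPath, "@Name" selects an attribute; findvalue returns its string value.
my %value = map { $_ => $doc->findvalue("/Multicast/\@$_") } qw(Feed State IsTainted);

foreach my $att (qw(Feed State IsTainted)) {
    print $att, "=", $value{$att}, "\n";
}
```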

score 0 · Answer 3 · 2016-03-31T17:43:41.737

This will strip all but those strings.

If you want to include a space separator, make the replacement ' $1'.

Explained

 (?s)                          # Dot - all
 (?:                           # To be removed
      (?!
           (?: Feed | State | IsTainted )
           \s* = \s* " .*? "
      )
      . 
 )*
 (?:                           # To be saved
      (                             # (1 start)
           (?: Feed | State | IsTainted )
           \s* = \s* " .*? "
      )                             # (1 end)
   |  $ 
 )
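
A sketch of how the expanded pattern above might be applied as a substitution, using the `/x` flag so the whitespace and comments are ignored and `/s` in place of the inline `(?s)`. The `$keep` sub-pattern, the `/e` replacement (which avoids an "uninitialized value" warning when group 1 did not take part in the match), and the final trim are additions for illustration, not part of the answer.

```perl
use strict;
use warnings;

my $str = '<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />';

# One wanted attribute: name, optional spaces around the equals sign, quoted value.
my $keep = qr/ (?: Feed | State | IsTainted ) \s* = \s* " .*? " /xs;

# Work on a copy so the original line is left untouched.
( my $stripped = $str ) =~ s/
    (?: (?! $keep ) . )*      # to be removed: text up to the next wanted attribute
    (?: ( $keep ) | $ )       # to be saved: group 1, or nothing at end of string
/ defined $1 ? " $1" : "" /gxse;

$stripped =~ s{^ }{};         # drop the leading separator
print "$stripped\n";
```

Each pass of the `/g` loop discards the run of unwanted text and puts back only the captured attribute; the final pass, which reaches the end of the string, puts back nothing.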
