Please, don't try and parse XML with a regex. It's brittle. XML is contextual, and regular expression aren't. So at best, it's a dirty hack, and one that may one day break without warning for the most inane reasons.
See: RegEx match open tags except XHTML self-contained tags for more.
However, XML is structured, and it's actually quite easy to work with - provided you use something well suited to the job: A parser.
I like XML::Twig
. XML::LibXML
is also excellent, but has a bit of a steeper learning curve. (You also get XPath
which is like regular expressions, but much more well suited for XML)
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#create a list of what we want to keep. This map just turns it
#into a hash.
my %keep = map { $_ => 1 } qw ( IsTainted State Feed );
#parse the XML. If it's a file, you may want "parsefile" instead.
my $twig = XML::Twig->parse( \*DATA );
#iterate the attributes.
foreach my $att ( keys %{ $twig->root->atts } ) {
#delete the attribute unless it's in our 'keep' list.
$twig->root->del_att($att) unless $keep{$att};
}
#print it. You may find set_pretty_print useful for formatting XML.
$twig->print;
__DATA__
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
Outputs:
<Multicast Feed="EUREX-EMDI" IsTainted="0" State="CHECK"/>
That preserves the attributes, and gives you valid XML. But if you just want the values:
foreach my $att ( qw ( Feed State IsTainted ) ) {
print $att, "=", $twig->root->att($att),"\n";
}