Please, please, please. Don't use regular expressions to parse XML.
It's bad news. It's brittle and hacky, and most importantly of all - completely unnecessary.
Regular expressions do not handle context. And XML is all about context.
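To make the brittleness concrete, here's a minimal sketch. The sample elements, the pattern and the naive_extract helper are all made up for illustration; they're just the sort of thing a hand-rolled regex approach tends to look like. All three strings describe the same element as far as any XML parser is concerned, but the pattern only copes with the first spelling.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Three spellings of the same element (hypothetical sample data).
# They differ only in attribute order, quote style and whitespace,
# none of which matters to an XML parser.
my @equivalent = (
    q{<consumer attribute="value" id="1"/>},
    q{<consumer id="1" attribute='value'/>},
    qq{<consumer\n    id="1"\n    attribute = "value" />},
);

# A typical hand-rolled pattern (an assumption, for illustration only).
my $naive = qr/<consumer attribute="([^"]*)"/;

# Returns the captured attribute value, or undef if the pattern fails.
sub naive_extract {
    my ($xml) = @_;
    return $xml =~ $naive ? $1 : undef;
}

for my $xml (@equivalent) {
    my $value = naive_extract($xml);
    print defined $value ? "matched '$value'\n" : "no match\n";
}
```

Only the first string matches. You can keep patching the pattern for each new spelling, but that's exactly the work a parser already does for you.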
XML already has a query language called XPath, which is far better suited.
Here's an example of finding a node using XPath:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> new -> parsefile ('yourfile.xml');
print $twig -> get_xpath('//consumer', 0) -> att('attribute'),"\n";
But if you want to transform it and delete the attribute named attribute:
$_ -> del_att('attribute') for $twig -> get_xpath('//consumer[@attribute]');
$twig -> set_pretty_print('indented_a');
$twig -> print;
I would ask though - why are you trying to do that? It sounds rather like another broken process somewhere - maybe another script trying to regex the XML?
But the other thing that XML::Twig does really well is twig_handlers, which let you handle XML streams more neatly (e.g. without needing to parse it all into memory).
That goes a bit like this:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub delete_unwanted {
my ( $twig, $element ) = @_;
$element -> del_att('attribute');
#dump progress so far 'out'.
$twig -> flush;
#free memory already processed.
$twig -> purge;
}
my $twig = XML::Twig -> new ( twig_handlers => { '//consumer[@attribute]' => \&delete_unwanted } );
$twig -> parsefile ( 'your_xml.xml');
#print whatever is left after the last handler call (e.g. the closing tags).
$twig -> flush;
We set a handler so that each time the parser encounters a consumer element with an attribute named attribute (bad name, that), it deletes the attribute, flushes (prints) the XML parsed so far, and purges it from memory. This makes it very memory efficient, as you're not reading the whole thing into memory, and you can still do the sort of inline edits you might otherwise have reached for a regex to do.
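If you want to try the handler approach without a file to hand, here's a rough, self-contained sketch. The inline sample document and the $seen counter are assumptions added purely for illustration, and I've written the trigger in the shorter 'consumer[@attribute]' form. It parses a string rather than a file, but the handler logic is the same idea as above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, standing in for a much larger file.
my $xml = <<'XML';
<root>
  <consumer attribute="a" id="1"/>
  <consumer id="2"/>
  <consumer attribute="c" id="3"/>
</root>
XML

# Counts how often the handler fires (illustration only).
my $seen = 0;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Only consumer elements that actually carry the attribute trigger this.
        'consumer[@attribute]' => sub {
            my ( $t, $element ) = @_;
            $element->del_att('attribute');
            $seen++;
            # Print everything parsed so far and release it from memory.
            $t->flush;
        },
    },
);

$twig->parse($xml);

# Print whatever remains after the last handler call.
$twig->flush;

print STDERR "handler fired $seen time(s)\n";
```

The second consumer has no attribute to delete, so the handler never sees it; it's simply passed through with the next flush.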