I think the answer above is great if all your input XML is consistent with your example (i.e. very simple, containing only elements), or if you only have a handful of files that you can check by hand afterwards. In general, though, processing XML as text is a bad idea. By its nature it isn't plain text; it's highly structured. For instance, if the encoding matters or varies between files, you'll definitely want to parse it as XML.
I've become partial to XML::Twig because of its option to stream (it can also build a full XML tree), a parse style much closer to the command-line edit you've already seen here. I deal with a great deal of data, so that matters to me. XML::Twig is actually very easy to use, but the initial learning curve on implementation and configuration may take a bit of research.
Some people prefer XML::LibXML (a little simpler to set up), which offers a more DOM-style flavor but is more expensive when applied to large data sets and a bit more unwieldy with very large files. From there, other modules get a little less complex, XML::Simple for example.
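For comparison, here is a rough sketch of what the DOM-style route could look like with XML::LibXML, applied to the same kind of chapter headings. It assumes Lingua::EN::Words2Nums is installed for the word-to-number step, and the inline $xml sample is made up for illustration. Note that the whole document is built in memory, which is the cost mentioned above.

```perl
use strict;
use warnings;
use XML::LibXML;
use Lingua::EN::Words2Nums;

# Hypothetical sample document; for a real file you would use
# XML::LibXML->load_xml(location => $path) instead.
my $xml = <<'XML';
<root>
<h2>Chapter One</h2>
<h2>Chapter Two</h2>
</root>
XML

# Parse the entire document into an in-memory tree.
my $doc = XML::LibXML->load_xml(string => $xml);

for my $h2 ($doc->findnodes('//h2')) {
    my $text = $h2->textContent;
    $text =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
    (my $words = $text) =~ s/^chapter\s+//i;  # drop the leading "Chapter"
    my $num = words2nums($words);
    # Only set the attribute when the conversion produced an integer.
    $h2->setAttribute(id => $num) if defined $num && $num =~ /^\d+$/;
}

# Serialize the modified tree back to a string.
my $result = $doc->toString;
print $result;
```

The trade-off is simplicity against memory: XPath makes selecting the headings trivial, but nothing is written until the full tree has been built.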
Again, this greatly depends on your requirements, data size, validation standards etc. The one-liner is quick, but not quite best practice for handling XML.
Possible Solution
Assumptions -
- Your XML is well-formed; among other things, it has a single root element.
- Your chapters could run to a higher number than you're willing to type out
by hand.
- You won't have chapter values with some form of decimal/fraction (One.One,
or One and a Half etc.)
You could use XML::Twig and Lingua::EN::Words2Nums.
So, given input:
<root>
<h2>Chapter One</h2>
<h2>Chapter Two</h2>
<h2>Chapter Three</h2>
<h2>Chapter Four</h2>
</root>
This code:
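use strict;
use warnings;
use XML::Twig;
use Lingua::EN::Words2Nums;

# Only <h2> elements are built and handed to the handler;
# everything else is printed through unchanged.
my $twig = XML::Twig->new(
    twig_roots               => { 'h2' => \&h2_handler },
    twig_print_outside_roots => 1,
);

sub h2_handler {
    my ($twig, $elt) = @_;
    my $engNum = $elt->trimmed_text;
    # Strip the leading "Chapter" so only the number words remain;
    # unlike a single-word match, this also allows "Twenty One".
    $engNum =~ s/^chapter\s+//i;
    my $num = words2nums($engNum);
    if (defined $num and $num =~ /^\d+$/) {
        $elt->set_att(id => $num);
    } else {
        # Whatever you do if some chapter number is not what's expected
    }
    $elt->flush;
}

# Replace the placeholder with the path to your file
$twig->parsefile('pathToYourFile');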
Will output:
<root>
<h2 id="1">Chapter One</h2>
<h2 id="2">Chapter Two</h2>
<h2 id="3">Chapter Three</h2>
<h2 id="4">Chapter Four</h2>
</root>
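If you want to check how the word-to-number step behaves before running it over real data, a small sketch like the following exercises words2nums on its own. The headings are made up for illustration, and the last one is deliberately not a number, so you can see which values would end up in the "not what's expected" branch.

```perl
use strict;
use warnings;
use Lingua::EN::Words2Nums;

# Hypothetical headings: a simple one, a multi-word number,
# and one that is not a number at all.
my @headings = ('Chapter One', 'Chapter Twenty One', 'Chapter Appendix');

my %id_for;
for my $heading (@headings) {
    # Drop the leading "Chapter" so only the number words remain.
    (my $words = $heading) =~ s/^chapter\s+//i;
    my $num = words2nums($words);
    $id_for{$heading} = $num;
    printf "%-20s => %s\n", $heading,
        defined $num ? $num : '(not a number)';
}
```

Running this against a sample of your own headings is a cheap way to find out which ones need special handling.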