I have to extract some information from an XML file according to a pattern. I have a working script, but I'm fairly sure it could be a lot simpler and cleaner.
Could you tell me what could be better and why?
What my input looks like:
<modifs>
<modif id="14661"><code c="1" /><extra id="109816" /><avant num_words="1">démissionné</avant><apres num_words="1">démissionner</apres></modif>
<modif id="125247"><code c="1" /><avant num_words="1">demis-tons</avant><apres num_words="1">demi-tons</apres></modif>
<modif id="90891"><code c="1" /><avant num_words="1">démit</avant><apres num_words="1">démis</apres></modif>
<modif id="198379"><code c="1" /><avant num_words="1">demi-terain</avant><apres num_words="1">demi-terrain</apres></modif>
<modif id="172795"><code c="1" /><avant num_words="1">demi-ton</avant><apres num_words="1">demi-tons</apres></modif>
</modifs>
What I want:
When the content of the avant and apres tags ends with -é or -er, display each id and extra id, followed by the content of avant and apres.
So it looks like this:
id="14661"
extra id="109816"
démissionné |||| démissionner
What my script looks like:
use strict;
use warnings;
my $fichier = 'path';
my $fichiersortie = "path";
my @lignes ;
my @tableau_avant ;
my @tableau_apres ;
my @ids ;
my @extraids ;
my @radical_avant ;
my @radical_apres ;
open (OUTPUT, ">$fichiersortie");
binmode(OUTPUT, ":utf8");
open(my $fh, '<:encoding(UTF-8)', $fichier)
or die "Can't open file";
while (my $row = <$fh>) {
chomp $row;
@radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
@radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/ ;
@tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
@tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/ ;
@ids = $row =~ /<modif (id="\d+")>/ ;
@extraids = $row =~ /<(extra id="\d+")\s\/>/g ;
foreach my $id (@ids) {
foreach my $match_avant (@tableau_avant) {
foreach my $match_apres (@tableau_apres) {
foreach my $radical_avant (@radical_avant){
foreach my $radical_apres (@radical_apres){
if ($radical_avant eq $radical_apres) {
print OUTPUT "$id\n";
foreach my $extraid (@extraids) {
print OUTPUT "$extraid\n";}
print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n" ;}
}
}
}
}
}
}
close (OUTPUT);
Tidied up, the Perl code looks like this:
use strict;
use warnings;
my $fichier = 'path';
my $fichiersortie = "path";
my @lignes;
my @tableau_avant;
my @tableau_apres;
my @ids;
my @extraids;
my @radical_avant;
my @radical_apres;
open( OUTPUT, ">$fichiersortie" );
binmode( OUTPUT, ":utf8" );
open( my $fh, '<:encoding(UTF-8)', $fichier ) or die "Can't open file";
while ( my $row = <$fh> ) {
    chomp $row;
    @radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
    @radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/;
    @tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
    @tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/;
    @ids = $row =~ /<modif (id="\d+")>/;
    @extraids = $row =~ /<(extra id="\d+")\s\/>/g;
    foreach my $id (@ids) {
        foreach my $match_avant (@tableau_avant) {
            foreach my $match_apres (@tableau_apres) {
                foreach my $radical_avant (@radical_avant) {
                    foreach my $radical_apres (@radical_apres) {
                        if ( $radical_avant eq $radical_apres ) {
                            print OUTPUT "$id\n";
                            foreach my $extraid (@extraids) {
                                print OUTPUT "$extraid\n";
                            }
                            print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n";
                        }
                    }
                }
            }
        }
    }
}
close(OUTPUT);
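
Each of the arrays in the script above can hold at most one element per row (only the extra id match uses /g), so the five nested loops can probably be replaced by plain scalar captures. The following is a minimal sketch of that idea, not a drop-in replacement. The helper name format_modif is hypothetical, the sample rows are copied from the question instead of being read from a file, and `use utf8` assumes the script itself is saved as UTF-8. The patterns are also slightly stricter than the originals: they use `[^>]*` and `[^<]+?` so a match cannot run across tag boundaries.

```perl
use strict;
use warnings;
use utf8;                               # the regexes contain a literal é
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns the output block for one <modif> line,
# or undef (in scalar context) when the line does not qualify.
sub format_modif {
    my ($row) = @_;

    # Capture stem and ending together, so each tag is matched only once.
    my ( $stem_avant, $end_avant ) =
        $row =~ m{<avant[^>]*>([^<]+?)(er|é)</avant>} or return;
    my ( $stem_apres, $end_apres ) =
        $row =~ m{<apres[^>]*>([^<]+?)(er|é)</apres>} or return;

    # Same condition as the original script: both words share one stem.
    return unless $stem_avant eq $stem_apres;

    my ($id) = $row =~ /<modif (id="\d+")>/ or return;
    my @extra_ids = $row =~ m{<(extra id="\d+")\s*/>}g;

    return join( "\n",
        $id, @extra_ids,
        "$stem_avant$end_avant |||| $stem_apres$end_apres" )
        . "\n\n";
}

# Sample rows copied from the question; a real script would read them from a
# file opened with '<:encoding(UTF-8)' and check the result of every open.
my @rows = (
    '<modif id="14661"><code c="1" /><extra id="109816" /><avant num_words="1">démissionné</avant><apres num_words="1">démissionner</apres></modif>',
    '<modif id="90891"><code c="1" /><avant num_words="1">démit</avant><apres num_words="1">démis</apres></modif>',
);

for my $row (@rows) {
    my $block = format_modif($row);
    print $block if defined $block;
}
```

Matching XML with regular expressions only holds up here because every modif element sits on a single line with a fixed layout. If the layout can vary, a real XML parser such as XML::LibXML would be the more robust choice.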