Simplify Perl script to extract information with regex

Question

I have to extract some information from an XML file according to a pattern. I did complete a working script, but I'm pretty sure that it could be a lot simpler and/or cleaner.

Could you tell me what could be better and why?

What my input looks like:

<modifs>
  <modif id="14661"><code c="1" /><extra id="109816" /><avant num_words="1">démissionné</avant><apres num_words="1">démissionner</apres></modif>
  <modif id="125247"><code c="1" /><avant num_words="1">demis-tons</avant><apres num_words="1">demi-tons</apres></modif>
  <modif id="90891"><code c="1" /><avant num_words="1">démit</avant><apres num_words="1">démis</apres></modif>
  <modif id="198379"><code c="1" /><avant num_words="1">demi-terain</avant><apres num_words="1">demi-terrain</apres></modif>
  <modif id="172795"><code c="1" /><avant num_words="1">demi-ton</avant><apres num_words="1">demi-tons</apres></modif>
</modifs>

What I want :+

Display, when the content of avant and apres tags ends with -é or -er, each id and extra id, followed by the content of avant and apres.

So it looks like this:

id="14661"
extra id="109816"
démissionné |||| démissionner

What my script looks like:

 use strict;
    use warnings;


    my $fichier = 'path';
    my $fichiersortie = "path";
    my @lignes ;
    my @tableau_avant ;
    my @tableau_apres ;
    my @ids ;
    my @extraids ; 
    my @radical_avant ;
    my @radical_apres ;

    open (OUTPUT, ">$fichiersortie");
    binmode(OUTPUT, ":utf8");
    open(my $fh, '<:encoding(UTF-8)', $fichier)
      or die "Can't open file";

    while (my $row = <$fh>) {
        chomp $row;

        @radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
        @radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/ ;
        @tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
        @tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/ ;
        @ids = $row =~ /<modif (id="\d+")>/ ;
        @extraids = $row =~ /<(extra id="\d+")\s\/>/g ;


        foreach my $id (@ids) {
        foreach my $match_avant (@tableau_avant) {
        foreach my $match_apres (@tableau_apres) {

        foreach my $radical_avant (@radical_avant){
        foreach my $radical_apres (@radical_apres){
        if ($radical_avant eq $radical_apres) {

        print OUTPUT "$id\n";
foreach my $extraid (@extraids) {
        print OUTPUT "$extraid\n";}
        print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n" ;}
        }
        }
                }
        }
        }
}
close (OUTPUT);

Tidied up, the Perl code looks like this

use strict;
use warnings;

my $fichier       = 'path';
my $fichiersortie = "path";
my @lignes;
my @tableau_avant;
my @tableau_apres;
my @ids;
my @extraids;
my @radical_avant;
my @radical_apres;

open( OUTPUT, ">$fichiersortie" );
binmode( OUTPUT, ":utf8" );

open( my $fh, '<:encoding(UTF-8)', $fichier ) or die "Can't open file";

while ( my $row = <$fh> ) {
    chomp $row;

    @radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
    @radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/;
    @tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
    @tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/;
    @ids           = $row =~ /<modif (id="\d+")>/;
    @extraids      = $row =~ /<(extra id="\d+")\s\/>/g;

    foreach my $id (@ids) {

        foreach my $match_avant (@tableau_avant) {

            foreach my $match_apres (@tableau_apres) {

                foreach my $radical_avant (@radical_avant) {

                    foreach my $radical_apres (@radical_apres) {

                        if ( $radical_avant eq $radical_apres ) {

                            print OUTPUT "$id\n";

                            foreach my $extraid (@extraids) {
                                print OUTPUT "$extraid\n";
                            }

                            print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n";
                        }
                    }
                }
            }
        }
    }
}

close(OUTPUT);

It Looks like a Kind of XML. why you do not use a XML Parser? — Jens, Jul 06 '16 at 13:00
The only way I know that would have made it possible to do so would be using XSLT 2.0 (which supports regular expressions for example). I don't have access to XSLT 2.0. Plus, my question was more about improving my Perl skills and understanding than about the best language to do what I want. — Azaghal, Jul 06 '16 at 13:12
I might have not understood your comment well: if you're speaking about a Perl module, I'd be happy to know which one would be good and how to use it in my case. — Azaghal, Jul 06 '16 at 13:19
@Baptiste: Pas Jean Baptiste?! Do you know that you can install any CPAN module to any location? I will look at your program immediately, but I think you should use [XML::Twig](https://metacpan.org/pod/XML::Twig) to extract the data that you need. Regular expressions are very much the *wrong* tool — Borodin, Jul 06 '16 at 13:20
Is that really what the input looks like, or is there another tag that surrounds all of the tags you've shown us. Without an enclosing tag, this isn't XML and no XML parser will parse it successfully. — Dave Cross, Jul 06 '16 at 13:22
I can't get your code to produce any output with the XML data that you show. Please would you show a combined example of your code and data that works? — Borodin, Jul 06 '16 at 13:37
@DaveCross: I added the surrounding `modifs` that is indeed here @Borodin: As I said I'm quite new to Perl and try not to use too much modules I don't know / understand. (By the way no it is not Jean Baptiste, only Baptiste ;) ) — Azaghal, Jul 06 '16 at 13:38
@Baptiste: For the document to be valid XML, it would need another tag that goes round all of your tags. A well-formed XML document can have only one top-level element. — Dave Cross, Jul 06 '16 at 13:48
@Borodin I tested and updated my source XML and the output I get. The only things to change are the paths for input and output files in the Perl code. — Azaghal, Jul 06 '16 at 14:08
Can you give some desired output for that input? It looks a lot simpler if you switch to using `XML::Twig` for processing. — Sobrique, Jul 06 '16 at 14:42
@Sobrique: The output I get now (and give in the section "what I want") is quite close to what I want. As I mention in my question, what matters to me is to get, for each tag the corresponding ids, extra id and / content. I would be really glad to see how to use `XML::Twig` in this case. — Azaghal, Jul 06 '16 at 14:47

score 3 · Accepted Answer · edited May 23 '17 at 11:58

Don't use regular expressions to parse XML. It leads to brittle code.

perl has an XML parser, and that would look something like this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

#load the file into the XML parser, as $twig
my $twig = XML::Twig->new->parsefile('your_file.xml'); 

#iterate elements '<modif>' - anywhere in data structure. 
#(That's what // prefix means in xpath)
foreach my $modif ( $twig->get_xpath('//modif') ) {

   #For each modif element, extract the contents of 'avant' and 'apres' and
   #compare them. 
   if (   $modif->first_child_text('avant') =~ m/(er|é)$/
      and $modif->first_child_text('apres') =~ m/(er|é)$/ )
   {
      #from this element, get the 'id' attribute. 
      #<modif id="???">
      print "ID: ", $modif->att('id'), "\n";
      #fetch all the children of <modif> called '<extra>' 
      #use map to fetch the 'id' attributes of all of these. (if more than one)
      print "extra ids", join " ",(map { $_->att('id') } $modif->children('extra')), "\n";
      #fetch content of '<avant>' and '<apres>' nodes. 
      print $modif->first_child_text('avant'), "|||", $modif ->first_child_text('apres'),"\n";
   }
}

I wasn't entirely clear on the comparison you were making, but hopefully this illustrates well enough how to do it?

e.g. perhaps you'd be comparing:

#compare both avant and apres, but only after trimming
#a couple of letters off the end. 
if ( $modif -> first_child_text('avant') =~ s/(er|é)$//r 
  eq $modif -> first_child_text('apres') =~ s/(er|é)$//r  ) {

Would compare both, after dropping those character off the end. (note - the r regex modifier needs a newish version of perl to work, it might error)

From what I can tell, the second option is the one I want, as I want `avant` and `apres` tags content to be exactly the same apart from the ending. (So what comes before er or é must be the same). However I don't understand all of your code, would you mind commenting a bit on it? Especially the print lines, which are a bit obscure to me... — Azaghal, Jul 06 '16 at 15:47
Have added some comments. Is that clearer? I can expand further on anything that's still unclear. — Sobrique, Jul 06 '16 at 15:57
You need to `use utf8` if you have any non-ASCII characters in your code. You also need to `use open qw/ :std :encoding(UTF-8) /` so that your encoded data is printed properly — Borodin, Jul 06 '16 at 16:48
@Sobrique that's perfect for me, good answer and great explanation! — Azaghal, Jul 07 '16 at 07:42

Simplify Perl script to extract information with regex

What my input looks like:

What I want :+

What my script looks like:

1 Answers1