
I have XML files as input. Those files contain tags such as:

First Instance:

<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref><sup>&#x2013;</sup><xref   ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>

Second Instance:

<xref ref-type="bibr" rid="perl-ch001-bib009"><sup>9</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib012"><sup>12</sup></xref><sup>,</sup><xref ref-type="bibr" rid="perl-ch001-bib057"><sup>57</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib059"><sup>59</sup></xref>

In the two instances above, the numbers form ranges: 80-82 (where 81 is missing), 9-12 and 57-59. &#x2013; is the entity for the dash (hyphen) between the numbers. I need to copy the entire content of each XML file and add the missing reference numbers at that particular position.

The output should be as follows. For the first instance (i.e. in the following pattern: 80 81-82):

<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref><xref ref-type="bibr" rid="perl-ch006-bib081"><sup>81</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>

For the second instance (i.e. in the following pattern: 9 10 11-12, 57 58-59):

<xref ref-type="bibr" rid="perl-ch001-bib009"><sup>9</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib010"><sup>10</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib011"><sup>11</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib012"><sup>12</sup></xref><sup>,</sup><xref ref-type="bibr" rid="perl-ch001-bib057"><sup>57</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib058"><sup>58</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib059"><sup>59</sup></xref>

All changes must be made in the output files, so that the input files are left untouched.

Code:

#!/usr/bin/perl
use strict;
use Cwd;
use File::Basename;
use File::Copy;

my $path1=getcwd;
opendir(INP, "$path1\/Input");
my @out = grep(/.(xml)$/,readdir(INP));
closedir INP;

foreach my $final(@out)
{
    my $filetobecopied = "Input\/".$final;
    my $newfile = $final;
    copy($filetobecopied, $newfile) or die "File cannot be copied.";
}

opendir DIR, $path1 or die "cant open dir";
my @files = grep /(.*?)\.(xml)$/,(readdir DIR);
closedir DIR;

open(F6, ">Ref.txt");
print F6 "FileName\tMatchedString\tOutput\n";

foreach my $f(@files)
{
    open(F1, "<$f") or die "Cannot open file: $files[0]";
    my $data=join("", <F1>);
    close F1;
    my $xml_list=$data;
    #print F6 $xml_list."\n";
    $xml_list=~s/&#x2013;/-/gs;
    $xml_list=~s/&#x02013;/-/gs;

    while($xml_list=~m/(<xref ref-type="(bibr|bib)" rid="(.*?)-ch(\d+)-(bibr|bib)(\d+)">(<sup>)?(\d+)(<\/sup>)?<\/xref><sup>(-)+<\/sup>)(<xref ref-type="(bibr|bib)" rid="(.*?)-ch(\d+)-bib(\d+)">(<sup>)?(\d+)(<\/sup>)?<\/xref>)/igs)
    {
        my $i;
        my $xref=$1;my $bibr=$2;
        my $rid=$3; my $ch=$4;my $bib=$6;my $hyp=$10;
        my $num=$8;
        my $xref1=$11;
        my $num1=$17;

        if($hyp=~m/(-)/gs)
        {
            my $counter=$num;
            while($counter<=$num1)   #for($counter=$num;$counter<=$num1;$counter++)
            {
                #print F6 "<xref ref-type=\"$bibr\" rid=\"$rid\-ch$ch\-$bibr$counter\"><sup>$counter<\/sup><\/xref>,"."\n";
                $counter++;
            }
        }
    }

    $xml_list=~s/&orb;/\(/g;
    $xml_list=~s/&crb;/\)/g;
    $xml_list=~s/-/&#x2013;/gs;
    $xml_list=~s/-/&#x02013;/gs;

    open(OUT, ">$path1\/Output\/$f");
    print OUT $xml_list;
    close OUT
}
foreach my $del(@files)
{
    unlink $del
}

Any help would be appreciated.

flora
  • Please use an XML parser and not regex, see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 (while that's for (X)HTML it applies for XML too). – Steffen Ullrich Dec 05 '14 at 09:15
  • Is there a reason that you still keep the hyphen for the last part? And listen to @steffen, he is so right. Consider [XML::Twig](https://metacpan.org/pod/XML::Twig), [XML::Parser::Expat](https://metacpan.org/pod/distribution/libxml-enno/lib/XML/Parser/Expat.pod) or [XML::SAX](https://metacpan.org/pod/XML::SAX) – Patrick J. S. Dec 05 '14 at 11:04
  • Thank you. We will definitely have a look at the above link. Can you please explain how to install and use modules, as I am quite new to them? @PatrickJ.S. yes, we need the hyphen in the last part (no reason as such, but that's the requirement). Is there any possibility that we can skip using modules and run the above code with changes? What changes do I need to make in the above code? – flora Dec 05 '14 at 17:15
  • Can someone please help me out with the code? I am badly stuck. – flora Dec 10 '14 at 06:07
  • Can someone please help me out at the earliest with the above code? I tried using substitution but it isn't working. Any new idea would be appreciated. – flora Jan 13 '15 at 06:39

1 Answer


You already got fairly far with your program. What is mainly missing is inserting the missing xref elements at the right positions. The insertion into $xml_list can be done with substr; the offset for the insertion can be obtained from the @LAST_MATCH_END array (written @+ unless you use the English module). The core of your code then becomes:

#$xml_list=~s/&#x2013;/-/gs;    don't do this (gives trouble when changing back)
#$xml_list=~s/&#x02013;/-/gs;   don't do this (gives trouble when changing back)

while ($xml_list=~/(<xref\ ref-type="(bibr?)"\ rid="(.*?)-ch(\d+)-(bibr?)(\d+)">
                       (<sup>)?(\d+)(<\/sup>)?
                    <\/xref>)<sup>(&\#x0?2013;)+<\/sup>
                   (<xref\ +ref-type="(bibr?)"\ rid="(.*?)-ch(\d+)-bib(\d+)">
                       (<sup>)?(\d+)(<\/sup>)?
                    <\/xref>)
                  /igsx)
{
    my $insert=$+[1];   # end of first (<xref.../xref>) submatch; here we insert
    my ($bibr,$rid,$ch,$bib)=($2,$3,$4,$5.$6);
    my $num=$8;
    my $num1=$17;
    my $endpos = pos $xml_list;
    for (my $counter=$num; ++$counter<$num1; )
    {
        ++$bib;     # magic string increment, e.g. "bib080" becomes "bib081"
        my $insertion = "<xref ref-type=\"$bibr\" rid=\"$rid-ch$ch-$bib\">"
                           ."<sup>$counter</sup>"
                       ."</xref>";      # no newline, to match the requested output; inserted into $xml_list at $insert
        substr $xml_list, $insert, 0, $insertion;
        $insert += length $insertion;   # push start of next insert to the right
        $endpos += length $insertion;   # push start of next search to the right
    }
    pos $xml_list = $endpos;    # set start position of next search
}

#$xml_list=~s/-/&#x2013;/gs;    trouble: would also change normal hyphens
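
To try the idea in isolation, here is a minimal, self-contained sketch of the same technique (4-argument substr plus the @+ match-end array), applied to the first instance from the question. The helper name expand_ranges and the sample string are assumptions made for illustration; they are not part of the original program. The sketch inserts each new element without any separator, to match the requested output, and it relies on Perl's magic string increment to turn bib080 into bib081 while keeping the leading zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): for every
# "<xref>N</xref><sup>&#x2013;</sup><xref>M</xref>" range in $xml,
# insert the xref elements for N+1 .. M-1 right after the first xref.
sub expand_ranges {
    my ($xml) = @_;
    while ($xml =~ /(<xref\ +ref-type="(bibr?)"\ +rid="([^"]*?)-ch(\d+)-(bibr?)(\d+)">
                        (?:<sup>)?(\d+)(?:<\/sup>)?
                     <\/xref>)
                    <sup>(?:&\#x0?2013;)+<\/sup>
                    <xref\ +ref-type="bibr?"\ +rid="[^"]*">
                        (?:<sup>)?(\d+)(?:<\/sup>)?
                    <\/xref>
                   /gix)
    {
        my $insert = $+[1];      # offset just past the first <xref>...</xref>
        my $endpos = pos $xml;   # offset just past the whole match
        my ($type, $rid, $ch, $id) = ($2, $3, $4, $5 . $6);
        my ($first, $last) = ($7, $8);

        for my $n ($first + 1 .. $last - 1) {
            ++$id;               # magic string increment: "bib080" -> "bib081"
            my $new = qq{<xref ref-type="$type" rid="$rid-ch$ch-$id">}
                    . qq{<sup>$n</sup></xref>};
            substr($xml, $insert, 0, $new);   # insert without replacing anything
            $insert += length $new;           # next insertion goes after this one
            $endpos += length $new;           # the end of the match moved right too
        }
        pos($xml) = $endpos;     # modifying $xml reset pos, so restore it
    }
    return $xml;
}

# Sample taken from the first instance in the question.
my $sample = '<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref>'
           . '<sup>&#x2013;</sup>'
           . '<xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>';
print expand_ranges($sample), "\n";
```

Because pos is moved past each completed range before the next search, a string holding several ranges (such as the second instance, 9-12 followed by 57-59) is handled in a single pass.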

Armali