How to extract content between XML tags?

Question

I have XML data like the sample below. I want to extract the content of ce:affiliation and sa:affiliation into two variables, turn the sa:affiliation content into plain text formatted like the ce:affiliation text, and then compare the two strings.

<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation><ce:affiliation id="aff2"><ce:label>b</ce:label><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn><ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation>


#!/usr/bin/perl
@files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach $file (@files) {
    open(FILE, "$file");
    $a = 1;
    while (my $line = <FILE>) {
        do {
            if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation><\/ce:affiliation>/) {
                $count  = $3;
                $textfn = $2;
                print("$count\n");
                print("$textfn\n");
                if ($count =~ /<\/sa:(.+?)>/) {
                    $count =~ s/<\/sa:organization>/, /g;
                    $count =~ s/<\/sa:city>/, /g;
                    $count =~ s/<\/sa:country>/, /g;
                    $count =~ s/<\/sa:state>/, /g;
                    $count =~ s/<sa:organization>//g;
                    $count =~ s/<sa:city>//g;
                    $count =~ s/<sa:country>//g;
                    $count =~ s/<sa:state>//g;
                    chop($count);
                    chop($count);
                    if ($count ne $textfn) {
                        print $out("$file affilliation $a is mismatch\n");
                    }
                }
            }
            else {
                if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><\/ce:affiliation>/) {
                    print $out("$file sa:affilliation missing for $a\n");
                }
            }
            $a = $a + 1;
        } while ($line =~ /aff$a/);
    }
}

This code fails when a ce:affiliation element does not contain <ce:label> and <sa:affiliation>, as in this example:

<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation><ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn></ce:affiliation><ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation><ce:correspondence id="cor1"></article>

You should look at [`XML::Parser`](https://metacpan.org/pod/XML::Parser), [`XML::LibXML`](https://metacpan.org/pod/XML::LibXML), or an equivalent. It's not a great idea to parse XML via regex; see [`why-is-it-such-a-bad-idea-to-parse-xml-with-regex`](http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex) — chrsblck, Dec 25 '13 at 05:35

score 3 · Answer 1 · answered Dec 25 '13 at 06:31

Please don't use regexes to parse XML. It will work in simple cases, and tchrist demonstrated that you can make it work in the general case - although really you're writing your own XML parser around the regexes at that point - but it's so much easier to just use a library that was written for the purpose.
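To make that caveat concrete, here is a small sketch of how a pattern shaped like the one in the question can go wrong. The sample string is made up and shortened for illustration (it is not the asker's real data): aff2 has neither a label nor an sa:affiliation child, while aff3 has both.

use strict;
use warnings;

# Made-up sample: aff2 has no <ce:label> and no <sa:affiliation>; aff3 has both.
my $xml = '<ce:affiliation id="aff2"><ce:textfn>Center A</ce:textfn></ce:affiliation>'
        . '<ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Dept B</ce:textfn>'
        . '<sa:affiliation><sa:city>Stanford</sa:city></sa:affiliation></ce:affiliation>';

# Same shape as the question's pattern, looking for affiliation number 2.
my $n = 2;
my ($skipped, $textfn, $sa) = $xml =~
    m{<ce:affiliation id="aff$n">(.+?)<ce:textfn>(.+?)</ce:textfn><sa:affiliation>(.+?)</sa:affiliation></ce:affiliation>};

# The non-greedy groups are still free to run past </ce:affiliation>,
# so the "aff2" match silently picks up the text that belongs to aff3.
print "aff$n textfn: $textfn\n";   # prints "aff2 textfn: Dept B"

The regex has no notion of where one element ends, so it reports a clean match for aff2 using data from the next element. A parser that builds a tree does not have this problem.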

Example:

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_file('output.xml');
my @badnodes;
foreach my $affil ($doc->findnodes("//*[name()='ce:affiliation']")) {
   # Record an affiliation that lacks either child, then continue with the
   # next one ("next" rather than "last", so later elements are still checked).
   push(@badnodes, $affil), next unless $affil->findnodes("*[name()='ce:label']");
   push(@badnodes, $affil), next unless $affil->findnodes("*[name()='sa:affiliation']");
}
print "Found ", scalar(@badnodes), " bad affiliation elements, with these IDs:\n";
print "\t", join("\n\t", map { $_->getAttribute('id') } @badnodes), "\n";

If I wrap a document element around your first example, and add the missing closing tag on the second affiliation element, I get this output:

Found 1 bad affiliation elements, with these IDs:
    aff2
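
The example above only finds affiliations with missing children. For the comparison the question asks about, a rough sketch along the same lines might look like the following. The `<root>` wrapper, the placeholder namespace URIs, and the shortened sample data are assumptions made for illustration; the real documents will declare their own namespaces.

use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: a made-up <root> wrapper declaring the ce: and sa:
# prefixes with placeholder URIs (not the real namespace URIs).
my $xml = <<'XML';
<root xmlns:ce="urn:example:ce" xmlns:sa="urn:example:sa">
<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation>
<ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, Los Angeles, California</ce:textfn></ce:affiliation>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

my @report;
foreach my $affil ($doc->findnodes("//*[name()='ce:affiliation']")) {
    my $id = $affil->getAttribute('id');
    my ($textfn) = $affil->findnodes("*[name()='ce:textfn']");
    my ($sa)     = $affil->findnodes("*[name()='sa:affiliation']");

    if (!$sa) {
        push @report, "$id: sa:affiliation missing";
        next;
    }

    # Join the text of every child of sa:affiliation, in document order.
    my $joined = join ', ', map { $_->textContent } $sa->findnodes('*');
    my $text   = $textfn ? $textfn->textContent : '';
    push @report, "$id: mismatch" if $joined ne $text;
}
print "$_\n" for @report;   # prints "aff2: sa:affiliation missing"

Joining the child text with ", " replaces the chain of substitutions and the two chop calls in the question, and it does not depend on which sa: child elements happen to be present.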

Is there any way to parse these affiliations without XML::LibXML? That library is not available on my office system. Can you suggest an alternative method? I also tried my program with the closing tag added to the 2nd affiliation, but the results are still wrong. — Kathir .K, Dec 25 '13 at 07:12
It's a CPAN module, which you should be able to install yourself if it's not there already. See [the instructions](http://www.cpan.org/modules/INSTALL.html). — Mark Reed, Dec 25 '13 at 07:19
