How can I concatenate multiple XML files from different directories into a single XML file using Perl?

1 Answer

I've had to make quite a lot of assumptions to do this, but here's my answer:

#!/usr/bin/perl -w

use strict;
use XML::LibXML;

my $output_doc = XML::LibXML->load_xml( string => <<EOF);
<?xml version="1.0" ?>
<issu-meta xmlns="ver2">
 <metadescription>
       <num-objects xml:id='total'/>
 </metadescription>
 <compatibility>
      <baseline> 6.2.1.2.43 </baseline>
 </compatibility>
</issu-meta> 

EOF

my $object_count = 0;

foreach (@ARGV) {
  my $input_doc = XML::LibXML->load_xml( location => $_ );
  foreach ($input_doc->findnodes('/*[local-name()="issu-meta"]/*[local-name()="basictype"]')) {  # find each object
    my $object = $output_doc->importNode($_, 1);  # import the object information into the output document
    $output_doc->documentElement->appendChild($object);  # append the new XML nodes to the output document root
    $object_count++;  # keep track of how many objects we've seen
  }
}

my $total = $output_doc->getElementById('total');  # find the element which will contain the object count
$total->appendChild($output_doc->createTextNode($object_count));  # append the object count to that element
$total->removeAttribute('xml:id');  # remove the XML id, as it's not wanted in the output

print $output_doc->toString;  # output the final document

Firstly, the <comp> element seems to come from nowhere, so I've had to ignore that. I'm also assuming that the required output content before each of the <basictype> elements is always going to be the same, except for the object count.

So I build an empty output document to start with, and then iterate over each filename provided on the commandline. For each, I find each object and copy it into the output file. Once I've done all the input files, I insert the object count.

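It's made more difficult by the default namespace declared with xmlns in the files. In XPath 1.0 an unprefixed name only matches elements that are in no namespace, so the search expression has to match on local-name() instead, which makes it more complicated than it would otherwise be. If possible, I'd be tempted to remove the xmlns attributes, and you'd be left with: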

foreach ($input_doc->findnodes('/issu-meta/basictype')) {

which is a lot simpler.
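
If the xmlns attributes can't be removed, another option is to register a prefix for the namespace and use it in the XPath expression. The following is only a minimal sketch: it assumes the input files use the same namespace URI as the output template ("ver2"), and the helper name find_basictypes, the prefix "v" and the inline sample document are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: return the <basictype> children of the root
# <issu-meta> element, matching by namespace instead of local-name().
sub find_basictypes {
    my ($doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( v => 'ver2' );  # assumption: inputs use the "ver2" namespace
    return $xpc->findnodes('/v:issu-meta/v:basictype');
}

# A small inline sample standing in for one of the input files
my $sample = XML::LibXML->load_xml( string => <<'EOF' );
<issu-meta xmlns="ver2">
  <basictype><id> 1 </id></basictype>
  <basictype><id> 4 </id></basictype>
</issu-meta>
EOF

my @found = find_basictypes($sample);
print scalar(@found), " basictype element(s) found\n";
```

The prefix only exists inside the XPath context, so the documents themselves don't need to change.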

So, when I run this:

perl combine abc/a.xml xyz/b.xml

I get:

<?xml version="1.0"?>
<issu-meta xmlns="ver2">
 <metadescription>
       <num-objects>3</num-objects>
 </metadescription>
 <compatibility>
      <baseline> 6.2.1.2.43 </baseline>
 </compatibility>
<basictype>
       <id> 1 </id>
       <name> pointer </name>
       <pointer/>
       <size> 64 </size>
</basictype><basictype>
     <id> 4 </id>
     <name> int32_t </name>
     <primitive/>
     <size> 32 </size>
 </basictype><basictype>
      <id> 2 </id>
      <name> int8_t </name>
      <primitive/>
      <size> 8 </size>
</basictype></issu-meta>

which is pretty close to what you're after.

Edit: OK, my answer now looks like this:

#!/usr/bin/perl -w

use strict;
use XML::LibXML qw( :libxml );  # load LibXML support and include node type definitions

my $output_doc = XML::LibXML->load_xml( string => <<EOF);  # create an empty output document
<?xml version="1.0" ?>
<issu-meta xmlns="ver2">
 <metadescription>
       <num-objects xml:id='total'/>
 </metadescription>
 <compatibility>
      <baseline> 6.2.1.2.43 </baseline>
 </compatibility>
</issu-meta> 

EOF

my $object_count = 0;

foreach (@ARGV) {
  my $input_doc = XML::LibXML->load_xml( location => $_ );

  my $import_started = 0;
  foreach ($input_doc->documentElement->childNodes) {
    next unless $_->nodeType == XML_ELEMENT_NODE;  # if it's not an element, ignore it

    if ($_->localName eq 'compatibility') {  # if it's the "compatibility" element, ...
      $import_started = 1;  # ... switch on importing ...
      next;  # ... and move to the next child of the root
    }

    next unless $import_started;  # if we've not started importing, and it's
                                  #   not the "compatibility" element, simply
                                  #   ignore it and move on

    my $object = $output_doc->importNode($_, 1);  # import the object information into the output document
    $output_doc->documentElement->appendChild($object);  # append the new XML nodes to the output document root
    $object_count++;  # keep track of how many objects we've seen
  }
}

my $total = $output_doc->getElementById('total');  # find the element which will contain the object count
$total->appendChild($output_doc->createTextNode($object_count));  # append the object count to that element
$total->removeAttribute('xml:id');  # remove the XML id, as it's not wanted in the output

print $output_doc->toString;  # output the final document

which simply imports every element that is a child of the root <issu-meta> element and appears after the first <compatibility> element it finds, and, as before, updates the object count. If I've understood your requirement, that should do it.
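
For comparison, the same selection can probably be expressed as a single XPath query using the following-sibling axis, instead of the $import_started flag. This is a sketch rather than a drop-in replacement: the helper name nodes_after_compatibility and the inline sample document are hypothetical, and it matches on local-name() so that it doesn't depend on the namespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return every element that follows the first
# <compatibility> child of the root, skipping any later <compatibility>
# elements, which mirrors what the loop above does.
sub nodes_after_compatibility {
    my ($doc) = @_;
    return $doc->findnodes(
        '/*/*[local-name()="compatibility"][1]'
      . '/following-sibling::*[local-name()!="compatibility"]'
    );
}

# A small inline sample standing in for one of the input files
my $sample = XML::LibXML->load_xml( string => <<'EOF' );
<issu-meta xmlns="ver2">
  <metadescription><num-objects> 2 </num-objects></metadescription>
  <compatibility><baseline> 6.2.1.2.43 </baseline></compatibility>
  <basictype><id> 1 </id></basictype>
  <enum><id> 1835009 </id></enum>
</issu-meta>
EOF

my @nodes = nodes_after_compatibility($sample);
print join( ', ', map { $_->localName } @nodes ), "\n";
```

The explicit loop is easier to step through while learning, so treat this as an alternative to compare against, not as a correction.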

If it works, I strongly suggest you work through both this answer and my earlier one to make sure you understand why it works for your problem. There are lots of useful techniques in here, and once you understand them, you will have learned a lot about some of the ways you can manipulate XML. If you have any problems, ask a new question on this site. Have fun!

Edit #2: Right, this should be the last piece you need:

#!/usr/bin/perl -w

use strict;
use XML::LibXML qw( :libxml );  # load LibXML support and include node type definitions

my @input_files = (
                    'abc/a.xml',
                    'xyz/b.xml',
                  );
my $output_file = 'output.xml';

my $output_doc = XML::LibXML->load_xml( string => <<EOF);  # create an empty output document
<?xml version="1.0" ?>
<issu-meta xmlns="ver2">
 <metadescription>
       <num-objects xml:id='total'/>
 </metadescription>
 <compatibility>
      <baseline> 6.2.1.2.43 </baseline>
 </compatibility>
</issu-meta> 

EOF

my $object_count = 0;

foreach (@input_files) {
  my $input_doc = XML::LibXML->load_xml( location => $_ );

  my $import_started = 0;
  foreach ($input_doc->documentElement->childNodes) {
    next unless $_->nodeType == XML_ELEMENT_NODE;  # if it's not an element, ignore it

    if ($_->localName eq 'compatibility') {  # if it's the "compatibility" element, ...
      $import_started = 1;  # ... switch on importing ...
      next;  # ... and move to the next child of the root
    }

    next unless $import_started;  # if we've not started importing, and it's
                                  #   not the "compatibility" element, simply
                                  #   ignore it and move on

    my $object = $output_doc->importNode($_, 1);  # import the object information into the output document
    $output_doc->documentElement->appendChild($object);  # append the new XML nodes to the output document root
    $object_count++;  # keep track of how many objects we've seen
  }
}

my $total = $output_doc->getElementById('total');  # find the element which will contain the object count
$total->appendChild($output_doc->createTextNode($object_count));  # append the object count to that element
$total->removeAttribute('xml:id');  # remove the XML id, as it's not wanted in the output

$output_doc->toFile($output_file, 1);  # output the final document

After running it like this: "perl combine", the file output.xml is created with the following contents:

<?xml version="1.0"?>
<issu-meta xmlns="ver2">
 <metadescription>
       <num-objects>7</num-objects>
 </metadescription>
 <compatibility>
      <baseline> 6.2.1.2.43 </baseline>
 </compatibility>
<basictype>
       <id> 1 </id>
       <name> pointer </name>
       <pointer/>
       <size> 64 </size>
</basictype><basictype>
     <id> 4 </id>
     <name> int32_t </name>
     <primitive/>
     <size> 32 </size>
 </basictype><enum>
      <id>1835009 </id>
      <name> chkpt_state_t </name>
      <label>
           <name> CHKP_STATE_PENDING </name>
      <value> 1 </value>
      </label>
  </enum><struct>
         <id> 1835010 </id>
          <name> _ipcEndpoint </name>
          <size> 64 </size>
          <elem>
              <id> 0 </id>
              <name> ep_addr </name>
              <type> uint32_t </type>
              <type-id> 8 </type-id>
              <size> 32 </size>
             <offset> 0 </offset>
         </elem>
   </struct><basictype>
      <id> 2 </id>
      <name> int8_t </name>
      <primitive/>
      <size> 8 </size>
</basictype><alias>
     <id> 1835012 </id>
     <name> Endpoint </name>
     <size> 64 </size>
     <type> _ipcEndpoint </type>
     <type-id> 1835010 </type-id>
</alias><bitmask>
      <id> 1835015 </id>
      <name> ipc_flag_t </name>
      <size> 8 </size>
      <type> uint8_t </type>
      <type-id> 6 </type-id>
      <label>
           <name> IPC_APPLICATION_REGISTER_MSG </name>
           <value> 1 </value>
      </label>
 </bitmask></issu-meta>

Last tip: although it makes almost no difference to the XML, it's a little more human-readable once it's been run through xmltidy:

<?xml version="1.0"?>
<issu-meta xmlns="ver2">
  <metadescription>
    <num-objects>7</num-objects>
  </metadescription>
  <compatibility>
    <baseline> 6.2.1.2.43 </baseline>
  </compatibility>
  <basictype>
    <id> 1 </id>
    <name> pointer </name>
    <pointer/>
    <size> 64 </size>
  </basictype>
  <basictype>
    <id> 4 </id>
    <name> int32_t </name>
    <primitive/>
    <size> 32 </size>
  </basictype>
  <enum>
    <id>1835009 </id>
    <name> chkpt_state_t </name>
    <label>
      <name> CHKP_STATE_PENDING </name>
      <value> 1 </value>
    </label>
  </enum>
  <struct>
    <id> 1835010 </id>
    <name> _ipcEndpoint </name>
    <size> 64 </size>
    <elem>
      <id> 0 </id>
      <name> ep_addr </name>
      <type> uint32_t </type>
      <type-id> 8 </type-id>
      <size> 32 </size>
      <offset> 0 </offset>
    </elem>
  </struct>
  <basictype>
    <id> 2 </id>
    <name> int8_t </name>
    <primitive/>
    <size> 8 </size>
  </basictype>
  <alias>
    <id> 1835012 </id>
    <name> Endpoint </name>
    <size> 64 </size>
    <type> _ipcEndpoint </type>
    <type-id> 1835010 </type-id>
  </alias>
  <bitmask>
    <id> 1835015 </id>
    <name> ipc_flag_t </name>
    <size> 8 </size>
    <type> uint8_t </type>
    <type-id> 6 </type-id>
    <label>
      <name> IPC_APPLICATION_REGISTER_MSG </name>
      <value> 1 </value>
    </label>
  </bitmask>
</issu-meta>
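
If xmltidy isn't to hand, something similar can probably be done from within XML::LibXML itself: re-parse the document with the no_blanks parser option, so that whitespace-only text nodes are dropped, and then serialise it with formatting switched on. This is a rough sketch, not what was used to produce the output above; the helper name tidy_xml and the inline sample are made up for illustration, and the exact indentation is left to libxml2.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: re-parse an XML string without its whitespace-only
# text nodes, then serialise it with formatting (indentation) enabled.
sub tidy_xml {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml, no_blanks => 1 );
    return $doc->toString(1);  # 1 = let libxml2 add line breaks and indentation
}

# A small, unevenly indented sample in the style of the combined output
my $untidy = <<'EOF';
<issu-meta xmlns="ver2"><basictype>
       <id> 1 </id>
</basictype><basictype>
     <id> 4 </id>
 </basictype></issu-meta>
EOF

print tidy_xml($untidy);
```

Text inside elements such as <id> is left untouched, because only whitespace-only nodes are removed.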

Good luck working through this and taking it further. Do come back to this site to ask more questions when they come up!

Tim
  • Thank you very much for your help, Tim. It's very close to my requirement. – Seshagiri Lekkala Sep 12 '14 at 18:34
  • Apart from tag, i have many different type of tags available in my actual data. Till the tag every thing is same for all XML files. So i should be copy all tags starting below tag to till end. can you please help me to do the same. – Seshagiri Lekkala Sep 12 '14 at 18:46
  • I have updated input XML files and output XML. Can you please refer the same. – Seshagiri Lekkala Sep 12 '14 at 19:43
  • Thanks a lot for your time. It's working exactly as I expected. Could you please incorporate the following into your solution: 1) Put the input file names into an array and read them from there, instead of from the console. 2) Write the combined XML content to a new XML file instead of printing it to the console. – Seshagiri Lekkala Sep 12 '14 at 23:49
  • I wish I could vote up your answer, but unfortunately I don't have enough reputation to do so. A minimum of 15 reputation is required. – Seshagiri Lekkala Sep 14 '14 at 18:29
  • No problem - I think this should put you on the right track now! – Tim Sep 14 '14 at 21:06
  • Hi Tim, one final request for help, please. I am getting "I/O error : Unknown IO error" when I run this program. It looks like the file operation is failing at $output_doc->toFile($output_file, 1). Could you please let me know if you have any idea why? – Seshagiri Lekkala Sep 15 '14 at 21:48
  • This could be one of a whole load of things, and really is a new question. I strongly recommend that you start a new question on this site and describe the problem you're having. You can link back to this question and answer to cover some of the detail. This way you should get more attention from people who understand what's going on. In the new question you should mention that you're using Perl and XML::LibXML. Maybe see you at the new one. – Tim Sep 16 '14 at 06:23
  • Thank you. I have initiated new question – Seshagiri Lekkala Sep 16 '14 at 17:02