I need to decode a complex XML structure. The XML looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
    <MainNode comment="foo">
      <FirstMainBranch>
        <Struct>
          <String name="aStringValueUnderMainBranch" comment="Child node under first main branch"/>
          <String name="anotherStringValueUnderMainBranch" comment="Child node under first main branch"/>
          <Integer name="anIntegerValueUnderMainBranch" comment="Child node under first main branch"/>
          <List name="aList" comment="According to me this node should be an array, it could contain one or more child elements">
            <Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash. 
        The node itself doesn't have name attribute, which means that it only shows the type of the element">
          <String name="first" comment="
            Default Value: 0 
                        "/>
          <Long name="second" comment="
            Default Value: 0 

                          "/>
          <Long name="third" comment="
            Default Value: 0 

                        "/>
        </Struct>
      </List>
      <List name="secondList" comment="According to me this node should be array, it could contain one or more child elements">
        <Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash. 
        The node itself doesn't have name attribute, which means that it only shows the type of the element
                    ">
          <String name="first" comment="
            Default Value: 0 

                          "/>
          <Long name="second" comment="
            Default Value: 0 

                          "/>        
        </Struct>
      </List>
      <Struct name="namedStruct" comment="Here the struct element has a name, which means that it should be decoded
                    ">
        <List name="thirdList" comment="Again list, but now it is inside struct element, and it contains struct element
                ">
          <Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash.">
            <Integer name="first" comment="Child element of the struct"/>
          </Struct>
        </List>

      </Struct>

    </Struct>
  </FirstMainBranch>
  <SecondMainBranch>
    <Struct comment="">
      <Struct name="namedStructAgain" comment="
                ">
        <String name="First" comment="
                  "/>
        <String name="Second" comment=""/>

      </Struct>
    </Struct>
  </SecondMainBranch>
</MainNode>

I think that the most appropriate container is a hash (if you disagree, please let me know). I'm finding it difficult to decode, because:

  1. The main nodes do not have a "name" attribute, but they should exist in the final structure.

  2. Child nodes should be read only if they have a "name" attribute, but their data type (structure) depends on a parent element that is itself not decoded.

  3. Some of these parent elements do have a "name" attribute; in that case they should exist in the final structure.

  4. I don't care about the integer, long, datetime etc. data types; they will all be read as strings. The main problem here is the List and Struct types.
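A minimal sketch of one possible mapping for the four points above, not a definitive implementation. It assumes that a List becomes an array reference, a Struct becomes a hash reference keyed by its children's "name" attribute, every other element is a leaf stored as its type name, and the unnamed Struct directly under a main branch is transparent. The helper names `build_node` and `build_schema` are hypothetical; only the XML::LibXML calls are real.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed mapping (not confirmed by the question):
#   List   -> array ref holding one entry per child element
#   Struct -> hash ref keyed by each child's "name" attribute
#   other  -> leaf, stored as the element name (its type)
sub build_node {
    my ($node) = @_;
    my $type     = $node->nodeName;
    my @children = $node->findnodes('./*');

    if ($type eq 'List') {
        return [ map { build_node($_) } @children ];
    }
    if ($type eq 'Struct') {
        my %fields;
        for my $child (@children) {
            # Fall back to the element name if a child has no "name".
            my $key = $child->getAttribute('name') // $child->nodeName;
            $fields{$key} = build_node($child);
        }
        return \%fields;
    }
    return $type;
}

# Main branches have no "name", so they are keyed by element name.
# The unnamed Struct directly below each branch supplies the branch's hash.
sub build_schema {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my %schema;
    for my $branch ($doc->documentElement->findnodes('./*')) {
        my ($struct) = $branch->findnodes('./Struct');
        next unless $struct;
        $schema{ $branch->nodeName } = build_node($struct);
    }
    return \%schema;
}

my $xml = <<'XML';
<MainNode>
  <FirstMainBranch>
    <Struct>
      <String name="aStringValueUnderMainBranch"/>
      <List name="aList">
        <Struct>
          <String name="first"/>
          <Long name="second"/>
        </Struct>
      </List>
    </Struct>
  </FirstMainBranch>
</MainNode>
XML

my $schema = build_schema($xml);
print ref($schema->{FirstMainBranch}{aList}), "\n";        # ARRAY
print $schema->{FirstMainBranch}{aList}[0]{second}, "\n";  # Long
```

Because the container type is decided from the element name rather than from shared flags, the hash-versus-array choice is made locally at each level of the recursion.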

Here is my clumsy attempt at the task:

use XML::LibXML;
use Data::Dumper;
use strict;
use warnings;
my $parser=XML::LibXML->new();
my $file="c:\\joro\\Data.xml";
my $xmldoc=$parser->parse_file($file);

sub buildHash{
    my $mainParentNode=$_[0];
    my $mainHash=\%{$_[1]};
    my ($waitNextNode,$isArray,$arrayNode);
    $waitNextNode=0;
    $isArray=0;
    sub xmlStructure{
        my $parentNode=$_[0];
        my $href=\%{$_[1]};
        my ($name, %tmp);
        my $parentType=$parentNode->nodeName();
        $name=$parentNode->findnodes('@name');
        foreach my $currentNode($parentNode->findnodes('child::*')){
            my $type=$currentNode->nodeName();
            if ($type&&$type eq 'List'){
                $isArray=1;
            }
            elsif($type&&$type ne 'List'&&$parentType ne 'List'){
                $isArray=0;
                $arrayNode=undef;
            }
            if ($type&&!$currentNode->findnodes('@name')&&$type eq 'Struct'){
                $waitNextNode=1;
            }
            else{
                $waitNextNode=0;
            }
            if ($type&&$type ne 'List'&&$type ne 'Struct'&&!$currentNode->findnodes('@name')){
                #$href->{$currentNode->nodeName()}={};
                xmlStructure($currentNode,$href->{$currentNode->nodeName()});
            }
            # elsif ($type&&$type eq 'List'&&$currentNode->findnodes('@name')){
            # print "2\n";
            # $href->{$currentNode->findnodes('@name')}=[];
            # xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
            # }
            elsif ($type&&$type ne 'List'&&$currentNode->findnodes('@name')&&$parentType eq 'List'){
                push(@{$href->{$currentNode->findnodes('@name')}},$currentNode->findnodes('@name'));
                xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});

            }
            # elsif ($type&&$type ne 'List'&&!$currentNode->findnodes('@name')&&$parentType eq 'List'){
            # print "4\n";
            # push(@{$$href->{$currentNode->findnodes('@name')}},{});
            ##print Dumper %{$arrayNode};
            # xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
            # }
            else{
                xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
            }
        }

    }
    xmlStructure($mainParentNode,$mainHash);
}
my %href;
buildHash($xmldoc->findnodes('*'),\%href);
print "Printing the real HASH\n";
print Dumper %href;

However, there is still a long way to go, because:

  1. There is a spurious element with an empty (probably undefined) key between some keys and their values.

  2. I cannot find a way to change a child's data type from hash to array where needed.

Here is the output:

$VAR1 = 'FirstMainBranch';
$VAR2 = {
          '' => {
                  'aList' => {
                             '' => {
                                     'third' => {},
                                     'second' => {},
                                     'first' => {}
                                   }
                           },
                  'namedStruct' => {
                                   'thirdList' => {
                                                  '' => {
                                                          'first' => {}
                                                        }
                                                }
                                 },
                  'anotherStringValueUnderMainBranch' => {},
                  'secondList' => {
                                  '' => {
                                          'second' => {},
                                          'first' => {}
                                        }
                                },
                  'aStringValueUnderMainBranch' => {},
                  'anIntegerValueUnderMainBranch' => {}
                }
        };
$VAR3 = 'SecondMainBranch';
$VAR4 = {
          '' => {
                  'namedStructAgain' => {
                                        'First' => {},
                                        'Second' => {}
                                      }
                }
        };

Any help will be appreciated. Thank you in advance.

Edit: In response to Sobrique's comment about the XY problem:

Here is the example string I want to parse:

(1,2,"N/A",-1,"foo","bar",NULL,3,2016-03-18 08:12:00.000,2016-03-18 08:12:00.559,2016-03-18 08:12:00.520,0,0,NULL,"foo","123456789",{NULL,NULL,NULL,NULL,NULL,NULL,2016-04-17 11:59:59.999,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,null,NULL,NULL,NULL,NULL,3,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,T,0,NULL,NULL,NULL,"9876543210",NULL,"foo","0","bar","foo","a1820000264d979c","0,0",NULL,"foo","192.168.1.82","SOAP",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},{INPUT="bar"},{aStringValueUnderMainBranch="ET", aList[{"first", "second", "third"}, {"first", "second", "third"}], secondList[{"first", "second"}, {"first", "second"}],namedStruct{thirdList[{first},{first}]}},{namedStructAgain{"first", "second"}},NULL,NULL,NULL,NULL,NULL)

Somehow I need to separate all the values and then identify this part:

{aStringValueUnderMainBranch="ET", aList[{"first", "second", "third"}, {"first", "second", "third"}], secondList[{"first", "second"}, {"first", "second"}],namedStruct{thirdList[{first},{first}]}}

as FirstMainBranch and parse the corresponding values as shown in the XML. After that I need to identify:

{namedStructAgain{"first", "second"}}

as SecondMainBranch and get the respective values. There is an additional problem with the initial separation of the values: commas must not be treated as separators when they appear between parentheses (or, as in the example, between braces or brackets).
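The separation step can be sketched with a simple depth counter instead of a plain `split /,/`. This is only an illustration under stated assumptions: the helper name `split_top_level` is hypothetical, double quotes are assumed never to be escaped inside a value, and the record below is a shortened stand-in for the real string.

```perl
use strict;
use warnings;

# Split $text on commas that are neither nested inside (), {} or []
# nor inside a double-quoted string. Escaped quotes are NOT handled.
sub split_top_level {
    my ($text) = @_;
    my @fields;
    my $current  = '';
    my $depth    = 0;
    my $in_quote = 0;

    for my $ch (split //, $text) {
        if ($ch eq '"') {
            $in_quote = !$in_quote;
        }
        elsif (!$in_quote) {
            if    ($ch =~ /[({\[]/) { $depth++ }
            elsif ($ch =~ /[)}\]]/) { $depth-- }
            elsif ($ch eq ',' && $depth == 0) {
                # Top-level separator: close the current field.
                push @fields, $current;
                $current = '';
                next;
            }
        }
        $current .= $ch;
    }
    push @fields, $current;

    # Trim surrounding whitespace from every field.
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

# Shortened, made-up record in the same style as the real one.
my $record = '(1,"a,b",NULL,{x="ET", aList[{"p", "q"}, {"r"}]},{INPUT="bar"},2016-03-18 08:12:00.000)';

# Strip the outer parentheses before splitting.
(my $body = $record) =~ s/^\((.*)\)$/$1/s;

my @fields = split_top_level($body);
print scalar(@fields), "\n";   # 6
print "$fields[3]\n";          # {x="ET", aList[{"p", "q"}, {"r"}]}
```

Each brace-delimited field can then be passed to the same routine again (after removing its outer braces) to break it into its own members.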

g_e_s_h
  • Sounds like you're trying to recreate XML::Simple (and all of its problems). – ikegami Apr 05 '16 at 13:36
  • I can't completely understand - should I use XML::Simple for this task? – g_e_s_h Apr 05 '16 at 13:42
  • No, I don't think you should be creating this hard-to-navigate structure in the first place. See [Why is XML::Simple “Discouraged”?](http://stackoverflow.com/questions/33267765/why-is-xmlsimple-discouraged/33273488). – ikegami Apr 05 '16 at 13:59
  • This smells like an [`XY Problem`](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) - stop; rethink. What are you trying to accomplish? `XML` is more complicated than can be represented in Perl data structures. But you have OO for that, which is what `XML::LibXML` does. – Sobrique Apr 20 '16 at 20:36
  • Hi, you are probably right. I have edited the post with additional info; maybe there is another approach to this task. I will take the value names from an additional flat XML. These values will always exist in the first "big" string. Some of the nodes in the example XML above are optional, and I have to determine whether a value exists from the text before the '{', '[' or '=' signs. – g_e_s_h Apr 21 '16 at 11:44
  • The reason I suggest this looks like an XY problem is that a key symptom is focussing on intermediate steps. Processing XML is easy. "Converting" it is almost certainly redundant. What is your _end goal_ here? Give us a sample output for that input, and we can almost certainly show you a really easy way to accomplish it. – Sobrique Apr 21 '16 at 14:42
  • Hi, I've already added information about the string I want to parse according to the rules in the XML (you can see it after "Edit:" in the post above). Finally, I want to be able to know the values of the variables, for example: FirstMainBranch{aList}[1]{second}='second' – g_e_s_h Apr 25 '16 at 06:34

1 Answer

I would use a different approach. Instead of converting the XML into a hash, I would map it to objects using XML::Rabbit. I wrote a small article about how to use it with a complete working example.

XML::Rabbit has a series of advantages:

  • Work with simple Moose objects.
  • Define the objects to be obtained in a declarative way, using XPath.
  • Parse / define only what you want. No need to get all the information out of the XML.

If your XML files are small enough to use XPath and a DOM, I've found this method very clean and easy to maintain.

LaintalAy
  • I will use the information collected from the XML to parse a CSV file. Do you think that the approach with XML::Rabbit is appropriate? I'm starting to read the article now :) – g_e_s_h Apr 06 '16 at 06:16
  • I read the article and I don't think that it can solve my problem, because getting the node names is not the problem. The hard part for me is getting the right data structure for the keys in the resulting hash. I want to know whether the current key is an array containing hashes, a hash containing arrays, an array containing simple values, etc. Many thanks for the answer though :) – g_e_s_h Apr 06 '16 at 07:20
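The last comment asks how to tell whether a key holds an array of hashes, a hash of arrays, or simple values. Once a nested structure exists, Perl's built-in `ref` answers that at each level. The sketch below is illustrative only: `describe_shape` is a hypothetical helper, it inspects just the first element of each array, and the sample data is made up.

```perl
use strict;
use warnings;

# Return a short description of a nested structure's shape.
# Only the first element of an array is inspected (assumption:
# all elements of a list share the same shape).
sub describe_shape {
    my ($data) = @_;
    my $kind = ref $data;

    if ($kind eq 'ARRAY') {
        return 'empty array' unless @$data;
        return 'array of ' . describe_shape($data->[0]);
    }
    if ($kind eq 'HASH') {
        my @parts;
        push @parts, "$_: " . describe_shape($data->{$_}) for sort keys %$data;
        return 'hash {' . join(', ', @parts) . '}';
    }
    return 'scalar';
}

my $sample = {
    aList => [ { first => 'String', second => 'Long' } ],
    name  => 'String',
};

print describe_shape($sample), "\n";
# hash {aList: array of hash {first: scalar, second: scalar}, name: scalar}
```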