I need to decode a complex XML structure. The XML looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<MainNode comment="foo">
<FirstMainBranch>
<Struct>
<String name="aStringValueUnderMainBranch" comment="Child node under first main branch"/>
<String name="anotherStringValueUnderMainBranch" comment="Child node under first main branch"/>
<Integer name="anIntegerValueUnderMainBranch" comment="Child node under first main branch"/>
<List name="aList" comment="According to me this node should be an array, it could contain one or more child elements">
<Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash.
The node itself doesn't have name attribute, which means that it only shows the type of the element">
<String name="first" comment="
Default Value: 0
"/>
<Long name="second" comment="
Default Value: 0
"/>
<Long name="third" comment="
Default Value: 0
"/>
</Struct>
</List>
<List name="secondList" comment="According to me this node should be array, it could contain one or more child elements">
<Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash.
The node itself doesn't have name attribute, which means that it only shows the type of the element
">
<String name="first" comment="
Default Value: 0
"/>
<Long name="second" comment="
Default Value: 0
"/>
</Struct>
</List>
<Struct name="namedStruct" comment="Here the struct element has a name, which means that it should be decoded
">
<List name="thirdList" comment="Again list, but now it is inside struct element, and it contains struct element
">
<Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash.">
<Integer name="first" comment="Child element of the struct"/>
</Struct>
</List>
</Struct>
</Struct>
</FirstMainBranch>
<SecondMainBranch>
<Struct comment="">
<Struct name="namedStructAgain" comment="
">
<String name="First" comment="
"/>
<String name="Second" comment=""/>
</Struct>
</Struct>
</SecondMainBranch>
</MainNode>
I think the most appropriate container is a hash (if you think otherwise, please let me know). I'm finding it difficult to decode, because:
The main nodes do not have a "name" attribute, but they should still exist in the final structure.
Child nodes should be read only if they have a "name" attribute, but their data type (structure) depends on a parent element that is not itself decoded.
Some of these parent elements do have a "name" attribute; in that case they should exist in the final structure.
I don't care about the Integer, Long, DateTime etc. data types; they will all be read as strings. The main problem is the List and Struct types.
Here is my naive attempt at the task:
use XML::LibXML;
use Data::Dumper;
use strict;
use warnings;
my $parser=XML::LibXML->new();
my $file="c:\\joro\\Data.xml";
my $xmldoc=$parser->parse_file($file);
sub buildHash{
my $mainParentNode=$_[0];
my $mainHash=\%{$_[1]};
my ($waitNextNode,$isArray,$arrayNode);
$waitNextNode=0;
$isArray=0;
sub xmlStructure{
my $parentNode=$_[0];
my $href=\%{$_[1]};
my ($name, %tmp);
my $parentType=$parentNode->nodeName();
$name=$parentNode->findnodes('@name');
foreach my $currentNode($parentNode->findnodes('child::*')){
my $type=$currentNode->nodeName();
if ($type&&$type eq 'List'){
$isArray=1;
}
elsif($type&&$type ne 'List'&&$parentType ne 'List'){
$isArray=0;
$arrayNode=undef;
}
if ($type&&!$currentNode->findnodes('@name')&&$type eq 'Struct'){
$waitNextNode=1;
}
else{
$waitNextNode=0;
}
if ($type&&$type ne 'List'&&$type ne 'Struct'&&!$currentNode->findnodes('@name')){
#$href->{$currentNode->nodeName()}={};
xmlStructure($currentNode,$href->{$currentNode->nodeName()});
}
# elsif ($type&&$type eq 'List'&&$currentNode->findnodes('@name')){
# print "2\n";
# $href->{$currentNode->findnodes('@name')}=[];
# xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
# }
elsif ($type&&$type ne 'List'&&$currentNode->findnodes('@name')&&$parentType eq 'List'){
push(@{$href->{$currentNode->findnodes('@name')}},$currentNode->findnodes('@name'));
xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
}
# elsif ($type&&$type ne 'List'&&!$currentNode->findnodes('@name')&&$parentType eq 'List'){
# print "4\n";
# push(@{$$href->{$currentNode->findnodes('@name')}},{});
##print Dumper %{$arrayNode};
# xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
# }
else{
xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
}
}
}
xmlStructure($mainParentNode,$mainHash);
}
my %href;
buildHash($xmldoc->findnodes('*'),\%href);
print "Printing the real HASH\n";
print Dumper %href;
but there is still a long way to go, because:
1. There is a spurious element (an empty-string key, probably from an undefined name) between each key and its value.
2. I cannot find a way to change a child's data type from hash to array where needed.
Here is the output:
$VAR1 = 'FirstMainBranch';
$VAR2 = {
'' => {
'aList' => {
'' => {
'third' => {},
'second' => {},
'first' => {}
}
},
'namedStruct' => {
'thirdList' => {
'' => {
'first' => {}
}
}
},
'anotherStringValueUnderMainBranch' => {},
'secondList' => {
'' => {
'second' => {},
'first' => {}
}
},
'aStringValueUnderMainBranch' => {},
'anIntegerValueUnderMainBranch' => {}
}
};
$VAR3 = 'SecondMainBranch';
$VAR4 = {
'' => {
'namedStructAgain' => {
'First' => {},
'Second' => {}
}
}
};
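For reference, the rules listed earlier could be sketched as a recursive function that returns a value for each node instead of filling in a reference passed from the caller. This is only a minimal sketch under assumptions, not tested against the real file: the names decode_node and decode_document are hypothetical, leaf values are empty-string placeholders, the XML is a cut-down stand-in for the file above, and each main branch is assumed to hold exactly one unnamed Struct whose contents become the branch's hash.

```perl
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;

# Hypothetical helper: turn one element into a Perl value.
#   List   -> array ref, one entry per child element
#   Struct -> hash ref keyed by the children's "name" attributes
#   other  -> '' (placeholder; real values would come from the data string)
sub decode_node {
    my ($node) = @_;
    my $type     = $node->nodeName;
    my @children = $node->findnodes('*');    # element children only

    if ($type eq 'List') {
        return [ map { decode_node($_) } @children ];
    }
    if ($type eq 'Struct') {
        my %fields;
        for my $child (@children) {
            my $name = $child->getAttribute('name');
            next unless defined $name;       # unnamed children are not keyed
            $fields{$name} = decode_node($child);
        }
        return \%fields;
    }
    return '';                               # String, Long, Integer, ...
}

# Hypothetical helper: the main branches have no "name" attribute, so the
# element name is used as the key and the unnamed Struct below it is merged in.
sub decode_document {
    my ($root) = @_;
    my %result;
    for my $branch ($root->findnodes('*')) {
        my ($struct) = $branch->findnodes('Struct');
        $result{ $branch->nodeName } = $struct ? decode_node($struct) : {};
    }
    return \%result;
}

# Cut-down stand-in for the real file.
my $xml = <<'XML';
<MainNode>
  <FirstMainBranch>
    <Struct>
      <String name="aString"/>
      <List name="aList">
        <Struct><String name="first"/><Long name="second"/></Struct>
      </List>
    </Struct>
  </FirstMainBranch>
</MainNode>
XML

my $doc  = XML::LibXML->load_xml(string => $xml);
my $tree = decode_document($doc->documentElement);

$Data::Dumper::Sortkeys = 1;
print Dumper($tree);
```

Returning a value from each call avoids the empty-string key, because an unnamed Struct is never stored under a name; it simply becomes an element of its parent List or the value of its main branch.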
Any help will be appreciated. Thank you in advance.
Edit: In response to Sobrique's comment about this being an XY problem:
Here is the example string I want to parse:
(1,2,"N/A",-1,"foo","bar",NULL,3,2016-03-18 08:12:00.000,2016-03-18 08:12:00.559,2016-03-18 08:12:00.520,0,0,NULL,"foo","123456789",{NULL,NULL,NULL,NULL,NULL,NULL,2016-04-17 11:59:59.999,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,null,NULL,NULL,NULL,NULL,3,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,T,0,NULL,NULL,NULL,"9876543210",NULL,"foo","0","bar","foo","a1820000264d979c","0,0",NULL,"foo","192.168.1.82","SOAP",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},{INPUT="bar"},{aStringValueUnderMainBranch="ET", aList[{"first", "second", "third"}, {"first", "second", "third"}], secondList[{"first", "second"}, {"first", "second"}],namedStruct{thirdList[{first},{first}]}},{namedStructAgain{"first", "second"}},NULL,NULL,NULL,NULL,NULL)
Somehow I need to separate all the values and then identify this part:
{aStringValueUnderMainBranch="ET", aList[{"first", "second", "third"}, {"first", "second", "third"}], secondList[{"first", "second"}, {"first", "second"}],namedStruct{thirdList[{first},{first}]}}
as FirstMainBranch and parse the corresponding values as shown in the XML. After that I need to identify:
{namedStructAgain{"first", "second"}}
as SecondMainBranch and get the respective values. There is an additional problem with the initial separation of the data: I must not split on commas that are nested inside brackets ({...} or [...]).
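The top-level splitting could be sketched with a simple depth counter, so that a comma only ends a field when it is outside every bracket and outside double quotes. This is a rough sketch under assumptions: split_top_level is a hypothetical name, the record is a shortened stand-in for the string above, and quoted values are assumed not to contain escaped quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas at nesting depth 0 only.
# Commas inside {...}, [...], (...) or "..." stay inside their field.
sub split_top_level {
    my ($text) = @_;
    my @fields;
    my $current   = '';
    my $depth     = 0;
    my $in_quotes = 0;

    for my $ch (split //, $text) {
        if ($ch eq '"') {
            $in_quotes = !$in_quotes;        # assumes no escaped quotes
        }
        elsif (!$in_quotes) {
            if    ($ch eq '{' || $ch eq '[' || $ch eq '(') { $depth++ }
            elsif ($ch eq '}' || $ch eq ']' || $ch eq ')') { $depth-- }
            elsif ($ch eq ',' && $depth == 0) {
                push @fields, $current;      # top-level comma ends the field
                $current = '';
                next;
            }
        }
        $current .= $ch;
    }
    push @fields, $current;                  # last field has no trailing comma

    s/^\s+|\s+$//g for @fields;              # trim surrounding whitespace
    return @fields;
}

# Shortened stand-in for the real record.
my $record = '(1,"0,0",NULL,{a="ET", aList[{"x", "y"}]},{b{"p", "q"}},NULL)';

# Strip the outer parentheses before splitting.
(my $inner = $record) =~ s/^\(|\)$//g;

my @fields = split_top_level($inner);
print "$_\n" for @fields;
```

Each bracketed field comes back as one string, so the same function can be applied again to the inside of a field to go one level deeper.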