I have to process a huge XML file (>10 GB) and convert it to CSV. I am using XML::Twig.
The file contains data for around 2.6 million customers, each of which has around 100 to 150 fields (depending on the customer's profile).
I store all the values of one subscriber in the hash %customer, and when processing is done I write the values of the hash to a text file in CSV format.
The issue is performance: it takes around 6 to 8 hours to process the file. How can this be reduced?
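For reference, the CSV step described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the column list `@columns` and the helpers `csv_field` and `csv_line` are hypothetical names, and the quoting rule (quote only fields containing a comma, double quote, or line break) is an assumption about the desired output.

```perl
use strict;
use warnings;

# Hypothetical column order; the real list would hold the 100 to 150 field names.
my @columns = qw(id Key COMCBcontrol);

# Quote a field only if it contains a comma, a double quote, or a line break,
# doubling any embedded double quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;    # missing fields become empty cells
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

# Build one CSV line from a customer hash, always in the same column order.
sub csv_line {
    my ($customer) = @_;
    return join(',', map { csv_field($customer->{$_}) } @columns) . "\n";
}

my %customer = (id => 12345, Key => 'N894FE');
my $line = csv_line(\%customer);
print $line;
```

A fixed column list keeps every row aligned even when a customer lacks some of the optional fields.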
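use XML::Twig;

my %customer;    # fields of the customer currently being processed
my $detailed;    # current <detailed> element, shared with the profile subs

my $t = XML::Twig->new(
    twig_handlers => {
        'objects/simple'   => \&simpleProcess,
        'objects/detailed' => \&detailedProcess,
    },
    twig_roots => { objects => 1 },
);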
sub simpleProcess {
    my ($t, $simple) = @_;
    %customer = ();    # reset the hash
    $customer{id}  = $simple->first_child_text('id');
    $customer{Key} = $simple->first_child_text('Key');
}
The detailed tag includes several fields, some of them nested, so I call a separate function for each type of field.
sub detailedProcess {
    my ($t, $detailed1) = @_;
    $detailed = $detailed1;    # make the element visible to the profile subs
    if ($detailed->has_children('profile11')) { profile11(); }
    if ($detailed->has_children('profile12')) { profile12(); }
    if ($detailed->has_children('profile13')) { profile13(); }
}
sub profile11 {
    foreach my $comcb ($detailed->children('profile11')) {
        $customer{COMCBcontrol} = $comcb->first_child_text('ValueID');
    }
}
The same goes for the other functions (value2, value3). I have left them out to keep this simple.
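Since the per-profile functions all follow the same pattern, one possible way to express them is a single table-driven sub that receives the element as an argument instead of reading a global. The sketch below is only an illustration: `%column_for` and `collect_profiles` are hypothetical names, and only the Profile11 to COMCBcontrol pair comes from the code above. It uses the tag spelling of the sample XML (`Profile11`), because XML names are case sensitive.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical mapping from child tag to CSV column.
# Only the first pair is taken from the question; the second is made up.
my %column_for = (
    Profile11 => 'COMCBcontrol',
    Profile21 => 'Profile21Value',
);

# Copy the ValueID of each known profile child into the customer hash.
# first_child returns undef when the child is absent, so no separate
# has_children check is needed.
sub collect_profiles {
    my ($detailed, $customer) = @_;
    for my $tag (keys %column_for) {
        my $child = $detailed->first_child($tag) or next;
        $customer->{ $column_for{$tag} } = $child->first_child_text('ValueID');
    }
    return;
}

# Small demonstration on an in-memory fragment.
my $twig = XML::Twig->new;
$twig->parse('<detailed><Profile11><ValueID>098765</ValueID></Profile11></detailed>');

my %customer;
collect_profiles($twig->root, \%customer);
print "$_=$customer{$_}\n" for sort keys %customer;
```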
<objectProfile>
  <simple>
    <id>12345</id>
    <Key>N894FE</Key>
  </simple>
  <detailed>
    <ntype>single</ntype>
    <SubscriberType>genericSubscriber</SubscriberType>
    <odbssm>0</odbssm>
    <osb1>true</osb1>
    <natcrw>true</natcrw>
    <sr>2</sr>
    <Profile11>
      <ValueID>098765</ValueID>
    </Profile11>
    <Profile21>
      <ValueID>098765</ValueID>
    </Profile21>
    <Profile22>
      <ValueID>098765</ValueID>
    </Profile22>
    <Profile61>
      <ValueID>098765</ValueID>
    </Profile61>
  </detailed>
</objectProfile>
Now the question is: I use foreach for every child, even though the child almost always occurs only once in a customer's profile. Could that cause the delay? Are there any other suggestions for improving performance, such as threading? (I googled and found that threading doesn't help much.)
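To make the foreach question concrete: a loop over a single child is probably not what costs hours. Something that may matter more is that none of the handlers shown releases the elements already parsed, so memory can keep growing with the file. The sketch below is a hedged illustration, not a tested fix for the real data. It assumes each customer is wrapped in an `<objectProfile>` element under a root called `<objects>` here (a guess, since the sample shows a single record), handles a whole record in one handler, reads single children with `first_child`, and calls `purge` once the record is done.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumed layout: a root <objects> wrapping many <objectProfile> records.
my $xml = <<'XML';
<objects>
  <objectProfile>
    <simple><id>12345</id><Key>N894FE</Key></simple>
    <detailed>
      <ntype>single</ntype>
      <Profile11><ValueID>098765</ValueID></Profile11>
    </detailed>
  </objectProfile>
  <objectProfile>
    <simple><id>12346</id><Key>N894FF</Key></simple>
    <detailed><ntype>single</ntype></detailed>
  </objectProfile>
</objects>
XML

my @rows;    # kept only so the sketch can be inspected; a real run would write each row out at once

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per customer, after the closing </objectProfile> tag.
        objectProfile => sub {
            my ($t, $profile) = @_;
            my %customer;

            if (my $simple = $profile->first_child('simple')) {
                $customer{id}  = $simple->first_child_text('id');
                $customer{Key} = $simple->first_child_text('Key');
            }
            if (my $detailed = $profile->first_child('detailed')) {
                # A single lookup replaces the has_children check plus foreach.
                my $p11 = $detailed->first_child('Profile11');
                $customer{COMCBcontrol} = $p11->first_child_text('ValueID') if $p11;
            }

            push @rows, \%customer;
            $t->purge;    # free everything parsed so far
        },
    },
);

$twig->parse($xml);    # for a file on disk, parsefile($path) would be used instead
printf "%s,%s,%s\n", map { $_ // '' } @{$_}{qw(id Key COMCBcontrol)} for @rows;
```

Whether this shortens the run noticeably depends on where the time actually goes, so profiling a slice of the file first (for example with Devel::NYTProf) would be a reasonable check.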