
I have to process a huge XML file (>10 GB) to convert it to CSV. I am using XML::Twig.

The file contains data for around 2.6 million customers, each of which has around 100 to 150 fields (depending on the customer's profile).

I store all the values of one subscriber in the hash %customer, and when processing of that subscriber is done I output the values of the hash to a text file in CSV format.
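
To give an idea of that last step, a minimal sketch of the CSV output might look like this (Text::CSV and the fixed @fields column order are just assumptions for illustration; the real script has far more fields):

use strict;
use warnings;

use Text::CSV;

# Illustrative only: the column list and Text::CSV are assumptions,
# not necessarily what the real script uses
my @fields = qw(id Key COMCBcontrol);

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "cannot use Text::CSV: " . Text::CSV->error_diag;

open my $out, '>', 'customers.csv' or die qq(cannot write "customers.csv": $!);
$csv->print($out, \@fields);                 # header row

# called once per customer, after %customer has been filled in
sub output_customer {
  my %customer = @_;
  $csv->print($out, [ @customer{@fields} ]);
}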

The issue is performance. It takes around 6 to 8 hours to process the whole file. How can that be reduced?

my $t = XML::Twig->new(
  twig_handlers => {
    'objects/simple'   => \&simpleProcess ,
    'objects/detailed' => \&detailedProcess ,
  },
  twig_roots => { objects => 1}
);

sub simpleProcess {
  my ($t, $simple) = @_;

  %customer= (); #reset the hash
  $customer{id}  = $simple->first_child_text('id');
  $customer{Key} = $simple->first_child_text('Key');
}

The detailed tag includes several fields, some of them nested, so I call a separate function each time to collect the different types of fields.

sub detailedProcess {
  my ($t, $detailed1) = @_;

  $detailed = $detailed1;
  if ($detailed->has_children('profile11')){ &profile11();}
  if ($detailed->has_children('profile12')){ &profile12();}
  if ($detailed->has_children('profile13')){ &profile13();}
}
sub profile11 {
  foreach $comcb ($detailed->children('profile11')) {
    $customer{COMCBcontrol} = $comcb->first_child_text('ValueID');
  }
}

The same goes for the other functions (value2, value3, and so on); I am not showing them here to keep things simple.

<objecProfile>
    <simple>
        <id>12345</id>
        <Key>N894FE</Key>
    </simple>
    <detailed>
        <ntype>single</ntype>
        <SubscriberType>genericSubscriber</SubscriberType>
        <odbssm>0</odbssm>
        <osb1>true</osb1>
        <natcrw>true</natcrw>
        <sr>2</sr>
        <Profile11>
            <ValueID>098765</ValueID>
        </Profile11>
        <Profile21>
            <ValueID>098765</ValueID>
        </Profile21>
        <Profile22>
            <ValueID>098765</ValueID>
        </Profile22>
        <Profile61>
            <ValueID>098765</ValueID>
        </Profile61>
    </detailed>
</objectProfile>

Now the question is: I use foreach for every child, even though almost every time the child instance occurs only once in the customer profile. Could that be causing the delay, or are there any other suggestions to improve the performance? Threading, etc.? (I googled and found that threading doesn't help much.)

– Muzammil

2 Answers


I suggest using XML::LibXML::Reader. It is very efficient because it doesn't build an XML tree in memory unless you ask it to, and is based on the excellent LibXML library.

You will have to get used to a different API from XML::Twig, but IMO it is still fairly simple.

This code does exactly what your own code does, and my timings suggested that 10 million records like the one you show will be processed in 30 minutes.

It works by repeatedly scanning for the next <object> element (I wasn't sure if this should be <objecProfile> as your question is inconsistent), copying the node and its descendants to an XML::LibXML::Element object $copy so that the subtree can be accessed, and pulling out the information required into %customer.

use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'objects.xml';

my $reader = XML::LibXML::Reader->new(location => $filename)
        or die qq(cannot read "$filename": $!);

while ($reader->nextElement('object')) {

    my %customer;

    my $copy = $reader->copyCurrentNode(1);

    my ($simple) = $copy->findnodes('simple');
    $customer{id}  = $simple->findvalue('id');
    $customer{Key} = $simple->findvalue('Key');

    my ($detailed) = $copy->findnodes('detailed');
    $customer{COMCBcontrol} = $detailed->findvalue('(Profile11 | Profile12 | Profile13)/ValueID');

    # Do something with %customer
}
– Borodin

First, use DProf or NYTProf to figure out what is slowing your code down. But I think the main work will be inside the XML parser, so in my opinion that alone will not increase the speed greatly.

As another option, I suggest you split (not parse) this XML into pieces (taking care to keep each piece well-formed XML), run one fork per CPU to process each piece independently and produce a file of aggregate values, and then process those files.
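
For example, a minimal sketch of the forking part might look like this, assuming the XML has already been split into separate well-formed chunk files and that a hypothetical process_chunk() parses one chunk and writes its own CSV file (Parallel::ForkManager is used here just to cap the number of simultaneous workers):

use strict;
use warnings;

use Parallel::ForkManager;

my @chunks  = glob 'chunk*.xml';     # hypothetical pre-split, well-formed pieces
my $workers = 4;                     # roughly one worker per CPU

my $pm = Parallel::ForkManager->new($workers);

for my $chunk (@chunks) {
    $pm->start and next;             # parent: move on to the next chunk
    process_chunk($chunk);           # child: parse this chunk, write "$chunk.csv"
    $pm->finish;                     # child exits here
}

$pm->wait_all_children;              # finally, merge the per-chunk CSV files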

Or you can transform this XML into something that can be split up without an XML parser. For example, it seems you only need the id, Key and ValueID fields, so you could remove the "\n" characters from the input file and produce another file with one objectProfile per line, then feed each line to the parser. That would let you process a single file with multiple processes and use all of your CPUs. The string </objectProfile> can probably serve as the record separator, but you need to study the format of your XML to decide.
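
A rough sketch of that record-separator idea, assuming the opening tag really is <objectProfile>, that </objectProfile> never appears inside a record, and that a hypothetical process_record() pulls out the fields you need:

use strict;
use warnings;

use XML::LibXML;

my $filename = 'objects.xml';
open my $fh, '<', $filename or die qq(cannot read "$filename": $!);

local $/ = '</objectProfile>';       # read one customer record per "line"

while ( my $chunk = <$fh> ) {
    # keep only the record itself; skip anything after the last record
    next unless $chunk =~ m{(<objectProfile>.*</objectProfile>)}s;

    my $doc = XML::LibXML->load_xml(string => $1);
    process_record($doc->documentElement);   # hypothetical per-customer work
}

close $fh;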

P.S. Someone will want to downvote me with "parsing XML by yourself is bad" or links to that effect. But sometimes, when you have a heavy load or very large input data, you have a choice: do it in the "lawful" style, or do it in the given time with the given precision. The users/customers do not care how you do it; they want the result.

– Galimov Albert
  • Actually that's what I get too: parsing XML with anything other than a proper parser is no less than a sin. :) But I am thinking of segregating it and treating it more like text. *fingers crossed* Actually splitting it would be a little difficult because of the size, and the same goes for removing the \n. :( On a side note, I am totally familiar with the XML format. Could I segregate it (say, one function keeps reading the file and feeds each whole block to the XML parser as a string) and fork multiple parsers (say 4; I have 4 CPUs)? – Muzammil Mar 02 '13 at 09:21
  • I once got a Perl program to run 100 times faster by converting it to use a real XML parser rather than trying to do the parsing "by hand". – Michael Kay Mar 02 '13 at 09:22
  • @MichaelKay, any suggestion for a real parser? It would be much appreciated. I am using XML::Twig, and it is supposed to be pretty good. – Muzammil Mar 02 '13 at 09:25
  • @Muzammil You can divide the file length by 4 to get 4 chunks. Then, for each chunk, read the file at the offset where the chunk ends and find the closing </objectProfile> tag; adjust that chunk's end to it, and the next chunk's start too. Now you have almost-consistent XML in each chunk, so you can process each one in a separate fork and run about 4 times faster. The problems are the I/O bottleneck and, possibly, parsing failures. – Galimov Albert Mar 02 '13 at 09:30
  • @PSIAlt, got it. I guess I'll do some tweaking for the split and get it right. OK, a little two-liner for executing the forks would be helpful; I am having difficulty forking. Let's say we have to do 4 forks. – Muzammil Mar 02 '13 at 10:22
  • @Muzammil Here is an example script I used to process >40 GB of logs: https://gist.github.com/PSIAlt/498a475b5bb6126a52dd – Galimov Albert Mar 02 '13 at 10:43
  • @Muzammil: Take a look at my answer using `XML::LibXML::Reader` which seems to be very much faster than your experience with `XML::Twig`. I have tried both modules, and got a time of 5.5 hours to parse 10 million records like yours using `XML::Twig`, but using `XML::LibXML::Reader` brought it down to 30 minutes. – Borodin Mar 02 '13 at 18:48