I have a gigantic XML file (around 10 GB) which I need to convert to CSV. The file holds information about numerous customers. The problem is that many customers have extra fields which other customers don't, and some of the fields are repeated. An example of the XML:

<customer>
    <customerID>1</customerID>
    <auc>
        <algoId>0</algoId>
        <kdbId>1</kdbId>
        <acsub>1</acsub>
    </auc>
</customer>

<customer>
    <customerID>2</customerID>
    <auc>
        <algoId>0</algoId>
        <kdbId>1</kdbId>
        <acsub>1</acsub>
        <extraBit>12345</extraBit>
    </auc>
    <auc>
        <algoId>2</algoId>
        <kdbId>3</kdbId>
        <acsub>3</acsub>
        <extraBit>67890</extraBit>
    </auc>
    <customOptions>
        <odboc>0</odboc>
        <odbic>0</odbic>
        <odbr>1</odbr>
        <odboprc>0</odboprc>
        <odbssm>0</odbssm>
    </customOptions>
</customer>

As you can see, the first customer has only one auc block, but the second has two, and its auc blocks also carry an extra tag, extraBit. Now the questions:

  1. I need to process one customer at a time (from each `<customer>` to its closing `</customer>`, and so on), since loading all 10 GB at once would crash the system.

  2. I tried to use XML::Twig in a loop, and when I try to read extraBit for customer 1, the program terminates with an 'undefined value' error:

    print $customer->first_child('extraBit')->text()

    Can't call method "text" on an undefined value at xml-tags.pl line 50.

  3. I want a customer's extra auc values to be output in the CSV file as:

    customerID,algoId,kdbId,acsub,extraBit,algoId2,kdbId2,acsub2,extraBit2

    1,0,1,1,,,,,

    2,0,1,1,12345,2,3,3,67890

Muzammil
  • I somehow got the feeling that you should rather accomplish this with an XSLT processor. Take a look at http://xmlsoft.org/XSLT/xsltproc2.html and read a little about XSLT. Basic processing is quite easy really. http://stackoverflow.com/questions/7294344/convert-xml-to-csv-using-xslt – simbabque Feb 26 '13 at 15:22
  • And please show some more of your Perl code. It's hard to understand what you already have. – simbabque Feb 26 '13 at 15:23
  • @simbabque on the contrary, using XSLT would require parsing the whole 10GB file into memory in one go. Using `XML::Twig`, and in particular the `twig_handlers` mechanism, means you can step through the customers one by one in a streaming fashion. – Ian Roberts Feb 26 '13 at 16:04
  • I was certain it would do the same with XSLT. I haven't used it much though. Thanks for correcting me. :) – simbabque Feb 26 '13 at 16:16

1 Answer

print $customer->first_child('extraBit')->text()

You can avoid the undefined-value error by using first_child_text instead, which is defined to return an empty string if no matching child element can be found.

print $customer->first_child_text('extraBit')

The complete code would be something like

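use strict;
use warnings;
use XML::Twig;

# Call process_customer each time a complete <customer> element has been parsed.
my $t = XML::Twig->new(
  twig_handlers => { customer => \&process_customer });
$t->parsefile('file.xml');

sub process_customer {
  my ($t, $customer) = @_;
  print $customer->first_child_text('customerID');
  # One group of four columns per <auc>; absent fields print as empty strings.
  foreach my $auc ($customer->children('auc')) {
    print ',', $auc->first_child_text('algoId'),
          ',', $auc->first_child_text('kdbId'),
          ',', $auc->first_child_text('acsub'),
          ',', $auc->first_child_text('extraBit');
  }
  print "\n";
  # Free the memory used by everything parsed so far.
  $customer->purge;
}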
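The handler above prints only as many column groups as a customer has auc blocks, so rows differ in width. If the CSV needs the fixed layout from the question, one option is to pad every row to an assumed maximum number of auc blocks. The following is a minimal sketch, not part of the original answer: `csv_row`, `@AUC_FIELDS` and the `$max_aucs` bound are hypothetical names, and it assumes no value contains a comma or quote (otherwise a CSV module would be the safer route).

```perl
use strict;
use warnings;

# Field names taken from the <auc> blocks in the question.
my @AUC_FIELDS = qw(algoId kdbId acsub extraBit);

# Hypothetical helper: build one fixed-width CSV line.
#   $id       - the customerID text
#   $aucs     - array ref of hash refs, one hash per <auc> block
#   $max_aucs - assumed upper bound on <auc> blocks per customer
sub csv_row {
    my ($id, $aucs, $max_aucs) = @_;
    my @cells = ($id);
    for my $i (0 .. $max_aucs - 1) {
        # A missing <auc> block becomes an empty hash, i.e. four empty cells.
        my $auc = $aucs->[$i] || {};
        push @cells, map { defined $auc->{$_} ? $auc->{$_} : '' } @AUC_FIELDS;
    }
    return join(',', @cells);
}

# The two customers from the question, padded to two auc groups each.
print csv_row(1, [ { algoId => 0, kdbId => 1, acsub => 1 } ], 2), "\n";
# prints 1,0,1,1,,,,,
print csv_row(2, [
    { algoId => 0, kdbId => 1, acsub => 1, extraBit => 12345 },
    { algoId => 2, kdbId => 3, acsub => 3, extraBit => 67890 },
], 2), "\n";
# prints 2,0,1,1,12345,2,3,3,67890
```

Inside `process_customer`, each hash could be filled by calling `$auc->first_child_text($field)` for every field of every element returned by `$customer->children('auc')`. The bound passed as `$max_aucs` has to be known or chosen in advance, because the file is streamed and later customers are not yet visible when a row is printed.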
Ian Roberts
  • Thanks @Ian Roberts, it helped a lot. I started using Twig today, so I'm a little perplexed. Your code worked perfectly, but if I have more than one customer block, it fails with `junk after document element at line 10, column 0, byte 158 at \lib\XML\Parser.pm line 187`. Secondly, I edited the XML file contents a bit too (sorry, a little late; please have a look at the XML above). – Muzammil Feb 26 '13 at 17:21
  • @Muzammil The XML still needs to be well formed, which in particular means that it needs a _single_ root element - if your file is a series of `<customer>` elements one after the other with no single root element then you'll have to wrap a root element around the whole file (i.e. add an opening root tag at the top and the matching closing tag at the bottom). – Ian Roberts Feb 26 '13 at 17:25
  • @Muzammil and you may need to guard some of your `children` calls with an appropriate `if($customer->has_children('...'))` to cover cases where they might be absent. – Ian Roberts Feb 26 '13 at 17:34
  • Thanks @Ian, got that right by adding top and bottom tags. Just one more thing: what about nested children, such as `Customer->customOption->customType`? How can we handle those? – Muzammil Feb 26 '13 at 17:40
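
The root-element wrapping discussed in the comments above can be done in a single streaming pass, so the 10 GB file never has to fit in memory. This is a rough sketch under stated assumptions: `wrap_in_root` and the root name `customers` are hypothetical, and the input is assumed to have no XML declaration or DOCTYPE at the top (those would have to stay outside the new root element).

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, enclosed in one root element.
sub wrap_in_root {
    my ($in_path, $out_path, $root) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    print {$out} "<$root>\n";
    # Copy line by line so only one line is held in memory at a time.
    while (my $line = <$in>) {
        print {$out} $line;
    }
    print {$out} "</$root>\n";
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl wrap.pl input.xml wrapped.xml
wrap_in_root($ARGV[0], $ARGV[1], 'customers') if @ARGV == 2;
```

The wrapped file can then be passed to `parsefile` as in the answer; the `customer` handler still fires once per customer because the handler matches on the element name, not on its depth.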