
I am trying to find a reliable way to transform huge XML files (almost 5 GB) using XSLT.

Here is what I have tried so far:

  1. Using the MSXML Parser 4.0 (SP3) from the command line:

>msxsl.exe myfile.xml mysheet.xslt -o output.xml

This runs out of memory (Code: 0x8007000e) with files bigger than 800MB.

  2. Using Mozilla Firefox or IE, applying the XSLT through a processing instruction:

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="mysheet.xslt" type="text/xsl" ?>
<root>...

The browser crashes after a couple of minutes.

  3. Writing my own XML reader in PHP (5.4.22) on Windows and selecting the elements I need with XPath:

<?php
ini_set('max_execution_time', 0);
ini_set('memory_limit', '-1');

$xml = simplexml_load_file('myfile.xml');
foreach ($xml->xpath('/root/node/attribute[@id="value"]') as $result) {
    ...
}

The Apache server crashes.

Please share your experiences in this area. What about writing a class in Java?

P.S. I don't want to use software like XmlSplit or anything similar!

  • How about any XSLT processor built for 64-bit on a machine with enough RAM? Not many other options, I'd bet: due to the nature of XSLT, most processors are written to load the entire document into a DOM tree. Otherwise, I'd say trash XSLT. You need a different approach that doesn't involve loading the entire document into memory before processing. – Dark Falcon Oct 28 '15 at 19:19
  • What about splitting the file into chunks? –  Oct 28 '15 at 19:22
  • How much RAM is on the system you're using? Have you tried MSXML on a system with at least 8 gigs of RAM? – ebyrob Oct 28 '15 at 19:23
  • @Dark Falcon Which XSLT processor would you recommend? I have 8 GB RAM on my computer and trying to process an XML file of 5 GB doesn't work! Do you know why? – Gunther von Goetzen Sanchez Oct 28 '15 at 19:25
  • No XSLT processor. I would recommend a different technology, such as [STX](http://stx.sourceforge.net/) or a custom program based on a SAX XML parser. I would imagine that what you have tried so far didn't work because 8 GB is not enough. The parsed form will be larger than the raw form, probably by a significant amount. – Dark Falcon Oct 28 '15 at 19:27
  • @Dagon Splitting the file would consume time, and I need to process hundreds of files! In the future they could be even 10 GB. That's not a solution for me. – Gunther von Goetzen Sanchez Oct 28 '15 at 19:28
  • I agree, you don't have a choice: you have to process the XML as a stream. – Mr_Thorynque Oct 28 '15 at 19:29
  • @ebyrob yes MSXML 4.0 with 8 GB RAM – Gunther von Goetzen Sanchez Oct 28 '15 at 19:29
  • http://stackoverflow.com/questions/3101048/xslt-transformation-on-large-xml-files-with-c-sharp is a similar thread which mentions http://saxon.sourceforge.net/ for large-file manipulation. – ebyrob Oct 28 '15 at 19:32
  • @Dark Falcon The STX approach might be what I have been looking for! I'm reading about it now! – Gunther von Goetzen Sanchez Oct 28 '15 at 19:52
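The SAX approach suggested in the comments avoids building a tree at all: the parser fires a callback for each element and discards it immediately, so memory use stays flat regardless of file size. A minimal Java sketch of the idea, using only the JDK's built-in SAX parser — the element and attribute names mirror the XPath `/root/node/attribute[@id="value"]` from the PHP attempt and are placeholders for the real document structure:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamMatcher {

    // Counts <attribute id="value"> elements without ever holding
    // more than the current element in memory.
    public static int countMatches(InputStream in) throws Exception {
        final int[] hits = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Called once per start tag as it streams past; no tree is built.
                if ("attribute".equals(qName) && "value".equals(atts.getValue("id"))) {
                    hits[0]++;
                }
            }
        });
        return hits[0];
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample; for a 5 GB file you would pass a FileInputStream instead.
        String sample = "<root><node><attribute id=\"value\"/>"
                      + "<attribute id=\"other\"/></node></root>";
        int n = countMatches(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        System.out.println("matches: " + n);  // matches: 1
    }
}
```

For a real transformation the handler would write output as it goes instead of counting, but the memory profile stays the same.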

1 Answer


For a 5 GB source document you'll need a streaming processor, and that means XSLT 3.0, which currently has two implementations, Saxon-EE and Exselt. Of course, not all transformations are streamable (sorting is tricky, for example), but if you describe the transformation you want to perform, or give a non-streaming version of it, then I'm sure we can help you to turn it into something that works under streaming.

Michael Kay
  • I've used Saxon to process 2 GB XML files with unbelievable speed results. Forget trying to load it all into memory; it's really not a good approach for large files. Streaming is the way whenever it is possible, which is usually most of the time with a little thought. – CodeCabbie Mar 21 '18 at 16:22
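To make the streaming answer concrete, here is a minimal sketch of what a streamable XSLT 3.0 stylesheet looks like. The match pattern is hypothetical (it echoes the XPath from the question), and whether a given transformation actually passes the processor's streamability analysis depends on what its templates do:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">
  <!-- Declares the default mode streamable: the processor reads the
       input as a stream instead of building a tree in memory -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <!-- Hypothetical rule: copy matching nodes out as they stream past -->
  <xsl:template match="node[@id = 'value']">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```

With Saxon-EE this can be run from the command line, e.g. `java -cp saxon9ee.jar net.sf.saxon.Transform -s:myfile.xml -xsl:mysheet.xslt -o:output.xml` (the jar name varies by version, and streaming requires the EE edition).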