Edit: My (incomplete and very rough) XmlLite header translation is available on GitHub

What is the best way to do a simple combine of massive XML documents in Delphi with MSXML without using DOM? Should I use the COM components SAXReader and XMLWriter and are there any good examples?

The transformation is a simple combination of all the Contents elements from the root (Container) from many big files (60MB+) to one huge file (~1GB).

<Container>
    <Contents />
    <Contents />
    <Contents />
</Container>

I have it working in the following C# code using an XmlWriter and XmlReaders, but it needs to happen in a native Delphi process:

var files = new string[] { @"c:\bigFile1.xml", @"c:\bigFile2.xml", @"c:\bigFile3.xml", @"c:\bigFile4.xml", @"c:\bigFile5.xml", @"c:\bigFile6.xml" };

using (var writer = XmlWriter.Create(@"c:\HugeOutput.xml", new XmlWriterSettings{ Indent = true }))
{
    writer.WriteStartElement("Container");

    foreach (var inputFile in files)
        using (var reader = XmlReader.Create(inputFile))
        {
            reader.MoveToContent();
            while (reader.Read())
                if (reader.IsStartElement("Contents"))
                    writer.WriteNode(reader, true);
        }

    writer.WriteEndElement(); //End the Container element
}

We already use MSXML DOM in other parts of the system and I do not want to add new components if possible.

carlmon
  • So you want to use SAX to avoid consuming a few gigs of RAM? Does this SAX-with-MSXML demo help? http://keith-wood.name/DelphiXML/BookCode/Chapter%2013/index.html – Warren P Aug 04 '11 at 14:22
  • Yes, Delphi compiles 32-bit only and the DOM-based TXMLDocument wrapper for MSXML chokes with EOutOfMemory when documents reach ~100MB. – carlmon Aug 04 '11 at 14:27
  • My opinion is drop MSXML completely, and go with OmniXML. :-) You should be able to load a 1 gig XML file into a 32 bit process, in any sanely designed XML engine. – Warren P Aug 04 '11 at 14:30
  • This is a big enterprise system and we already use MSXML. Adding/switching components is a whole new problem ITO dependencies, testing, and training... That is if I can convince our architect to buy in. – carlmon Aug 04 '11 at 15:35
  • I've always preferred to build a working solution and then later let the people who think they are in control of this find a way to rationalize the fact that the crap we had sucked, and the new stuff is boss, and then rewrite their internal bikeshed documentation to match reality. Enterprise = Lots of panties in a knot over how bad it would be if anything bad happens. :-) – Warren P Aug 04 '11 at 18:28
  • @warren SAX is the way to go for large data. DOM blows chunks for large data in 32 bit address space. – David Heffernan Aug 04 '11 at 20:49
  • I tried OmniXML, but it also chokes very quickly. – carlmon Aug 05 '11 at 07:16
  • Okay, I hope you can find some stable SAX code. I would have thought MSXML SAX would be just as broken as MSXML (and I'm guessing it is?) – Warren P Aug 05 '11 at 11:46
  • Updated XMLLite declarations: https://github.com/the-Arioch/Delphi-XmlLite/commit/1713b1cb33fe8965f1b4e009255365ba22e24dac – Arioch 'The Oct 04 '16 at 11:43
  • I don't know if kluug's semi-commercial OXML would do better, but he does not answer emails, so it is not an option anyway. OmniXML is problematic for somewhat large files (I added a pseudo-answer below). For small XML files I usually use the SuperObject lib; it is easy to use when you're feeling lazy :-) – Arioch 'The Oct 04 '16 at 12:20

4 Answers

XmlLite is a native C++ port of the XML reader and writer from System.Xml, and it provides the same pull-parsing programming model. It ships in the box with W2K3 SP2, WinXP SP3 and above. You'll need a Delphi header translation; once you have one, the C# code maps almost 1:1 to Delphi.

Samuel Zhang
  • The Delphi/Object Pascal persistence framework tiOPF (http://wiki.freepascal.org/tiOPF) supports XmlLite, so I guess this open source project already includes the header translations – mjn Aug 07 '11 at 08:59
  • Thanks Samuel, MS XmlLite works well! tiOPF seems to have something else called XmlLite (or I could not find the unit), so I wrote my own header translation for the bits I needed. – carlmon Aug 11 '11 at 15:12
  • @carlmon: maybe you could share your header translation? – jpfollenius Sep 16 '13 at 12:46
  • @Smasher It is very rough, but I created a repo: https://github.com/GenasysTechnologies/Delphi-XmlLite – carlmon Oct 30 '13 at 05:33
  • @carlmon I fixed some declarations there; hopefully it is Win64-ready now. I am also thinking of dropping support for pre-2010 Delphi and pre-2.6.0 FPC. See the comments at https://github.com/the-Arioch/Delphi-XmlLite/commit/1713b1cb33fe8965f1b4e009255365ba22e24dac – Arioch 'The Oct 04 '16 at 11:41

I'd just use regular file I/O to writeln a <Container> to a text file, writeln each of the contents as a string, and finally writeln a </Container>. If you had a more reasonable size, I'd assemble everything in a stringlist and then stream that to disk. But if you're into GB territory, that would be risky.
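
A minimal sketch of this plain-text approach, assuming the Contents fragments are already available as strings and already in the encoding the output file should have. The file name, the loop count, and the placeholder fragment are hypothetical; pulling the real fragments out of the input files (and skipping the header element mentioned in the comments) is not shown.

```pascal
program MergeSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  OutputName = 'HugeOutput.xml'; // hypothetical output path

var
  OutFile: TextFile;
  I: Integer;
begin
  AssignFile(OutFile, OutputName);
  Rewrite(OutFile); // create or truncate the output file
  try
    // Open the root element once.
    Writeln(OutFile, '<Container>');

    // Write each fragment as soon as it is available, so that
    // nothing accumulates in memory. The literal below is a
    // placeholder standing in for a real extracted fragment.
    for I := 1 to 3 do
      Writeln(OutFile, '    <Contents />');

    // Close the root element once all fragments are written.
    Writeln(OutFile, '</Container>');
  finally
    CloseFile(OutFile);
  end;
end.
```

Because each fragment goes straight to disk, memory use stays flat regardless of the output size, which is the point of avoiding the stringlist for GB-sized output.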

Chris Thornton
  • Surely the delphi SAX-with-MSXML thing is functional though? – Warren P Aug 04 '11 at 14:24
  • I may resort to this, but I forgot to mention one variable-sized header element in the files that need to be ignored for the output. It makes straight filestream a bit hacky... – carlmon Aug 04 '11 at 14:32
  • Resorting to this rather than using a tested working SAX parser would be silly. (I won't use new components, unless I invent them from scratch?) – Warren P Aug 04 '11 at 18:44

libxml with the Delphi wrapper Libxml2 might be an option (found here), it has some SAX support and seems to be very solid - the web page mentions that libxml2 passed all 1800+ tests from the OASIS XML Tests Suite. See also: Is there a SAX Parser for Delphi and Free Pascal?

mjn
  • I wrote my own LibXML wrapper for Delphi 5 a few years ago, but we standardized on MSXML in newer Delphi to avoid bloat & dependencies - we were linking or shipping 3 different XML engines at one stage o_O. – carlmon Aug 05 '11 at 07:20
  • So now you're down to 1 and it's the buggiest one and it's part of the OS instead of shipping a known good version with your app. :-) – Warren P Aug 05 '11 at 18:30
Posting this as answer because it needs some space and formatting.

I've got one really bad data file for tests; see the commit message at https://github.com/the-Arioch/omnixml/commit/d1a544048e86921983fced67c772944f12cb1427

Here OmniXML kind of sucks in XE2 debug build:

  • About 25% more memory use than TXmlDocument/MSXML. Maybe even more after fixing .NextSibling issue, did not re-test.
  • longer file loading time ( OTOH significantly faster reading node properties: they are already Delphi-typed variables, no crossing of MSXML/Delphi boundary )
  • absolutely no support for namespaces, which makes recognizing tags way harder
  • XPath in embryo state, including yet again lack of namespaces

https://docs.google.com/spreadsheets/d/1QcFVwh3fFfaDyRmv2b-n4Rq4_u5p42UfNbR_FZgZizY/edit?usp=sharing

Arioch 'The