7

At work, we're using a XML log file. Each log message is a <message> block with <date> and <time> subnodes, there are <submessage> blocks, <table> constructs and so on and the log file can be transformed to a localized HTML using some Delphi processing and XSLT later.

For medium sized log files (around 2 MB), we got performance problems (up to a minute for loading the XML and doing some basic manipulations) and I could reduce them to a test project like this (EDIT: Updated the code and added measurements):

procedure TForm1.PrepareTest(MessageCount : integer);
var
  XML : IXMLDocument;
  i : integer;
begin
  XML := NewXMLDocument;
  XML.DocumentElement := XML.CreateNode('root');
  for i := 1 to MessageCount do
  begin
    XML.DocumentElement.AddChild('message').Text := 'Test Text';
  end;
  XML.SaveToFile(XML_NAME);
end;

procedure TForm1.XMLTest;
var
  StartTime : Cardinal;
  XML : IXMLDocument;
begin
  StartTime := GetTickCount();
  XML := NewXMLDocument;
  XML.LoadFromFile(XML_NAME);
  Memo1.Lines.Add('Node count: ' + IntToStr(XML.DocumentElement.ChildNodes.Count));
  Memo1.Lines.Add('Time: ' + FloatToStr((GetTickCount() - StartTime) / 1000) + ' seconds');
end;

This results in the following time measurements (node count is increased by 25% per column, all times in milliseconds):

Node count      8000    10000   12500   15625   19531   24413   30516   38145   47681
Base test time  484     781     1140    1875    2890    4421    6734    10672   16812
Variation 1                             32      47      62      78      78      141
Variation 2     2656    3157    3906    5015    6532    8922    12140   17391   24985
  (delta Base)   2172    2376    2766    3140    3642    4501    5406    6719    8173

Note the two variations, the first is LoadFromFile only and the second one is additionally adding 10000 more nodes to the beginning (!) of the XML like PrepareTest does, which is worst case but looking at the delta to the base test, even this doesn't show quadratic effects. Also note that counting the nodes can be replaced with any other operation, so it looks like there is some delayed initialization/validation of the XML file involved that causes the problem and any manipulation afterwards shows expected behaviour.

Memory usage is not high, the last test case (47681 nodes) has a peak memory usage of 39 MB, its XML filesize is 1.3 MB.

The first thing that is done after loading the XML (e.g. reading or writing some nodes or accessing node count) is slow and it shows quadratic runtime behaviour, so any log file above 10 MB is unusable.

We already solved the performance problems along with some other things by parsing small chunks of 100 messages, and I'm aware that the Delphi XML routines are not suited/overkill for this use case - using a different XML library would most likely stop the performance problems. So I'm not asking for a solution for the problem (although it would be interesting to know if the problem can be solved without using a different XML library).

My question is: What is the reason for the quadratic runtime behaviour of the Delphi XML routines, respectively MSXML? I can't imagine things that would lead to this in XML loading/parsing/validating, besides really "stupid" things like managing the nodes in a linked list instead of a tree, but I might be overlooking something, perhaps DOM related.

schnaader
  • 49,103
  • 10
  • 104
  • 136
  • 5
    This isn't really a Delphi problem. The XML routines in question are simply an interface to Windows's standard `MSXML` library. – Mason Wheeler Jan 16 '13 at 19:14
  • 5
    XML is not good for logging. The best option you have is to open the file, locate the last tag (hopefully automatically by known size, append your new nodes, and then append the last tag again. Otherwise you are onto a loser. – mj2008 Jan 16 '13 at 19:15
  • @MasonWheeler: Thanks for pointing this out, though I'm aware of it and it just changes the scapegoat, so the question remains the same. – schnaader Jan 16 '13 at 19:21
  • 1
    @mj2008: Appending new messages is done like you described and the log file is only loaded as a whole when converted to HTML, so the problem doesn't affect the main logging process. Still, XML has advantages here as mentioned, e.g. XSLT is very useful and everything used in the workflow is "UTF ready". – schnaader Jan 16 '13 at 19:27
  • have you tried to measure time of this code? I think the most of time is spent for `LoadFromFile()`. I recommend you to look at `NativeXML` and `OmniXML` parsers. – teran Jan 16 '13 at 19:36
  • 1
    @teran: I've done measurements some months ago and I might edit the question tomorrow after reviewing them, but AFAIR a single `LoadFromFile()` call without further XML modifications was very fast and it only gets slow when adding modifications (and only the first modification is slow), so it seems to use delayed initialization/validation. – schnaader Jan 16 '13 at 19:49
  • 3
    @schnaader DOM based APIs are always easy to work with, but slow. On the other hand, SAX based APIs are quite tricky to work with, but fast. – David Heffernan Jan 16 '13 at 19:56
  • Updated the question with complete code and time measurements, also changed "exponential" to "quadratic" for clarification. – schnaader Jan 17 '13 at 11:03

4 Answers4

7

I agree with mj2008 that XML isn't a good fit for logging. That said, this problem and large XML files in general, could get processed faster by using SAX, it throws events while parsing the incoming stream of XML data, which lets you process the items as they are read from disk, in effect mitigating the exponentiality of loading it all in memory before handing it to XSLT.

I regret I have not done SAX in Delphi (yet), but I suspect the hardest part is implementing the required SAX interfaces (e.g. ISAXContentHandler), but Delphi has TInterfacedObject and TAutoObject and other just for that.

Stijn Sanders
  • 35,982
  • 11
  • 45
  • 67
  • This answer confuses me even more. You say SAX improves performance which I understand perfectly, but why does it still have exponential runtime when loading? What's the reason parsing the second half of X nodes takes twice as long as the first half? I would perhaps understand it for a worst case XML tree (X nested nodes), but for the simple structure of a log file (X seperate message nodes), non-linear runtime seems wrong. – schnaader Jan 16 '13 at 21:15
  • SAX parsing will scale linearly with input file size – David Heffernan Jan 16 '13 at 21:57
  • I'll just trust David for now and +1 and accept your answer since SAX would've been one of the things I'd have tried next. If it also has exponential behaviour, I'll post a new question or edit this one, but I doubt it. Thanks! – schnaader Jan 16 '13 at 22:43
4

Your question in a nutshell: Why is the binary library MSXML so slow? Who knows. Who cares. Are you going to disassemble it? Break into Microsoft and grab its source code? This isn't Delphi here, this is microsoft code.

While XML is a bad choice for logging, OmniXML is probably a better choice than MSXML.

However, a far better choice is called "open text file for append, write line, close text file". Notice the inherent scalability, and the lack of parsing required.

Warren P
  • 65,725
  • 40
  • 181
  • 316
  • +1 because you're absolutely right, and what I've done to fix the problems goes in these directions. I just was curios if someone here at SO perhaps knows the answer. btw, I'm pretty close to inspecting/disassembling or at least doing some more research on the topic indeed, because I'm interested in why/how they screwed MSXML up so bad. – schnaader Jan 16 '13 at 22:50
3

Contrary to the comments of others, I think XML is an excellent format for logging. The Delphi VCL wrappers for XML are very core memory greedy, so this may explain the poor performance of pure TXMLDocument processing at scale.

I would instead recommend posting to your XML log by using a simple XSLT transform. I haven't measured the performance of this solution at scale, but I believe it will be a vast improvement over what you are currently reporting.

The style-sheet.

For example, suppose our log looks something like this...

<log>
  <message>This is the first message<message/> 
</log>  

This simple XSLT 1.0 style-sheet, with parameter addend-message ...

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" omit-xml-declaration="yes" />
<xsl:strip-space elements="*" />  
<xsl:param name="addend-message" select="''" />

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>  

<xsl:template match="log">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
    <message><xsl:value-of select="$addend-message" /></message>
  </xsl:copy>
</xsl:template>  

... will append a message to the log.

The language binding

To achieve this in Delphi, use the following declarations...

ITransform = interface
  procedure AddParameter( const sParamBaseName, sParamValue, sNamespaceURI: string);
  procedure Transform;
  property  InputDocumentFileName : string;
  property  OutputDocumentFileName: string;
end;

function MS_Transform( const sStylesheet: string): ITransform;

Create an ITransform, passing in the stylesheet as a string. Set both filename properties to the filename of the log. Every time you need to add a message to the log, call AddParameter() and then Transform().

Solution implementation details

One possible implementation for the aforementioned language binding might be ...

uses XMLIntf, msxml, msxmldom, sysutils;

type
  ITransform = interface
    ['{1004AE9A-D4AE-40E1-956D-AD98801AF7C1}']
      procedure SetInputDocumentFileName ( const sValue: string);
      procedure SetOutputDocumentFileName( const sValue: string);
      procedure AddParameter( const sParamBaseName, sParamValue, sNamespaceURI: string);
      procedure Transform;

      property InputDocumentFileName : string    write SetInputDocumentFileName;
      property OutputDocumentFileName: string    write SetInputDocumentFileName;
    end;

    TMS_XSLT = class( TInterfacedObject, ITransform)
    private
      FStylesheet: IXSLTemplate;
      FStylesheetAsDoc: IXMLDOMDocument2;
      FInputFN, FOutputFN: string;
      FProcessor: IXSLProcessor;;

      procedure SetInputDocumentFileName ( const sValue: string);
      procedure SetOutputDocumentFileName( const sValue: string);
      procedure MakeProcessor;

    publc
      constructor Create( const sStylesheet: string);
      procedure AddParameter( const sParamBaseName, sParamValue, sNamespaceURI: string);
      procedure Transform;

      property InputDocumentFileName : string    write SetInputDocumentFileName;
      property OutputDocumentFileName: string    write SetInputDocumentFileName;
    end;

  function MS_Transform( const sStylesheet: string): ITransform


function MS_Transform( const sStylesheet: string): ITransform
begin
result := TMS_XSLT.Create( sStylesheet)
end;

constructor TMS_XSLT.Create( const sStylesheet: string);
begin
  FStyleSheet := msxml.CoXSLTemplate60.Create;
  FStylesheetAsDoc := msxml.CoFreeThreadedDOMDocument60.Create;
  FStylesheetAsDoc.loadXML( sStyleSheetContent);
  FStylesheet.stylesheet := FStylesheetAsDoc  
end;

procedure TMS_XSLT.MakeProcessor;
begin
if not assigned( FProcessor) then
  FProcessor := FStylesheet.createProcessor
end;

procedure TMS_XSLT.SetInputDocumentFileName( const sValue: string);
begin
FInputDoc := sValue
end;

procedure TMS_XSLT.SetOutputDocumentFileName( const sValue: string);
begin
FOutputDoc := sValue
end;

procedure TMS_XSLT.AddParameter( const sParamBaseName, sParamValue, sNamespaceURI: string);
begin
MakeProcessor;
FProcessor.addParameter( sParamBaseName, sParamValue, sNamespaceURI)
end;

procedure TMS_XSLT.Transform;
var
  Doc: TXMLDocument;
  DocIntf: IXMLDocument;
  oXMLDOMNode: IXMLDOMNodeRef;
  sOutput: string;
begin
MakeProcessor;
try
  Doc  := TXMLDocument.Create( nil);
  Doc.Options := [doNodeAutoCreate, doNodeAutoIndent, doAttrNull, doAutoPrefix, doNamespaceDecl];
  Doc.DOMVendor := GetDOMVendor( 'MSXML');
  DocIntf := Doc;
  DocIntf.LoadFromFile( FInputFN);
  DocIntf.Active := True;
  if Supports( DocIntf.Node.DOMNode, IXMLDOMNodeRef, XMLDOMNode) then
    FProcessor.input := XMLDOMNode.GetXMLDOMNode;
  FProcessor.transform;
  while oProcessor.readyState <> 4 do sleep(1);
  sOutput := FProcessor.output;
  if sOutput = '' then exit;
  WriteToFile( sFOutputFN, sOutput);
  // Alternate way..
  //  Doc  := TXMLDocument.Create( nil);
  //  Doc.Options := [doNodeAutoCreate, doNodeAutoIndent, doAttrNull, doAutoPrefix, doNamespaceDecl];
  //  Doc.DOMVendor := GetDOMVendor( 'MSXML');
 //   DocIntf := Doc;
  //  DocIntf.LoadFromXML( sOutput);
  //  DocIntf.Active := True;
  //  DocIntf.SaveToFile( FOutputFN)
finally
  FProcessor := nil
  end
end;

This binds to Microsoft's MS XML library and XSLT engine. Regrettably, I do not know of any convenient way to bind Saxon's XSLT processor to Delphi code.

Alternative implementation

An alternative implementation for leveraging MS's XSLT engine is given by my answer here. The disadvantage of this method is that parameterisation is not native. To parameterise the style sheet, your would have to roll your own, by doing a string replace on the stylesheet before transformation.

Performance considerations

If you are doing a lot of logging quickly, it may be a good strategy to cache messages to be logged in memory, and then at regular but not too frequent intervals, purge your cache with a single XSLT transform to write all your messages.

Community
  • 1
  • 1
Sean B. Durkin
  • 12,659
  • 1
  • 36
  • 65
  • +1, slightly offtopic as the logging itself is not my problem, but this is a good way to do the main logging work. – schnaader Jan 17 '13 at 11:19
  • +1 for crazy idea. I still think this suffers from too much complexity. Text files are better. – Warren P Jan 17 '13 at 18:23
1

Are you using the term "exponential" in a mathematical sense, or just in a popular sense? It would be interesting to know for example whether it's really quadratic, or whether it's some kind of function where the performance is fairly linear until you hit some threshold (memory size) at which point it suddenly degrades.

If it's taking a minute to process 2Mb then something is very badly wrong indeed. I don't know your environment well enough to start making guesses, but this should take at most a second. You need to drill down to find where the time is going. Start by establishing whether it's spending its time parsing the XML, or processing the XML after parsing is complete.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
  • I'll update the question with complete code, runtimes and memory usage soon. I use "exponential" in a mathematical sense, e.g. for a "base" XML 10 KB in size, processing 20 KB takes around 4 times as long, 40 KB takes around 16 times as long and so on. But your question still is valid, as I only checked 3-5 values last time and haven't done a complete runtime analysis. – schnaader Jan 17 '13 at 08:07
  • That would be quadratic scaling. Are you by any chance using bubble sort? – David Heffernan Jan 17 '13 at 08:25
  • @DavidHeffernan: I'm not doing anything that would explain quadratic behaviour, that's why I'm so interested in the reason or what MSXML does wrong here. I can even reduce the problem to two API calls, loading a XML and counting its nodes. – schnaader Jan 17 '13 at 09:51
  • Which part of that is quadratic? Loading or counting? – David Heffernan Jan 17 '13 at 10:02
  • Counting, but it doesn't depend if I count the nodes or do any other operation on the XML, it seems like there is some delayed initialization/validation going on and operations after that are showing expected "normal" behaviour. – schnaader Jan 17 '13 at 11:08