
I'm looking for a very fast XML parser for Delphi, for very simple data.

Consider the following kind of data:

<node>
    <datatype1>randomdata</datatype1>
    <datatype2>randomdata</datatype2>
    <datatype3>randomdata</datatype3>
    <datatype4>randomdata</datatype4>
    <datatype5>randomdata</datatype5>
    <datatype6>randomdata</datatype6>
    <datatype7>randomdata</datatype7>
    <datatype8>randomdata</datatype8>
    <datatype9>randomdata</datatype9>
    <datatype10>randomdata</datatype10>
    <datatype11>randomdata</datatype11>
    <datatype12>randomdata</datatype12>
    <datatype13>randomdata</datatype13>
    <datatype14>randomdata</datatype14>
    <datatype15>randomdata</datatype15>
    <datatype16>randomdata</datatype16>
    <datatype17>randomdata</datatype17>
    <datatype18>randomdata</datatype18>
    <datatype19>randomdata</datatype19>
    <datatype20>randomdata</datatype20>
</node>

Copy this block 10,000 times (with the tag names and the data obviously differing in a real scenario). Also consider that the data contains Unicode.

This will be parsed and loaded into an array of records like:

type
  MyData = record
    d1, d2, d3, d4, d5,
    d6, d7, d8, d9, d10,
    d11, d12, d13, d14, d15,
    d16, d17, d18, d19, d20: string;
  end;

I wrote a custom parser for this, which on my computer takes approximately 115 ms for the entire process, from loading the file to having all 10,000 records filled.

So I am looking for something that can accomplish this faster.
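
For reference, a baseline using Delphi's stock DOM wrapper (Xml.XMLDoc / IXMLDocument) would look roughly like the sketch below. It assumes the repeated <node> blocks are wrapped in a single root element so the file is well-formed XML, and the mapping of child names onto the d1..d20 fields is abbreviated.

// Sketch only: the conventional DOM route, for comparison with a custom parser.
// Assumes a single wrapping root element around the 10,000 <node> blocks.
// MyData is the record type shown above.
uses
  System.SysUtils, Winapi.ActiveX, Xml.XMLDoc, Xml.XMLIntf;

procedure LoadWithDom(const FileName: string; var Data: TArray<MyData>);
var
  Doc: IXMLDocument;
  Root, Node, Child: IXMLNode;
  i, j: Integer;
begin
  CoInitialize(nil);                       // needed for MSXML in a console app
  try
    Doc := LoadXMLDocument(FileName);
    Root := Doc.DocumentElement;           // the assumed wrapping root
    SetLength(Data, Root.ChildNodes.Count);
    for i := 0 to Root.ChildNodes.Count - 1 do
    begin
      Node := Root.ChildNodes[i];          // one <node> element
      for j := 0 to Node.ChildNodes.Count - 1 do
      begin
        Child := Node.ChildNodes[j];       // <datatypeN>...</datatypeN>
        if Child.NodeName = 'datatype1' then
          Data[i].d1 := Child.Text;
        // ... mapping for datatype2..datatype20 omitted for brevity
      end;
    end;
  finally
    CoUninitialize;
  end;
end;

The sketch is for comparison only: the point of the question is to beat this kind of general-purpose parsing.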

Related questions:

Pos() within utf8 string boundaries

Delphi - Pos() with boundaries

hikari
  • How are you testing this, and how fast do you want it? I mean, 115 ms is *pretty good* even if you're simply READING 8 MB from disk and doing nothing with it. What good would it do you if you could parse faster than the HDD can read? – Cosmin Prund Jan 05 '13 at 19:08
  • The times are like this: load from disk as a Unicode string: 26 ms, parse node boundaries: 7 ms, load nodes and fill the data records: 76 ms. This is on an i5 @ 4 GHz, loading the XML file from a ramdisk. The disk-loading part has no bearing here, however; I need the parsing process improved. – hikari Jan 05 '13 at 19:15
  • "load nodes" (c) do you need to _load_ nodes? seems you should use SAX model parser, without "real loading" nodes as DOM-model parser does. Anyway, as for me, 155ms is a very good speed. You can try OmniXML or NativeXML parsers, but I don't know if they implement SAX-model – teran Jan 05 '13 at 19:35
  • That is one fast parser, and one fast computer you've got there. I doubt anyone can compete with that... My old i7 @ 3.4 GHz takes 5 ms just to MOVE (i.e. the `Move()` command) 8 MB from one location in RAM to another! I really doubt those times can be beaten, so there's no point in providing an answer. – Cosmin Prund Jan 05 '13 at 19:36
  • SAX is always faster than DOM for this task. – David Heffernan Jan 05 '13 at 19:46
  • Tried Omni, seems very bloated and doesn't seem to work with Unicode. The only demo that seemed to work with that example XML took 20-30 times longer than my code. Will check NativeXML next. – hikari Jan 05 '13 at 19:50
  • This is not a real question. You're not asking people to improve your code, since you haven't shown it. You're perhaps asking people to invent new code that beats your hidden code, but you haven't provided a spec beyond saying that it needs to parse XML. If that's the best you can describe, then there are already XML libraries out there, and there's no sense trying to make a new one just to answer this question. Profile your code and improve it yourself. – Rob Kennedy Jan 05 '13 at 19:52
  • I provided a specific example of the data format that needs to be parsed, and asked for the fastest parser to accomplish it; there's no need to provide my parser code in this case. – hikari Jan 05 '13 at 19:54
  • You provided an example, but then you've said the nodes can differ. How, exactly? If you can't say, then you must assume full XML. If you can be more specific, then you can get a more specialized parser that isn't really XML, but merely XML-like. If that's your need, though, then the question is too localized; nobody else will be interested in having a parser for your special format. – Rob Kennedy Jan 05 '13 at 20:00
  • Also, how are the linked questions related, exactly? What do they have to do with parsing "XML"? – Rob Kennedy Jan 05 '13 at 20:02
  • Tried NativeXml, took 1500ms: http://pastebin.com/DSVLcNLn – hikari Jan 05 '13 at 20:17
  • @Rob, the data format will always be like that: a root node, in this case "node", and several subnodes under it, which can have different names, like Year, Name, Whatever, with some data inside to read; then the next node, also named "node", with the same named subnodes under it, and so on. – hikari Jan 05 '13 at 20:19
  • The questions are related because my original question was about improving a very specific part of my parser, but it drifted into something else and I was advised to ask a separate question with more specific goals, so this is it. – hikari Jan 05 '13 at 20:20
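
A minimal sketch of the SAX-style idea suggested in the comments above: scan the text once and hand each subnode name and payload to a callback, instead of building a document tree. The names here (TValueEvent, ScanXmlValues) are illustrative rather than any library's API, and the scan assumes the simple element-only structure of the sample data.

// Sketch of a SAX-like, single-pass scan that fires a callback per value node.
// TValueEvent and ScanXmlValues are illustrative names, not a library API.
type
  TValueEvent = reference to procedure(NodeIndex: Integer;
    const SubnodeName, Value: string);

procedure ScanXmlValues(const Xml: string; const OnValue: TValueEvent);
var
  p, TagStart, TextStart, NodeIndex: Integer;
  TagName: string;
begin
  NodeIndex := 0;
  TextStart := 1;
  p := 1;
  while p <= Length(Xml) do
  begin
    if Xml[p] = '<' then
    begin
      TagStart := p + 1;
      while (p <= Length(Xml)) and (Xml[p] <> '>') do
        Inc(p);
      if TagStart <= Length(Xml) then
      begin
        if Xml[TagStart] = '/' then
        begin
          // Closing tag: report the accumulated payload for value nodes.
          TagName := Copy(Xml, TagStart + 1, p - TagStart - 1);
          if TagName <> 'node' then
            OnValue(NodeIndex, TagName, Copy(Xml, TextStart, TagStart - 1 - TextStart));
        end
        else
        begin
          // Opening tag: a new <node> starts a new logical record.
          TagName := Copy(Xml, TagStart, p - TagStart);
          if TagName = 'node' then
            Inc(NodeIndex);
          TextStart := p + 1;    // the payload (if any) starts right after '>'
        end;
      end;
    end;
    Inc(p);
  end;
end;

The caller would pass an anonymous method that assigns Value to the right field of record NodeIndex, which keeps the scanner itself independent of the concrete record layout.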

2 Answers


First let me tell you that you're optimizing the wrong thing here: unless you're doing this for recreational purposes, your approach is wrong. XML is not a difficult format, but it does have its quirks and it takes its liberties. It's a format designed for data exchange between foreign applications, so the emphasis needs to be on COMPATIBILITY, not on SPEED! What good is a non-standard, ultra-fast parser that gives the wrong result when confronted with a slightly altered XML file?

If you can find an XML parsing LIBRARY that's guaranteed to be compatible with anything out there and that can parse your data at HALF the speed your HDD can read it, then simply implement a producer-consumer multi-threaded application where one thread constantly reads the data from disk while two others do the parsing. In the end you'll only be limited by the speed of the HDD, while maintaining compatibility. If you're only looking for speed, you're liable to make mistakes, skip XML features, and depend on particularities of the sample XML file you're dealing with; your application is likely to break for numerous reasons.
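
A rough sketch of that producer/consumer arrangement is below, using TThreadedQueue from Generics.Collections and TTask from System.Threading (the latter needs a newer Delphi than was current when this was written). ParseChunk and the pre-split Chunks array are placeholders, and a real implementation would have to split the input on </node> boundaries so each chunk parses independently.

// Sketch only: one producer pushes pre-split chunks, two workers parse them.
uses
  System.SysUtils, System.Threading, System.Generics.Collections;

procedure ParseChunk(const Chunk: string);
begin
  // Placeholder: run the XML library of your choice on this chunk.
end;

procedure ParseInParallel(const Chunks: TArray<string>);
var
  Queue: TThreadedQueue<string>;
  MyChunks: TArray<string>;
  Producer: ITask;
  Workers: array[0..1] of ITask;
  i: Integer;
begin
  MyChunks := Chunks;                            // local copy so the closure can capture it
  Queue := TThreadedQueue<string>.Create(16);    // bounded queue: 16 pending chunks
  try
    // Producer: pushes the chunks, then one empty-string terminator per worker.
    Producer := TTask.Run(
      procedure
      var
        Chunk: string;
      begin
        for Chunk in MyChunks do
          Queue.PushItem(Chunk);
        Queue.PushItem('');
        Queue.PushItem('');
      end);

    // Consumers: pop and parse until the terminator arrives.
    for i := 0 to High(Workers) do
      Workers[i] := TTask.Run(
        procedure
        var
          Chunk: string;
        begin
          repeat
            Chunk := Queue.PopItem;
            if Chunk <> '' then
              ParseChunk(Chunk);
          until Chunk = '';
        end);

    Producer.Wait;
    TTask.WaitForAll(Workers);
  finally
    Queue.Free;
  end;
end;

Whether this helps depends on the parser being slower than the disk; if the parser already outruns the read, the reader thread stays the bottleneck, which is the point being made above.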

Remember that the most costly phase in an application's life is MAINTENANCE, not development. What you might gain today by making a thing that's 50% faster but 200% more difficult to maintain will be lost in a year or so, when computers get 50% faster (nullifying your edge over the competition). Besides, there's no point in exceeding natural limits for such a process, like the speed of the HDD. It's irrelevant that you're testing with a file from a RAM-drive: when the application goes into production it will be used with files from a HDD, and your application's performance will be limited by the speed of that HDD.


Anyhow, I do like a challenge once in a while and I really like parsers. What follows is a very simple parser implementation that only looks at each character in the input string once and only copies data where needed: it copies the name of the nodes in order to decide what to do next, and it copies a node's "payload" when appropriate, in order to push it into the array. On my "modest" i7 @ 3.4 GHz, parsing a string built by copying your sample data 10,000 times takes 63 ms. It clearly beats your time, BUT a word of warning: this code is fragile, as it depends on the XML file having a certain form. There's no way around that.

program Project28;

{$APPTYPE CONSOLE}

uses SysUtils, DateUtils, Windows;

const SampleData =
    '<node>'#13#10+
    '  <datatype1>randomdata</datatype1>'#13#10+
    '  <datatype2>randomdata</datatype2>'#13#10+
    '  <datatype3>randomdata</datatype3>'#13#10+
    '  <datatype4>randomdata</datatype4>'#13#10+
    '  <datatype5>randomdata</datatype5>'#13#10+
    '  <datatype6>randomdata</datatype6>'#13#10+
    '  <datatype7>randomdata</datatype7>'#13#10+
    '  <datatype8>randomdata</datatype8>'#13#10+
    '  <datatype9>randomdata</datatype9>'#13#10+
    '  <datatype10>randomdata</datatype10>'#13#10+
    '  <datatype11>randomdata</datatype11>'#13#10+
    '  <datatype12>randomdata</datatype12>'#13#10+
    '  <datatype13>randomdata</datatype13>'#13#10+
    '  <datatype14>randomdata</datatype14>'#13#10+
    '  <datatype15>randomdata</datatype15>'#13#10+
    '  <datatype16>randomdata</datatype16>'#13#10+
    '  <datatype17>randomdata</datatype17>'#13#10+
    '  <datatype18>randomdata</datatype18>'#13#10+
    '  <datatype19>randomdata</datatype19>'#13#10+
    '  <datatype20>randomdata</datatype20>'#13#10+
    '</node>'#13#10;
const NodeIterations = 10000;

type
  TDummyRecord = record
    D1, D2, D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13,
      D14, D15, D16, D17, D18, D19, D20: string;
  end;
  TDummyRecordArray = array[1..NodeIterations] of TDummyRecord;

procedure ParseDummyXMLToRecordArray(const InputText:string; var A: TDummyRecordArray);
var PInputText: PChar;
    cPos, TextLen: Integer;
    C: Char;
    State: Integer;

    tag_starts_at: Integer;
    last_payload_starts_at: Integer;
    FlagEndTag: Boolean;

    NodeName, Payload: string;

    cNode: Integer;

const st_not_in_node = 1;
      st_in_node = 2;
begin
  cPos := 1;
  TextLen := Length(InputText);
  PInputText := @InputText[1];
  State := st_not_in_node;
  last_payload_starts_at := 1;
  cNode := 0;

  // This is the lexer/parser loop. It's a finite-state machine with only
  // two states: st_not_in_node and st_in_node
  while cPos < TextLen do
  begin
    C := PInputText[cPos-1];
    case State of

      // What happens when we're NOT currently inside a node?
      // Not much. We only jump to st_in_node if we see a "<"
      st_not_in_node:
        case C of
          '<':
            begin
              // A node starts here. Switch state and set up some simple
              // flags.
              state := st_in_node;
              tag_starts_at := cPos + 1;
              FlagEndTag := False;
            end;
        end;

      // What happens while inside a node? Again, not much. We only care about
      // the "/", as it signals a closing tag, and about the ">", because that
      // marks the end of the node.
      st_in_node:
        case C of
          '/': FlagEndTag := True;
          '>':
            begin
              // This is where the magic happens. We're in one of two possible
              // situations: we're either seeing the opening <name> of a pair,
              // or the closing </name>.
              if FlagEndTag then
                begin
                  // This is the closing half of a tag pair, i.e. it's the </NodeName>. What we do
                  // depends on which node is closing, so we retrieve the NodeName:
                  NodeName := System.Copy(InputText, tag_starts_at+1, cPos - tag_starts_at-1);
                  if NodeName <> 'node' then // SAMPLE-DATA-SPECIFIC: I know I don't care about "node" tags.
                  begin
                    // SAMPLE-DATA-SPECIFIC: I know there are only two kinds of nodes:
                    // "node" and "datatypeN". I retrieve the PAYLOAD for the node because
                    // I know it's not "node" and I know I'll need it.
                    Payload := System.Copy(InputText,last_payload_starts_at, tag_starts_at - last_payload_starts_at -1);
                    // Make sure we're dealing with a valid node
                    if (cNode > 0) and (cNode <= High(A)) then
                      begin
                        // Based on NodeName, copy the Payload into the appropriate field.
                        if NodeName = 'datatype1' then A[cNode].D1 := Payload
                        else if NodeName = 'datatype2' then A[cNode].D2 := Payload
                        else if NodeName = 'datatype3' then A[cNode].D3 := Payload
                        else if NodeName = 'datatype4' then A[cNode].D4 := Payload
                        else if NodeName = 'datatype5' then A[cNode].D5 := Payload
                        else if NodeName = 'datatype6' then A[cNode].D6 := Payload
                        else if NodeName = 'datatype7' then A[cNode].D7 := Payload
                        else if NodeName = 'datatype8' then A[cNode].D8 := Payload
                        else if NodeName = 'datatype9' then A[cNode].D9 := Payload
                        else if NodeName = 'datatype10' then A[cNode].D10 := Payload
                        else if NodeName = 'datatype11' then A[cNode].D11 := Payload
                        else if NodeName = 'datatype12' then A[cNode].D12 := Payload
                        else if NodeName = 'datatype13' then A[cNode].D13 := Payload
                        else if NodeName = 'datatype14' then A[cNode].D14 := Payload
                        else if NodeName = 'datatype15' then A[cNode].D15 := Payload
                        else if NodeName = 'datatype16' then A[cNode].D16 := Payload
                        else if NodeName = 'datatype17' then A[cNode].D17 := Payload
                        else if NodeName = 'datatype18' then A[cNode].D18 := Payload
                        else if NodeName = 'datatype19' then A[cNode].D19 := Payload
                        else if NodeName = 'datatype20' then A[cNode].D20 := Payload
                        else
                          raise Exception.Create('Unknown node: ' + NodeName);
                      end
                    else
                      raise Exception.Create('cNode out of bounds.');
                  end;
                  // Repeat :-)
                  state := st_not_in_node;
                end
              else
                begin
                  // Node start. Retrieve the node name. I only care about the start of a "node" tag:
                  // if I see one, I increment the current node counter so I go on filling the next
                  // position in the array with whatever I need.
                  NodeName := System.Copy(InputText, tag_starts_at, cPos - tag_starts_at);
                  last_payload_starts_at := cPos+1;
                  if NodeName = 'node' then Inc(cNode);
                  state := st_not_in_node;
                end;
            end;
        end;
    end;
    Inc(cPos);
  end;
end;

var DataString: string;
    SB: TStringBuilder;
    i: Integer;
    DummyArray: TDummyRecordArray;
    T1, T2, F: Int64;

begin
  try
    try
      // Prepare the sample string: 10,000 copies of the sample data.
      SB := TStringBuilder.Create;
      try
        for i:=1 to NodeIterations do
          SB.Append(SampleData);
        DataString := SB.ToString;
      finally SB.Free;
      end;

      // Invoke the simple parser using the string constant.
      QueryPerformanceCounter(T1);

      ParseDummyXMLToRecordArray(DataString, DummyArray);

      QueryPerformanceCounter(T2);
      QueryPerformanceFrequency(F);
      WriteLn(((T2-T1) * 1000) div F);

      // Test parse validity.
      for i:=1 to NodeIterations do
      begin
        if DummyArray[i].D1 <> 'randomdata' then raise Exception.Create('Bug. D1 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D2 <> 'randomdata' then raise Exception.Create('Bug. D2 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D3 <> 'randomdata' then raise Exception.Create('Bug. D3 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D4 <> 'randomdata' then raise Exception.Create('Bug. D4 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D5 <> 'randomdata' then raise Exception.Create('Bug. D5 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D6 <> 'randomdata' then raise Exception.Create('Bug. D6 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D7 <> 'randomdata' then raise Exception.Create('Bug. D7 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D8 <> 'randomdata' then raise Exception.Create('Bug. D8 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D9 <> 'randomdata' then raise Exception.Create('Bug. D9 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D10 <> 'randomdata' then raise Exception.Create('Bug. D10 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D11 <> 'randomdata' then raise Exception.Create('Bug. D11 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D12 <> 'randomdata' then raise Exception.Create('Bug. D12 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D13 <> 'randomdata' then raise Exception.Create('Bug. D13 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D14 <> 'randomdata' then raise Exception.Create('Bug. D14 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D15 <> 'randomdata' then raise Exception.Create('Bug. D15 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D16 <> 'randomdata' then raise Exception.Create('Bug. D16 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D17 <> 'randomdata' then raise Exception.Create('Bug. D17 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D18 <> 'randomdata' then raise Exception.Create('Bug. D18 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D19 <> 'randomdata' then raise Exception.Create('Bug. D19 doesn''t have the proper value, i=' + IntToStr(i));
        if DummyArray[i].D20 <> 'randomdata' then raise Exception.Create('Bug. D20 doesn''t have the proper value, i=' + IntToStr(i));
      end;

    except on E: Exception do Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    WriteLn('ENTER to Exit');
    ReadLn;
  end;
end.
Cosmin Prund
  • I aim to achieve a bit of both :) Your code executes in ~40 ms on my computer for the parsing block (around 80 for mine); yours is more like a serial parser, I guess. It doesn't adapt well to different/generic scenarios, since here we know exactly what the subnode names are, but it could be adapted to trigger an event outside the parser, providing the app with a node number, the subnode name and its data, and letting the app assign it to whichever record it has. The approach in my parser is a bit more generic: it samples the file to gather node boundaries/pointers, then performs subnode data requests on a selected node (a reduced sketch of that two-pass idea follows this comment thread); example parse code: http://pastebin.com/kgkCU6Yx It also takes a CDATA tag in the payload into account. – hikari Jan 05 '13 at 20:58
  • It may seem like overly specific data, but this format is very common for a lot of web services using RESTful APIs, RSS feeds, etc. – hikari Jan 05 '13 at 21:01
  • I'll accept this answer as it fits the question's purpose, though it would need some work to turn it into a usable object (implement notify/trigger events for subnode data availability and end of file; the app would then grow the array/list by 1 whenever the provided node number is new/higher, and fill in the given subnode). – hikari Jan 05 '13 at 21:05
  • @hikari, XML is not a database; you don't query it - you parse it and then you're done with it. If you need complex queries, then you parse it, put the data into appropriate data structures, and then you query those data structures. You keep justifying your own way of doing things, but unfortunately you're wrong on multiple levels: you shouldn't write your own parser; if you do parse, you should aim for an O(n) algorithm (i.e. no repeated iterations); and you shouldn't query XML. You shouldn't even consider Assembler [...] – Cosmin Prund Jan 05 '13 at 21:06
  • [..] Try asking some questions about the most effective way of using existing libraries. For example, your attempt to use NativeXml, the one you posted in the comments section, is far from effective because of the use of `NodeByName`. If you made similar mistakes with other libraries, then it's no surprise your code seems so fast. I'm all for writing parsers and re-inventing the wheel, but if you do, do it right. Start reading about compiler theory, parser generators, etc. Look into the GOLD Parser Generator; it's fairly nice. Using `Pos()` and substring manipulation is a naive implementation. – Cosmin Prund Jan 05 '13 at 21:11
  • Delphi allows constants that long? I've always had compiler errors with strings that long... – Jerry Dodge Jan 07 '13 at 16:26
  • GOLD Parser is awesome and does a lot of stuff (including Pascal); however, I believe it's licensed, and if you use it from a compiled app (not in debug) you'll get periodic warnings that it is a "trial version" of GOLD Parser. – Jerry Dodge Jan 07 '13 at 16:29
  • @JerryDodge: that long constant is obtained through concatenation. About GOLD: GOLD itself is free and, according to the author, will always be free: http://goldparser.org/about/faq.htm – Cosmin Prund Jan 08 '13 at 21:49
  • Then they must have changed it; when running a compiled version, it showed a message every 2 minutes. That was about 2 years ago. – Jerry Dodge Jan 09 '13 at 19:49
  • @JerryDodge, maybe we're not talking about the same product. GOLD Parser *never* produced actual code; you can't compile the output of GOLD Parser. That's its whole philosophy: it generates the parse tables into a file and you write your own "engine". The theory goes on to explain how generating those parse tables is the difficult part, how not tying the grammar to any particular language allows the creation of cross-platform grammars, etc. Of course, you don't need to write your own engine, as there are two Delphi engines available. Yet I did write mine. – Cosmin Prund Jan 09 '13 at 20:21
  • I could be wrong, but I was very sure it was GOLD Parser. I did use the Delphi Pascal grammar. Either way, I'll stop filling up this question with comments. – Jerry Dodge Jan 10 '13 at 13:53
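
For context, here is a reduced sketch of the two-pass approach hikari describes in the comments above: a first pass records the boundaries of every <node> block, and a second pass extracts a named subnode's value on demand within those boundaries using PosEx. CDATA handling and error checking are omitted, and all names (TNodeBounds, IndexNodes, GetSubnodeValue) are illustrative.

// Sketch of a boundary-index approach: pass 1 records where each <node> block
// starts and ends; pass 2 pulls a named subnode's value out of one block.
// TNodeBounds, IndexNodes and GetSubnodeValue are illustrative names only.
uses
  System.SysUtils, System.StrUtils;

type
  TNodeBounds = record
    StartPos, EndPos: Integer;   // character offsets of '<node>' and '</node>'
  end;

function IndexNodes(const Xml: string): TArray<TNodeBounds>;
var
  p, n: Integer;
begin
  Result := nil;
  n := 0;
  p := PosEx('<node>', Xml, 1);
  while p > 0 do
  begin
    SetLength(Result, n + 1);                       // grown one-by-one to keep the sketch simple
    Result[n].StartPos := p;
    Result[n].EndPos := PosEx('</node>', Xml, p);   // 0 if the input is malformed
    Inc(n);
    p := PosEx('<node>', Xml, p + 1);
  end;
end;

function GetSubnodeValue(const Xml: string; const Bounds: TNodeBounds;
  const SubnodeName: string): string;
var
  OpenTag, CloseTag: string;
  ValueStart, ValueEnd: Integer;
begin
  Result := '';
  OpenTag := '<' + SubnodeName + '>';
  CloseTag := '</' + SubnodeName + '>';
  ValueStart := PosEx(OpenTag, Xml, Bounds.StartPos);
  if (ValueStart = 0) or (ValueStart > Bounds.EndPos) then
    Exit;                                // subnode not present in this block
  ValueStart := ValueStart + Length(OpenTag);
  ValueEnd := PosEx(CloseTag, Xml, ValueStart);
  if (ValueEnd > 0) and (ValueEnd <= Bounds.EndPos) then
    Result := Copy(Xml, ValueStart, ValueEnd - ValueStart);
end;

So GetSubnodeValue(Xml, Bounds[42], 'datatype3') would return that node's payload without the caller knowing anything about the other subnodes; each such request re-scans part of the block, which is the repeated iteration the comment above objects to.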

If your XML really is that simple, the format is fixed, the file is that big, and you need really fast processing, I would recommend implementing the parsing yourself, with a simple `while i <= Length(inputStr) do` loop. There you can search for the '<' symbol, extract node names, and so on.
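
A tiny illustration of the kind of loop described above (find each '<', then take everything up to the matching '>' as the tag name); this is purely a sketch, not a complete parser.

// Minimal illustration: walk the string once, collecting every tag name in it.
uses
  System.SysUtils;

function CollectTagNames(const Xml: string): TArray<string>;
var
  i, TagStart, Count: Integer;
begin
  Result := nil;
  Count := 0;
  i := 1;
  while i <= Length(Xml) do
  begin
    if Xml[i] = '<' then
    begin
      TagStart := i + 1;
      while (i <= Length(Xml)) and (Xml[i] <> '>') do
        Inc(i);
      SetLength(Result, Count + 1);
      Result[Count] := Copy(Xml, TagStart, i - TagStart);  // e.g. 'datatype1' or '/node'
      Inc(Count);
    end;
    Inc(i);
  end;
end;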

Nickolay Olshevsky
  • I'm already using my own parser, but I want something faster. The node elements are not fixed, however; both the root nodes and the subnodes under them have different names in different XML files. – hikari Jan 05 '13 at 19:22
  • You can extract their names in your parser. Anyway, any generic XML parser will be slower than your own parser, which knows the file structure in advance. – Nickolay Olshevsky Jan 05 '13 at 19:33
  • +1. The fastest way is to treat it as a string and not try to load it into an XML document. Use pointers into the buffer if that's not fast enough. Convert to Assembler if that's still not fast enough. – Chris Thornton Jan 05 '13 at 20:28
  • +1 on the comment about the parser already knowing the file structure. If the XML structure is extremely straightforward and you always know what nodes, properties, and data are coming up, then you may be better off with your own parser, as far as speed of reading the data goes. But as everyone keeps saying, even if you can get it down to 0.0001 ms, you still have the issue of HDD speed, or the speed of whatever means you're using to load that XML data. – Jerry Dodge Jan 07 '13 at 16:32