
I need to parse standard XML structures arriving over a TCP/IP connection. The received data is kept in a string variable. This means that at any given time the data I hold can be incomplete (a partial XML structure), or a complete XML structure followed by an incomplete leftover (the beginning of the next XML structure).

Most of the structures are not 'empty':

<Message>
  <Param1 value = "val1"/>
  <Param2 value = "val2"/>
</Message>

But there are also 'empty' ones:

<Message status="ack" />

So just searching for </Message> and making a split there is not good enough.

How can I separate the complete structure from the partial structure that follows it? Is there a cleaner solution than writing my own state machine and checking the data byte by byte?

Hadar Ben David
  • Perhaps this helps: http://stackoverflow.com/questions/55828/how-does-one-parse-xml-files?rq=1 – Jose Luis May 14 '17 at 18:25
  • The big issue here is that partial XML structures are not XML structures; they are invalid markup. Is there any way you can get away from XML? – Tony Hopkinson May 14 '17 at 18:28
  • It sounds like you should work on the higher level protocol so that you know how many bytes to expect, and can cleanly differentiate between documents. Is this a protocol you control? – Jon Skeet May 15 '17 at 06:47
  • Jon, this is not a protocol I control. If I were to design such a protocol, I would have added a constant ending token to make it easy to differentiate between two consecutive messages. – Hadar Ben David May 15 '17 at 15:24
  • We have this exact same issue, i.e., how to identify XML messages sent over a TCP/IP connection. At any point in the servicing of the socket, we may have a partial XML message, a complete XML message, or multiple XML messages concatenated together. We need a way to identify each message (presumably by looking for the opening tag and its corresponding closing tag), extract the message from the buffer, and then parse it with `XmlDocument`. For now, we've resorted to writing our own state machine. I had hoped for a community-proven solution. – Matt Davis Oct 29 '18 at 17:51

1 Answer


You can wrap the concatenated messages in a root element, parse them, and build a dictionary for each message:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input =
                "<Message>" +
                  "<Param1 value = \"val1\"/>" +
                  "<Param2 value = \"val2\"/>" +
                "</Message>" +
                "<Message>" +
                  "<Param1 value = \"val1\"/>" +
                  "<Param2 value = \"val2\"/>" +
                "</Message>";


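            // Wrap the concatenated messages in a single root element and parse them;
            // new XElement("Root", input) would store the input as plain text instead.
            XElement message =
                XElement.Parse("<Root>" + input + "</Root>");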

            var results = message.Elements("Message")
                .Where(x => x.HasElements)
                .Select(x => x.Elements()
                    .GroupBy(y => y.Name.LocalName, z => z)
                    .ToDictionary(y => y.Key, z => (string)z.FirstOrDefault()
                        .Attribute("value")))
                .ToList();
        }
    }
}
jdweng
  • Thank you for your suggestion jdweng. However it seems that XElement.Parse(input); throws an exception when trying to parse an incomplete XML structure. – Hadar Ben David May 14 '17 at 21:18
  • With XML you must wait for all the data to arrive, because the XML tags must be closed. With TCP, a message can arrive split across several segments of at most about 1500 bytes each. So you first need to know where each message terminates, and keep reading TCP data until the entire message has been received. In this case you can split the data into messages by searching for `</Message>` as the terminator. – jdweng May 15 '17 at 00:55
  • jdweng, I could have searched for `</Message>` were it not for 'empty' nodes such as `<Message status="ack" />`. – Hadar Ben David May 15 '17 at 15:26
  • You still can. You will just get both an empty and a full element together. All you are really concerned about is not ending up with a partial element. – jdweng May 15 '17 at 15:31
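
The buffering approach discussed in these comments could be sketched roughly as follows. This is only a minimal illustration, not code from the answer: the `MessageSplitter` class and its `Append` method are hypothetical names, and the sketch assumes that every top-level element is named `Message`, that `Message` elements are never nested, that no other element name starts with `Message`, and that no attribute value in the opening tag contains a `>` character. It handles both the full form and the 'empty' self-closing form by inspecting how the opening tag ends.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of the original answer): accumulates text
// received from the socket and hands back every complete <Message> element.
public class MessageSplitter
{
    private const string OpenTag = "<Message";
    private const string CloseTag = "</Message>";
    private readonly StringBuilder _buffer = new StringBuilder();

    // Appends a received chunk and returns all messages that are now complete.
    // Any incomplete tail stays in the buffer for the next call.
    public List<string> Append(string chunk)
    {
        _buffer.Append(chunk);
        string text = _buffer.ToString();
        var complete = new List<string>();
        int pos = 0; // index just past the last complete message

        while (true)
        {
            int start = text.IndexOf(OpenTag, pos, StringComparison.Ordinal);
            if (start < 0) break; // no (complete) opening tag yet

            // Assumption: the first '>' after "<Message" ends the opening tag.
            int tagEnd = text.IndexOf('>', start);
            if (tagEnd < 0) break; // opening tag is still incomplete

            int end;
            if (text[tagEnd - 1] == '/')
            {
                // Self-closing form, e.g. <Message status="ack" />
                end = tagEnd + 1;
            }
            else
            {
                // Full form: wait until the closing tag has arrived.
                int close = text.IndexOf(CloseTag, tagEnd, StringComparison.Ordinal);
                if (close < 0) break;
                end = close + CloseTag.Length;
            }

            complete.Add(text.Substring(start, end - start));
            pos = end;
        }

        // Drop everything that was consumed; keep the partial remainder.
        _buffer.Remove(0, pos);
        return complete;
    }
}
```

Each string returned by `Append` should then be a complete element that can be passed to `XElement.Parse` (or loaded into an `XmlDocument`) on its own, while the incomplete remainder waits for the next read from the socket. If the assumptions above do not hold for the real protocol, a proper tokenizer or state machine is likely still needed.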