How to set MemoryStream position based on IndexOf, to split apart a sequence of XML documents?

Question

I have a pseudo XML file with 5 small xmls in it like so:

What I am trying to achieve is separate and create a new file for each of these XMLs using MemoryStream with this code:

int flag = 0;

byte[] arr = Encoding.ASCII.GetBytes(File.ReadAllText(@"C:\\Users\\Aleksa\\Desktop\\testTxt.xml"));

for (int i = 0; i <= 5; i++)
{
    MemoryStream mem = new MemoryStream(arr);
    mem.Position = flag;
    StreamReader rdr = new StreamReader(mem);

    string st = rdr.ReadToEnd();

    if (st.IndexOf("<TestNode") != -1 && (st.IndexOf("</TestNode>") != -1 || st.IndexOf("/>") != -1))
    {
        int curr = st.IndexOf("<TestNode");
        int end = st.IndexOf("\r");
        string toWrite = st.Substring(st.IndexOf("<TestNode"), end);
        File.WriteAllText(@"C:\\Users\\Aleksa\\Desktop\\" + i.ToString() + ".xml", toWrite);
        flag += end;
    }
    Console.WriteLine(st);
}

The first XML from the image is separated correctly, but the rest of the files are empty. While debugging I noticed that even though I set the position from the end variable, it still streams from the top. Also, in every iteration after the first, the end variable equals zero!

I have tried changing the IndexOf parameter to </TestNode> + 11, which behaves like the code above except that the remaining files are no longer empty but incomplete, leaving me with <TestNode a. How can I fix the logic here and split my stream of XML documents apart?

Why don't you just enclose the string you've read in a dummy root element so you can read it as XML and use XML tools to pull it back apart? — Damien_The_Unbeliever, Aug 02 '19 at 16:23
Because this is practice and I have to follow doc specs for it - I get endless streamed strings which I place in a file from a remote server, and then using MemoryStream I have to pull those strings out of the file like this and create for each of them a separate xml file and I dont know if enclosing strings in a dummynode would be good in this situation, but giving it a chance, how can I go about and enclose it in a dummynode - keep in mind that the file gets new strings all the time - and then create separate little xmls? @Damien_The_Unbeliever — MicroDev92, Aug 02 '19 at 18:48
Also, there are only 5 mini xmls here because in reality I dont have a remote server sending me strings, but I have to write the logic like it does send them..the last mini xml string doesnt have the end tag but is enclosed in itself because of the docs too, some strings are “sent” like this...weird I know, but that is my homework zzz — MicroDev92, Aug 02 '19 at 18:49
You can read such a stream of XML fragments as-is by setting `XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };` beforehand. No need for anything manual. See e.g. [Read nodes of a xml file in C#](https://stackoverflow.com/a/46476652/3744182) or [Fragmented XML string parsing with Linq](https://stackoverflow.com/q/6114789/3744182). — dbc, Aug 03 '19 at 00:40
Demo of using `ConformanceLevel = ConformanceLevel.Fragment` here: https://dotnetfiddle.net/Ef8CfV. Does this meet your needs? Since you mention "homework" you might have some additional constraints that disallow this solution. — dbc, Aug 03 '19 at 01:31
@dbc, Sorry about the delayed answer I wasn't near my PC due to personal reasons, well this actually meets my needs and it works beautifully but the code is really advanced and I could use a little walkthrough the code, at least some bits -- `XmlSerializationHelper` is used for? also I noticed that if I remove the `foo` nodes from the xml's they will be stored like so - `` and like `` can we control that? — MicroDev92, Aug 04 '19 at 14:01
@dbc, You could write an answer so I can mark it as correct with a little explanation? This code is okay for the practice specs, as long as I don't change the main XML, but 1 more thing I failed to mention is - "Sometimes the server will respond with an incomplete string and finish the rest of the string in the next response, You have to find a way to handle such situations" - this is a rough translation of the text since it's not native English, so basically I thought about `try/catch` and in the catch I would tell the program to start iterating over but that would create duplicates, right? — MicroDev92, Aug 04 '19 at 14:12
@MicroDev92 - `XmlSerializationHelper` was boilerplate code from a template fiddle. I deleted it since it wasn't used. The important code is in `XmlReaderExtensions.ReadRoots()` and `TestClass.WriteFiles()`. Anyway, since that meets your needs, I'll go ahead and add an answer. — dbc, Aug 04 '19 at 18:03
@MicroDev92 - Fiddle clean up a little more: https://dotnetfiddle.net/Ef8CfV. But the business about the stream being truncated and continued in another response is problematic, because `XmlReader` doesn't give easily digested accounts of its position in the stream. Could you please [edit] your question to add this requirement? I'm not sure this approach works given this added requirement. — dbc, Aug 04 '19 at 18:17
@dbc Don’t worry about it, I will look further into the positioning, You gave an answer to my initial question, I don’t want to nitpick, just thought to mention it since I forgot initially, go and add an answer! Just please - ReadRoots explanation would be helpful! — MicroDev92, Aug 04 '19 at 19:27
Do you know whether each XML fragment will appear on exactly one line? — dbc, Aug 04 '19 at 21:24

score 2 · Accepted Answer · answered Aug 04 '19 at 20:26

Your input stream consists of XML document fragments -- i.e. a series of XML root elements concatenated together.

You can read such a stream by using an XmlReader created with XmlReaderSettings.ConformanceLevel == ConformanceLevel.Fragment. From the docs:

Fragment

Ensures that the XML data conforms to the rules for a well-formed XML 1.0 document fragment.

This setting accepts XML data with multiple root elements, or text nodes at the top-level.
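
As a minimal illustration of this setting (a sketch, not part of the extension methods that follow), the helper below lists the names of the top-level elements in a string holding several roots back to back. The class name `FragmentDemo` and the method `ReadRootNames` are hypothetical names chosen for this example. With the default `ConformanceLevel.Document`, the same input would throw an `XmlException` at the second root.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentDemo
{
    // Hypothetical helper: returns the name of every root element found in a
    // string that contains several XML roots concatenated together.
    public static List<string> ReadRootNames(string xml)
    {
        var names = new List<string>();
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                // Depth 0 restricts the match to root elements, skipping children.
                if (reader.NodeType == XmlNodeType.Element && reader.Depth == 0)
                    names.Add(reader.Name);
            }
        }
        return names;
    }
}
```

For example, `FragmentDemo.ReadRootNames("<A><Foo /></A>\r\n<B />")` returns the two names `A` and `B`; the nested `Foo` element is ignored because it sits at depth 1.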

The following extension methods can be used for this task:

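using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlReaderExtensions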
{
    public static IEnumerable<XmlReader> ReadRoots(this XmlReader reader)
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                using (var subReader = reader.ReadSubtree())
                    yield return subReader;
            }
        }
    }

    public static void SplitDocumentFragments(Stream stream, Func<int, string> makeFileName, Action<string, IXmlLineInfo> onFileWriting, Action<string, IXmlLineInfo> onFileWritten)
    {
        using (var textReader = new StreamReader(stream, Encoding.UTF8, true, 4096, true))
        {
            SplitDocumentFragments(textReader, makeFileName, onFileWriting, onFileWritten);
        }
    }

    public static void SplitDocumentFragments(TextReader textReader, Func<int, string> makeFileName, Action<string, IXmlLineInfo> onFileWriting, Action<string, IXmlLineInfo> onFileWritten)
    {
        if (textReader == null || makeFileName == null)
            throw new ArgumentNullException();
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment, CloseInput = false };
        using (var xmlReader = XmlReader.Create(textReader, settings))
        {
            var lineInfo = xmlReader as IXmlLineInfo;
            var index = 0;

            foreach (var reader in xmlReader.ReadRoots())
            {
                var outputName = makeFileName(index);
                reader.MoveToContent();
                if (onFileWriting != null)
                    onFileWriting(outputName, lineInfo);
                using(var writer = XmlWriter.Create(outputName))
                {
                    writer.WriteNode(reader, true);
                }
                index++;
                if (onFileWritten != null)
                    onFileWritten(outputName, lineInfo);
            }
        }
    }
}

Then you would use it as follows:

var fileName = @"C:\\Users\\Aleksa\\Desktop\\testTxt.xml";
var outputPath = ""; // The directory in which to create your XML files.
using (var stream = File.OpenRead(fileName))
{
    XmlReaderExtensions.SplitDocumentFragments(stream,
                                               index => Path.Combine(outputPath, index.ToString() + ".xml"),
                                               (name, lineInfo) => 
                                               {
                                                   Console.WriteLine("Writing {0}, starting line info: LineNumber = {1}, LinePosition = {2}...", 
                                                                     name, lineInfo?.LineNumber, lineInfo?.LinePosition);
                                               },
                                               (name, lineInfo) => 
                                               {
                                                   Console.WriteLine("   Done.  Result: ");
                                                   Console.Write("   ");
                                                   Console.WriteLine(File.ReadAllText(name));
                                               });
}

And the output will look something like:

Writing 0.xml, starting line info: LineNumber = 1, LinePosition = 2...
   Done.  Result: 
   <?xml version="1.0" encoding="utf-8"?><TestNode active="1" lastName="l"><Foo /> </TestNode>
Writing 1.xml, starting line info: LineNumber = 2, LinePosition = 2...
   Done.  Result: 
   <?xml version="1.0" encoding="utf-8"?><TestNode active="2" lastName="l" />
Writing 2.xml, starting line info: LineNumber = 3, LinePosition = 2...
   Done.  Result: 
   <?xml version="1.0" encoding="utf-8"?><TestNode active="3" lastName="l"><Foo />  </TestNode>

... (others omitted).

Notes:

The method ReadRoots() reads through all the root elements of the XML fragment stream and, for each one, returns a nested reader restricted to just that specific root, by using XmlReader.ReadSubtree():

Returns a new XmlReader instance that can be used to read the current node, and all its descendants. ... When the new XML reader has been closed, the original reader is positioned on the EndElement node of the sub-tree.

This allows callers of the method to parse each root individually without worrying about reading past the end of the root and into the next one. Then the contents of each root node can be copied to an output XmlWriter using XmlWriter.WriteNode(XmlReader, true).
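
The positioning behaviour quoted above can be checked with a small sketch. `SubtreeDemo` and `PositionAfterFirstRoot` are hypothetical names introduced only for this illustration; the helper reads the first root through a nested reader and then reports where the outer reader ended up.

```csharp
using System.IO;
using System.Xml;

public static class SubtreeDemo
{
    // Hypothetical helper: consumes the first root via ReadSubtree() and
    // returns the outer reader's node type and name afterwards.
    public static string PositionAfterFirstRoot(string xml)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent(); // now on the first root's start tag
            using (var sub = reader.ReadSubtree())
            {
                while (sub.Read()) { } // the nested reader stops at the end of this root
            }
            // Closing the nested reader leaves the outer reader on the root's end tag.
            return reader.NodeType + ":" + reader.Name;
        }
    }
}
```

Calling `SubtreeDemo.PositionAfterFirstRoot("<A><Foo /></A><B />")` yields `EndElement:A`, so the next `Read()` on the outer reader moves on to `B` rather than re-reading anything inside `A`.
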
You can track approximate position in the file using the IXmlLineInfo interface which is implemented by XmlReader subclasses that parse text streams. If your document fragment stream is truncated for some reason, this can help identify where the error occurs.

See "getting the current position from an XmlReader" and "C# how can I debug a deserialization exception?" for details.
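
A rough sketch of how a truncated stream might be detected: the helper below counts the roots that were read completely and captures the `XmlException` raised when the input ends early. `XmlException` exposes `LineNumber` and `LinePosition`, which give the approximate location of the failure. The names `TruncationDemo` and `Scan` are hypothetical, and the tuple return type assumes C# 7 or later.

```csharp
using System.IO;
using System.Xml;

public static class TruncationDemo
{
    // Hypothetical helper: returns how many roots were complete and the
    // parser error (null when the whole input was well-formed).
    public static (int CompleteRoots, XmlException Error) Scan(string xml)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var complete = 0;
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read())
                {
                    if (reader.Depth != 0)
                        continue;
                    // A root is complete at its end tag, or at once if self-closing.
                    var selfClosed = reader.NodeType == XmlNodeType.Element && reader.IsEmptyElement;
                    if (selfClosed || reader.NodeType == XmlNodeType.EndElement)
                        complete++;
                }
            }
            return (complete, null);
        }
        catch (XmlException ex)
        {
            // ex.LineNumber and ex.LinePosition point near the truncation.
            return (complete, ex);
        }
    }
}
```

With the input `"<A><Foo /></A>\n<B>"` the result has `CompleteRoots == 1` and a non-null `Error`, so a caller could keep the unparsed tail and retry once more data arrives. That retry strategy is an assumption of this sketch, not something the code above implements.
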
If you are parsing a string st containing your XML fragments rather than reading directly from a file, you can pass a StringReader to SplitDocumentFragments():
```
using (var textReader = new StringReader(st))
{
    XmlReaderExtensions.SplitDocumentFragments(textReader,
    // Remainder as before
```
Do not read an XML stream using Encoding.ASCII: it replaces every non-ASCII character with a question mark, which corrupts any non-English text in the file. Instead, use Encoding.UTF8 and/or detect the encoding from the BOM or XML declaration.
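
The effect of the encoding choice can be seen with a small round-trip sketch. `EncodingDemo` and `RoundTrip` are hypothetical names used only here; the helper encodes a string to bytes and decodes it again with the same encoding.

```csharp
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: encodes text to bytes and decodes it back,
    // showing what survives a given encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

`EncodingDemo.RoundTrip("Aleksi\u0107", Encoding.ASCII)` returns `"Aleksi?"`, while the same call with `Encoding.UTF8` returns the original string unchanged.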

Demo fiddle here.

Thanks a lot, for Your help and patience, I really appreciate it! :) — MicroDev92, Aug 05 '19 at 09:22
