I have a very large XML file, so I am using XmlReader in C#. The problem is that some of the content contains XML-like markers that should not be processed by XmlReader.

<Narf name="DOH">Mark's test of <newline> like stuff</Narf>

This is legacy data, so it cannot be refactored... (of course)

I have tried ReadInnerXml but get the whole node. I have tried ReadElementContentAsString but get an exception saying 'newline' is not closed.

// Neither of these lines copes with markup in the content
ms.mText = reader.ReadElementContentAsString();
XElement el = XNode.ReadFrom(reader) as XElement;
ms.mText = el.ToString();

What I want is ms.mText to equal "Mark's test of <newline> like stuff" and not an exception.

System.Xml.XmlException was unhandled
  HResult=-2146232000
  LineNumber=56
  LinePosition=63
  Message=The 'newline' start tag on line 56 position 42 does not match the end tag of 'Narf'. Line 56, position 63.
  Source=System.Xml

The question flagged as a duplicate did not solve the problem, because it requires changing the input to remove the problem before using the data. As stated above, this is legacy data.

  • Do you have a list of elements such as in the xml that you can just string replace with say – Yuriy Faktorovich Mar 27 '19 at 15:14
  • The problem is that the data you have posted is not XML, so obviously XmlReader is going to reject it. If you know what's XML in your data and what's not XML, I suggest creating a preprocessor that strips or converts your not-XML to XML before passing it to XmlReader. – Dour High Arch Mar 27 '19 at 15:19
  • How about creating a `XmlReaderSettings` object and then handle the `ValidationEventHandler`; you can then handle these problems as clearly this is not valid XML... – Trevor Mar 27 '19 at 15:20
  • `XmlTextReader` can take a `TextReader` as a ctor parameter, so you could implement your own `TextReader` which does the replacements in a streaming fashion, without having to pre-process the whole document in one go. – canton7 Mar 27 '19 at 15:21
  • @Çöđěxěŕ I was hopeful this would work, but the same exception occurs before getting to the handler. – Mark Manyen Mar 27 '19 at 15:40
  • @Dour High Arch I know that I want the contents of the "Narf" tag and no other... – Mark Manyen Mar 27 '19 at 15:46
  • Consider [parsing it as a string](https://stackoverflow.com/a/26248614/22437) instead of XML. Note that since you have multiple tag names you will have to introduce additional ReadStates to the provided answer. – Dour High Arch Mar 27 '19 at 15:50
  • Although every rule ever says to not use regexes for parsing jobs like these, consider them anyway: replacing `` with `<![CDATA[` and `` with `]]>` (either directly in the source data, or with a wrapping reader) essentially "fixes" the mistake in the original while still leaving the rest up to the XML parser. – Jeroen Mostert Mar 27 '19 at 16:15
  • Possible duplicate of [How to parse invalid (bad / not well-formed) XML?](https://stackoverflow.com/questions/44765194/how-to-parse-invalid-bad-not-well-formed-xml) – Progman Mar 27 '19 at 18:12
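
The CDATA idea from the comments could look roughly like the sketch below. It is only an illustration under loud assumptions: `Narf` elements never nest, are never self-closing, and their content never contains `]]>`. The names `CdataWrap` and `Wrap` are made up for this example. It also works on a string held in memory, so for a very large file it would have to be applied piecewise (for instance per line, if each element sits on one line) or folded into a wrapping reader.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CdataWrap
{
    // Assumption: Narf elements never nest, are never written as <Narf/>,
    // and their content never contains "]]>".
    private static readonly Regex NarfContent = new Regex(
        @"(<Narf\b[^>]*>)(.*?)(</Narf>)", RegexOptions.Singleline);

    // Wraps the content of every Narf element in a CDATA section so the
    // XML parser treats "<newline>" as plain text.
    public static string Wrap(string xml)
    {
        return NarfContent.Replace(xml, "$1<![CDATA[$2]]>$3");
    }

    public static void Main()
    {
        string raw = "<Narf name=\"DOH\">Mark's test of <newline> like stuff</Narf>";
        XElement el = XElement.Parse(Wrap(raw));
        Console.WriteLine(el.Value);
    }
}
```

One side effect to keep in mind: inside CDATA, entity references such as `&amp;` are no longer decoded, so content that already uses them would come out differently.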

1 Answer

I figured it out based on responses here! Not elegant, but works...

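   using System.IO;

   // TextReader that rewrites known non-XML markers before XmlReader sees them.
   public class TextWedge : TextReader
   {
      // Look-ahead window; must be at least as long as the longest marker.
      private const int LookAhead = 50;
      private readonly StreamReader mSr;
      private string mBuffer = "";

      public TextWedge(string filename)
      {
         mSr = File.OpenText(filename);
         Fill();
      }

      // Top the buffer up to the window size (or to end of file), then fix markers.
      private void Fill()
      {
         int ic;
         bool grew = false;
         while (mBuffer.Length < LookAhead && (ic = mSr.Read()) != -1)
         {
            mBuffer += (char)ic;
            grew = true;
         }
         if (grew)
         {
            // Run through the battery of non-xml tags.
            // Using "&lt;newline&gt;" here instead would make XmlReader return the literal "<newline>".
            mBuffer = mBuffer.Replace("<newline>", "[br]");
         }
      }

      public override int Peek()
      {
         // Next character Read() would return, or -1 at end of input.
         return mBuffer.Length > 0 ? mBuffer[0] : -1;
      }

      public override int Read()
      {
         if (mBuffer.Length == 0)
         {
            return -1;
         }
         int iRet = mBuffer[0];
         mBuffer = mBuffer.Remove(0, 1);
         // Refill before the next character is handed out, so a marker is never split.
         Fill();
         return iRet;
      }

      protected override void Dispose(bool disposing)
      {
         if (disposing)
         {
            mSr.Dispose();
         }
         base.Dispose(disposing);
      }
   }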
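
For anyone wondering how a wedge like this plugs into XmlReader, here is a minimal, self-contained sketch. It is not the exact class above: `EscapingWedge`, `Demo` and `ReadNarf` are hypothetical names, the wrapper takes any `TextReader` (so it can be tried with a `StringReader`), and it escapes the marker to `&lt;newline&gt;` rather than substituting `[br]`, on the assumption that the literal `<newline>` text asked for in the question is what is wanted.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical variant of TextWedge that escapes the marker instead of replacing it.
public sealed class EscapingWedge : TextReader
{
    private const string Marker = "<newline>";         // only marker assumed here
    private const string Escaped = "&lt;newline&gt;";  // XmlReader decodes this to "<newline>"
    private const int LookAhead = 50;                  // must be >= Marker.Length

    private readonly TextReader mInner;
    private string mBuffer = "";

    public EscapingWedge(TextReader inner)
    {
        mInner = inner;
        Fill();
    }

    // Top the buffer up to the window size (or end of input), then escape markers.
    private void Fill()
    {
        int ic;
        bool grew = false;
        while (mBuffer.Length < LookAhead && (ic = mInner.Read()) != -1)
        {
            mBuffer += (char)ic;
            grew = true;
        }
        if (grew) mBuffer = mBuffer.Replace(Marker, Escaped);
    }

    public override int Peek()
    {
        return mBuffer.Length > 0 ? mBuffer[0] : -1;
    }

    public override int Read()
    {
        if (mBuffer.Length == 0) return -1;
        int ch = mBuffer[0];
        mBuffer = mBuffer.Remove(0, 1);
        Fill();
        return ch;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) mInner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Demo
{
    // Returns the text content of the first Narf element in the given XML-like text.
    public static string ReadNarf(string xml)
    {
        using (var wedge = new EscapingWedge(new StringReader(xml)))
        using (var reader = XmlReader.Create(wedge))
        {
            reader.ReadToFollowing("Narf");
            return reader.ReadElementContentAsString();
        }
    }

    public static void Main()
    {
        // prints: Mark's test of <newline> like stuff
        Console.WriteLine(ReadNarf("<Narf name=\"DOH\">Mark's test of <newline> like stuff</Narf>"));
    }
}
```

For a file on disk, `new StringReader(xml)` would be swapped for `File.OpenText(filename)`. The per-character string handling is slow for very large inputs, so treat this as a starting point rather than a tuned implementation.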