I have some data (produced by a legacy application) that I know is invalid XML, for example:

<document>
  <dossier>
    <answers>
      <answer>Ref=some <text> here</answer>
    </answers>
  </dossier>
</document>

I want to load this into an XmlDocument, but loading currently fails because the parser treats "<text>" as a tag. Please note that this is just an example. The general problem is that answers can contain unescaped angle brackets in any order, with arbitrary characters in between.

What options do I have?

matthewk
  • You simply can't load invalid XML into an `XmlDocument`. Maybe you should try to escape the angle brackets yourself? – svick Mar 26 '12 at 12:09
  • Will the badly formed part always exist only in known parts of the document (e.g. in xpath: /document/dossier/answers/answer) or could it appear all over the place? – Rob Levine Mar 26 '12 at 12:10
  • Nitpick: this isn't "invalid" XML, this is "badly formed" XML (i.e. not well-formed). "Valid"/"Invalid" are really terms reserved for whether the XML is valid against a given schema. "Well formed"/"Badly formed" are terms describing whether the XML-like text can really be considered XML at all. – Rob Levine Mar 26 '12 at 12:13
  • Rob Levine, it will only appear in /document/dossier/answers/answer. – matthewk Mar 26 '12 at 13:17

3 Answers

You could, for example, use a regex to escape the content inside `<answer>` and `</answer>` before loading the text into an XmlDocument.

Match with something like `<answer>(.*?)</answer>` in single-line mode, so that empty and multi-line answers are also matched, and replace the captured group with its XML-escaped version.
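The idea can be sketched as follows. This is a minimal illustration in Python rather than .NET, chosen only because it is short to run. The helper name `escape_answers` and the `RAW` sample are made up for the example. It assumes the legacy application never escapes anything itself (already-escaped entities such as `&amp;` would be escaped a second time) and that no answer contains a literal `</answer>`. In .NET the same shape should be achievable with `Regex.Replace` and a `MatchEvaluator`.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Lazy match of everything between <answer> and the next </answer>.
# DOTALL lets an answer span several lines; .*? also accepts empty answers.
ANSWER_RE = re.compile(r"(<answer>)(.*?)(</answer>)", re.DOTALL)


def escape_answers(raw: str) -> str:
    """Escape &, < and > inside every <answer> element of badly formed text."""
    def fix(match: re.Match) -> str:
        # Assumption: the content is completely unescaped in the source data.
        return match.group(1) + escape(match.group(2)) + match.group(3)

    return ANSWER_RE.sub(fix, raw)


# Hypothetical sample, copied from the question.
RAW = """<document>
  <dossier>
    <answers>
      <answer>Ref=some <text> here</answer>
    </answers>
  </dossier>
</document>"""

root = ET.fromstring(escape_answers(RAW))
# The parser turns the entities back into the original characters.
print(root.find("dossier/answers/answer").text)  # Ref=some <text> here
```

After this pre-processing step the text is well-formed, so it loads normally and each answer can be addressed by path.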

Israel Lot
  • Although Andrew Bullock is technically correct, this might just work for me, given that the "XML" is pretty simple, that answer tags only appear in this position, and that it's fairly unlikely that any of the answers contain `<answer>` or `</answer>`. I need to do some more testing before I accept, but thanks for the idea. – matthewk Mar 26 '12 at 13:03
  • You are welcome. I think the same way: the important thing is to get it done. If you know in advance what's wrong with the XML, you can always fix it by treating it as a text-processing problem. If you need to guess what's wrong, then you have a problem. In your case I believe it will do the trick. – Israel Lot Mar 26 '12 at 13:21

Use the HtmlAgilityPack. It can handle invalid or malformed markup, and it does a pretty good job.

Andrew Bullock

The simplest thing to do would be to wrap the offending XML in a CDATA section. That way, the resulting XML document could look like this:

<wrapper>
    <![CDATA[
        <document>
          <dossier>
            <answers>
              <answer>Ref=some <text> here</answer>
            </answers>
          </dossier>
        </document>
    ]]>
</wrapper>

More details about CDATA can be found here.
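A rough sketch of the wrapping step, again in Python only for brevity, with a made-up helper name `wrap_in_cdata`. It also shows the main consequence of this approach: the parser sees the wrapped document as a single block of text, so none of the inner elements become nodes.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(raw: str) -> str:
    """Wrap arbitrary text in a <wrapper> element holding a CDATA section."""
    # "]]>" would end the CDATA section early, so split it across two sections.
    safe = raw.replace("]]>", "]]]]><![CDATA[>")
    return "<wrapper><![CDATA[" + safe + "]]></wrapper>"


# Hypothetical sample, based on the question.
RAW_DOC = (
    "<document><dossier><answers>"
    "<answer>Ref=some <text> here</answer>"
    "</answers></dossier></document>"
)

wrapper = ET.fromstring(wrap_in_cdata(RAW_DOC))
print(len(wrapper))             # 0, the wrapper has no child elements
print(wrapper.text == RAW_DOC)  # True, the original text is kept verbatim
```

Loading succeeds, but a path such as /document/dossier/answers/answer cannot be queried, because the content is plain text inside the wrapper.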

Nikola Anusev
  • Thanks, but as I said, the XML was generated by a legacy application, which I cannot change. – matthewk Mar 26 '12 at 13:39
  • @matthewk You don't need to change the XML that is being generated. When saving this XML to your XmlDocument, you could just wrap it in a CDATA section. Or is that not acceptable either? – Nikola Anusev Mar 26 '12 at 13:42
  • Sorry, perhaps I didn't make it clear in the original question, but I need to be able to process the XML after I have loaded it into the XmlDocument. – matthewk Mar 26 '12 at 14:36