Using C#'s XmlReader on slightly malformed XML

Question

I'm trying to use C#'s XmlReader on a large series of XML files. They are all well-formed except for a select few (unfortunately I'm not in a position to have those changed, because it would break a lot of other code).

The errors only come from one specific part of these offending XML files. It's OK to just skip that part, but I don't want to stop reading the rest of the XML file.

The bad parts look like this:

 <InterestingStuff>
  ...
    <ErrorsHere OptionA|Something = "false" OptionB|SomethingElse = "false"/>
    <OtherInterestingStuff>
    ...
    </OtherInterestingStuff>
</InterestingStuff>

So really if I could just ignore invalid tags, or ignore the pipe symbol then I would be ok.

Calling XmlReader.Skip() when I see the name "ErrorsHere" doesn't work. Apparently the reader has already read a bit ahead by then and throws the exception.

TLDR: How do I skip so I can read in the XML file above, using the XmlReader?

Edit:

Some people suggested just replacing the '|' symbol, but the idea of XmlReader is to not load the entire file and to traverse only the parts you want. Since I'm reading directly from files, I cannot afford to read in entire files, replace all instances of '|' and then read the parts again :).

Replacing the | sign with - before loading the reader could solve the problem – Prashant Lakhlani, Jul 11 '11 at 10:50
How are you reading the info into the XmlReader? Are you reading from a stream? – Jethro, Jul 11 '11 at 10:50
If you know the error in advance, can't you patch the content of the source before parsing it? In general, though, you should correct the source XML... or not call it XML (I imagine you are dependent on someone else?) – Steve B, Jul 11 '11 at 10:51
Good suggestions, but I don't want to read in the entire file (which is why I use XmlReader and not XmlDocument.Load()) because this could be costly and I don't need all the info in the files. To clarify, I read directly from disk using XmlReader.Create(filepath), and yes, I'm depending on other people, so I can't do anything about the source. – Roy T., Jul 11 '11 at 11:10
+1 Steve B - `XmlReader` reads XML, so convert the non-XML input to valid XML in an isolated method, keeping the rest of your code clean. – C.Evenhuis, Jul 11 '11 at 11:12

score 4 · Accepted Answer · answered Jul 11 '11 at 11:27

I've experimented a bit with this in the past.

In general the input simply has to be well-formed. An XmlReader goes into an unrecoverable error state as soon as the basic XML rules are broken. It is easy to avoid schema validation, but that's not relevant here.

Your only option is to clean the input. That can be done in a streaming manner (with a custom Stream or TextReader), but it will require a light form of parsing. If you don't have pipe symbols in valid positions, it's easy.
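A minimal sketch of that streaming clean-up, for the easy case where no pipe symbol is ever legitimate. `PipeReplacingReader` and `XmlCleaningDemo` are hypothetical names, not part of .NET, and the choice of '_' as the replacement character is an assumption. The wrapper rewrites characters as XmlReader pulls them, so the file is never loaded as a whole. Note that it also rewrites pipes inside text and attribute values; if those must survive, you need the light parsing mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: wraps a TextReader and replaces every '|'
// with another character while the text streams through.
public sealed class PipeReplacingReader : TextReader
{
    private readonly TextReader _inner;
    private readonly char _replacement;

    public PipeReplacingReader(TextReader inner, char replacement = '_')
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _replacement = replacement;
    }

    private char Fix(char c) => c == '|' ? _replacement : c;

    public override int Peek()
    {
        int c = _inner.Peek();
        return c < 0 ? c : Fix((char)c);
    }

    public override int Read()
    {
        int c = _inner.Read();
        return c < 0 ? c : Fix((char)c);
    }

    // XmlReader pulls its input in blocks, so the buffer is patched in place.
    public override int Read(char[] buffer, int index, int count)
    {
        int n = _inner.Read(buffer, index, count);
        for (int i = index; i < index + n; i++)
            buffer[i] = Fix(buffer[i]);
        return n;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class XmlCleaningDemo
{
    // Collects the element names from the cleaned input, in document order.
    public static List<string> ListElementNames(TextReader source)
    {
        var names = new List<string>();
        using (var cleaned = new PipeReplacingReader(source))
        using (var xml = XmlReader.Create(cleaned))
        {
            while (xml.Read())
                if (xml.NodeType == XmlNodeType.Element)
                    names.Add(xml.Name);
        }
        return names;
    }
}
```

For a file on disk, pass something like `new StreamReader(path)` as the source. The attribute then arrives as `OptionA_Something`, which is a valid XML name, so Skip() or normal attribute access works on the `ErrorsHere` element.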

answered Jul 11 '11 at 11:27

H H

Hey Henk, this seems to me like the best solution. I also tried just loading the entire file and replacing the pipes, but this made parsing take twice as long (even though I used a memory stream to store the loaded data). Extending a Stream or TextReader seems like a good way to keep things performant. – Roy T. Jul 11 '11 at 11:52

score 1 · Answer 2 · answered Jul 11 '11 at 11:16

XmlReader is strict. Any non-conformance causes an error.

So no, you can't do that unless you write your own XML implementation. Fixing up the malformed data is probably easier.

answered Jul 11 '11 at 11:16

Marc Gravell

score 1 · Answer 3 · answered Jul 11 '11 at 11:21

I once had a similar situation (with HTML files, not XML files). I ended up running a regular expression over each HTML file before feeding it into my processing pipeline, to delete the malformed parts. It came in handy and was easier than struggling with the API. :)
