Perhaps my understanding of what is supposed to happen is faulty, so hopefully someone can correct my thought process here.

I am trying to process many large XML files that are constantly being sent to us with bad characters embedded in the text (0x1A). Unfortunately, it's our customer that is sending the files, so no matter how nicely we ask them to send valid XML, they consider it our problem.

Ultimately I wrote a subclass of StreamReader like so:

public class CleanTextReader : StreamReader
{
    private readonly ILog _logger;

    public CleanTextReader(Stream stream, ILog logger) : base(stream)
    {
        this._logger = logger;
    }

    public CleanTextReader(Stream stream) : this(stream, LogManager.GetLogger<CleanTextReader>())
    {
        //nothing to do here.
    }
    public override int Read(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.Read(buffer, index, count);

            var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();

            Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("Read(char[], int, int)", ex);
            throw;
        }
    }

    public override int ReadBlock(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.ReadBlock(buffer, index, count);
            var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();
            Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("ReadBlock(char[], int, int)", ex);
            throw;
        }
    }

    public override string ReadToEnd()
    {
        var chars = new char[4096];
        int len;
        var sb = new StringBuilder(4096);
        while ((len = Read(chars, 0, chars.Length)) != 0)
        {
            sb.Append(chars, 0, len);
        }
        return sb.ToString();
    }
}

Then I create the XmlReader like so:

using (var theCleanser = new CleanTextReader(myStreamedInput))
using (var theReader = XmlReader.Create(theCleanser))
{
    ...
    // do stuff with theReader
}

I have a unit test like so:

    [TestMethod]
    public void CleanTextReaderCleans0X1A()
    {
        //arrange
        var originalString = "The quick brown fox jumped over the lazy dog.";
        var badChars = new string(new[] {(char) 0x1a});
        var concatenated = originalString.Replace("jumped", badChars + "jumped" + badChars);

        //act
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(concatenated)))
        {
            using (var reader = new CleanTextReader(stream))
            {
                var newString = reader.ReadToEnd().Trim().Replace("  ", " ");
                //assert
                Assert.IsTrue(originalString.Equals(newString));
            }
        }
    }

This passes.

But when I try to parse an XML file with a 0x1A character in it, I still get a System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line XX, position XX.

Digging deeper into the CleanTextReader, I examined the Read(char[], int, int) method, since that appears to be the one the XmlReader calls. The original buffer contains the illegal characters, but the filteredBuffer does not, and after Buffer.BlockCopy() runs, both the buffer and the filteredBuffer appear to be free of the illegal characters.

Also of note: the line number and position in the exception point not to the first invalid character but to the second, so the reader appears to see and correct the first one, but only once.

So I'm scratching my head here. How does the XmlReader get the special characters? Is it reading from the buffer before control returns from the method? How do I fix this problem?
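
One possible explanation, offered as a sketch and not a confirmed diagnosis: the `count` argument of `Buffer.BlockCopy` is a number of bytes, not of array elements, and a `char` is two bytes, so `Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count)` copies back only the first `count / 2` characters. The filter also runs over the whole buffer starting at position 0, while the data that was just read sits at `index`. That would be consistent with the unit test passing (a 4096-char buffer and a short string) while a later bad character in a large file slips through. The version below follows the suggestion in the comments to derive from `TextReader` and wrap another reader, and filters in place only the slice that was filled. The names `XmlCharFilteringReader` and `CleanReaderDemo` are hypothetical, and keeping surrogate characters is an assumption about what the input may contain.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical name; wraps any TextReader instead of subclassing StreamReader.
public sealed class XmlCharFilteringReader : TextReader
{
    private readonly TextReader _inner;

    public XmlCharFilteringReader(TextReader inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        _inner = inner;
    }

    // Assumption: surrogates are kept, because IsXmlChar rejects a lone
    // surrogate char and blanking them would corrupt valid surrogate pairs.
    private static bool IsAllowed(char c)
    {
        return XmlConvert.IsXmlChar(c) || char.IsSurrogate(c);
    }

    // -1 signals end of input and is passed through untouched.
    private static int Clean(int c)
    {
        return (c < 0 || IsAllowed((char)c)) ? c : ' ';
    }

    public override int Peek() { return Clean(_inner.Peek()); }

    public override int Read() { return Clean(_inner.Read()); }

    public override int Read(char[] buffer, int index, int count)
    {
        int read = _inner.Read(buffer, index, count);
        // Filter in place, and only the slice that this call actually filled.
        for (int i = index; i < index + read; i++)
        {
            if (!IsAllowed(buffer[i])) buffer[i] = ' ';
        }
        return read;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class CleanReaderDemo
{
    // Parses a small document containing two 0x1A characters
    // and returns the text content of the root element.
    public static string Run()
    {
        string xml = "<root>a\u001Ab\u001Ac</root>";
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        using (var cleanser = new XmlCharFilteringReader(new StreamReader(stream, Encoding.UTF8)))
        using (var reader = XmlReader.Create(cleanser))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

Because `TextReader`'s default `ReadBlock`, `ReadLine`, and `ReadToEnd` are built on `Read`, overriding `Peek` and the two `Read` overloads should be enough here, and no temporary array or `BlockCopy` is needed.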

UPDATE

Per request, here is the stack trace I'm getting:

"System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line 84, position 38.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
   at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
   at System.Xml.XmlTextReaderImpl.ParseText()
   at System.Xml.XmlTextReaderImpl.ParseElementContent()
   at System.Xml.XmlTextReaderImpl.Read()
   at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
   at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
   at System.Xml.Linq.XElement.ReadElementFrom(XmlReader r, LoadOptions o)
   at System.Xml.Linq.XNode.ReadFrom(XmlReader reader)
   at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilityElements>d__2b.MoveNext() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel\\Loader.cs:line 138
   at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilities>d__18.MoveNext() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel\\Loader.cs:line 71
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at MyCompany.Importers.GroupEligibilityModel.Test.LoadingTests.GroupEligibilityFileWithBadCharactersProperlyCleansed() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel.Test\\LoadingTests.cs:line 118"
Jeremy Holovacs
  • Can you please include the call stack of the System.Xml.XmlException? – helb Aug 11 '15 at 19:58
  • Also, I cannot reproduce your problem, please provide a minimalistic example XML. Specify exactly which characters/bytes are in the file, for example by posting a snapshot from a hex editor. – helb Aug 11 '15 at 20:05
  • @helb I added the stack trace, but the xml file is a little more worrisome, being customer data. I'll see if I can make an xml file with the bad characters that fails. – Jeremy Holovacs Aug 11 '15 at 20:10
  • I'm having a hard time associating the exception call stack with the code you posted. Which lines throws exactly? – helb Aug 11 '15 at 20:15
  • I think the error is thrown when consuming a node using `var node = XNode.ReadFrom(theReader) as XElement;` I thought that was less relevant since if the filter were working properly, this wouldn't blow up. – Jeremy Holovacs Aug 11 '15 at 20:17
  • There is no XNode.ReadFrom() call in the code you posted... – helb Aug 11 '15 at 20:21
  • No that happens outside this process. That works perfectly fine if given valid xml. The problem is it's still seeing the 0x1A, and it shouldn't be able to. – Jeremy Holovacs Aug 11 '15 at 20:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/86732/discussion-between-helb-and-jeremy-holovacs). – helb Aug 11 '15 at 20:26
  • Have you tried a different encoding? It looks like you are expecting UTF-8, but that may not be what you are getting. – Mohair Aug 11 '15 at 20:32
  • @JeremyHolovacs Have you tried the alternatives [here](http://stackoverflow.com/a/13450902/932418) I think 3rd one may work for you. – Eser Aug 11 '15 at 20:49
  • Did you test that your methods are actually being called? Do not derive from StreamReader, that's a hack. Derive from TextReader and wrap a StreamReader. – usr Aug 11 '15 at 22:59
  • @Mohair the xml is UTF-8, so I don't think that's the problem. – Jeremy Holovacs Aug 12 '15 at 01:25
