Perhaps my understanding of what is supposed to happen is faulty, so hopefully someone can correct my thought process here.
I am trying to process many large XML files that are constantly being sent to us with bad characters (0x1A) embedded in the text. Unfortunately, it's our customer that is sending the files, so no matter how nicely we ask them to send valid XML, they consider it our problem.
Ultimately I wrote a subclass of StreamReader like so:
    public class CleanTextReader : StreamReader
    {
        private readonly ILog _logger;

        public CleanTextReader(Stream stream, ILog logger) : base(stream)
        {
            this._logger = logger;
        }

        public CleanTextReader(Stream stream) : this(stream, LogManager.GetLogger<CleanTextReader>())
        {
            //nothing to do here.
        }

        public override int Read(char[] buffer, int index, int count)
        {
            try
            {
                var rVal = base.Read(buffer, index, count);
                var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();
                Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
                return rVal;
            }
            catch (Exception ex)
            {
                this._logger.Error("Read(char[], int, int)", ex);
                throw;
            }
        }

        public override int ReadBlock(char[] buffer, int index, int count)
        {
            try
            {
                var rVal = base.ReadBlock(buffer, index, count);
                var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();
                Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
                return rVal;
            }
            catch (Exception ex)
            {
                this._logger.Error("ReadBlock(char[], int, int)", ex);
                throw;
            }
        }

        public override string ReadToEnd()
        {
            var chars = new char[4096];
            int len;
            var sb = new StringBuilder(4096);
            while ((len = Read(chars, 0, chars.Length)) != 0)
            {
                sb.Append(chars, 0, len);
            }
            return sb.ToString();
        }
    }
... then I create the XmlReader like so:
    using (var theCleanser = new CleanTextReader(myStreamedInput))
    using (var theReader = XmlReader.Create(theCleanser))
    {
        ...
        // do stuff with theReader
    }
I have a unit test like so:
    [TestMethod]
    public void CleanTextReaderCleans0X1A()
    {
        //arrange
        var originalString = "The quick brown fox jumped over the lazy dog.";
        var badChars = new string(new[] {(char) 0x1a});
        var concatenated = originalString.Replace("jumped", badChars + "jumped" + badChars);
        //act
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(concatenated)))
        {
            using (var reader = new CleanTextReader(stream))
            {
                // each bad character becomes a space, so collapse the resulting double spaces
                var newString = reader.ReadToEnd().Trim().Replace("  ", " ");
                //assert
                Assert.IsTrue(originalString.Equals(newString));
            }
        }
    }
...this passes.
But when I try to parse an XML file with a 0x1A character in it, I still get a System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line XX, position XX
Digging deeper into the CleanTextReader, I examined the Read(char[], int, int) method, as it seems to be the one the XmlReader calls. The original buffer has the illegal characters, but the filteredBuffer does not, and after Buffer.BlockCopy() runs, both the buffer and the filteredBuffer appear to be devoid of special characters.
Also of note: the line number and position in the exception refer not to the first invalid character but to the second, so the reader seems to see and correct the first one, but only once.
So I'm scratching my head here. How does the XmlReader get the special characters? Is it reading from the buffer before control returns from the method? How do I fix this problem?
UPDATE
Per request, here is the stack trace I'm getting:
"System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line 84, position 38.
at System.Xml.XmlTextReaderImpl.Throw(Exception e)
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.XmlTextReaderImpl.Read()
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
at System.Xml.Linq.XElement.ReadElementFrom(XmlReader r, LoadOptions o)
at System.Xml.Linq.XNode.ReadFrom(XmlReader reader)
at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilityElements>d__2b.MoveNext() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel\\Loader.cs:line 138
at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilities>d__18.MoveNext() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel\\Loader.cs:line 71
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
at MyCompany.Importers.GroupEligibilityModel.Test.LoadingTests.GroupEligibilityFileWithBadCharactersProperlyCleansed() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel.Test\\LoadingTests.cs:line 118" string