I have an XML string that is posted to an .ashx handler on the server. The string is built on the client side from several entries made on a form. Occasionally, users copy and paste text from other sources into the web form. When I try to load the string into an XmlDocument object using xmldoc.LoadXml(xmlStr), I get the following exception:

System.Xml.XmlException = {"'', hexadecimal value 0x0B, is an invalid character. Line 2, position 1."}

In debug mode I can see the rogue character (sorry, I'm not sure of its official title):

My question is: how can I sanitise the XML string before I attempt to load it into the XmlDocument object? Do I need a custom function to strip out these characters one by one, or can I use a native .NET 4 class to remove them?

[Image: rogue character in debug mode]

QFDev

2 Answers


Here is an example that strips invalid XML characters using Regex:

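 // Requires: using System.Text.RegularExpressions; using System.Xml;
 xmlString = CleanInvalidXmlChars(xmlString);
 XmlDocument xmlDoc = new XmlDocument();
 xmlDoc.LoadXml(xmlString);

 public static string CleanInvalidXmlChars(string text)
 {
   // Matches characters outside the XML 1.0 Char production.
   // In .NET regex, \x takes exactly two hex digits, so \u escapes are used for larger values.
   // Strings are UTF-16, so U+10000-U+10FFFF appear as surrogate pairs: only unpaired
   // surrogates are removed, valid pairs are left intact.
   string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
   return Regex.Replace(text, re, "");
 }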
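
For the "native .NET 4 class" part of the question, a regex-free variant can be sketched with XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair, which, as far as I know, were added in .NET Framework 4.0. The class name XmlSanitizer and the method name StripInvalidXmlChars are made up for this illustration; treat it as a minimal sketch rather than a drop-in implementation.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Hypothetical helper (not part of the original answer): keeps only
    // characters that XmlConvert accepts as valid XML 1.0 characters.
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                // Ordinary valid character (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD).
                sb.Append(c);
            }
            else if (i + 1 < text.Length &&
                     XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                // Valid surrogate pair: note the argument order is (lowChar, highChar).
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            // Anything else (control characters, lone surrogates, U+FFFE/U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

It would be used the same way as the regex version, for example xmlDoc.LoadXml(XmlSanitizer.StripInvalidXmlChars(xmlString)).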
Carlos Landeras

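A way to avoid the error without a separate sanitisation pass is to set the CheckCharacters flag in XmlReaderSettings to false. Note that, according to the XmlReaderSettings.CheckCharacters documentation, this only turns off checking of character entity references (such as &#x0B;); invalid characters that appear raw in text content are still rejected, and nothing is removed from the document.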

var xmlDoc = new XmlDocument();
var xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (var stringReader = new StringReader(xml)) {
    using (var xmlReader = XmlReader.Create(stringReader, xmlReaderSettings)) {
        xmlDoc.Load(xmlReader);
    }
}
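
To see what the setting changes in practice, the snippet above can be wrapped in a small probe and fed both an escaped and a raw 0x0B. The class CheckCharactersDemo and the method LoadsWithoutCharChecking are hypothetical names for this sketch; the expectation stated afterwards is based on the documented behaviour of CheckCharacters, so it is worth confirming on your own runtime.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CheckCharactersDemo
{
    // Hypothetical probe: returns true if the XML loads into an XmlDocument
    // when character checking is switched off on the reader.
    public static bool LoadsWithoutCharChecking(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        try
        {
            using (var stringReader = new StringReader(xml))
            using (var xmlReader = XmlReader.Create(stringReader, settings))
            {
                new XmlDocument().Load(xmlReader);
            }
            return true;
        }
        catch (XmlException)
        {
            // The reader still rejected something in the input.
            return false;
        }
    }
}
```

Going by the documentation, LoadsWithoutCharChecking("<r>&#x0B;</r>") should succeed, while LoadsWithoutCharChecking("<r>\u000B</r>") (a raw vertical tab, as in the question) should still fail, in which case the input needs to be cleaned before or while it is read.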
Charlie
  • Isn't it a little dangerous to leave illegal characters in XML? You should not save this as XML if it contains illegal characters; you may want to save it as plain text instead. Also, the docs say *"Character checking does not include checking for illegal characters in XML names or checking that all XML names are valid. These checks are part of conformance checking and are always performed."* [XmlWriterSettings.CheckCharacters Property](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlwritersettings.checkcharacters?view=netstandard-2.0) – sk md Jul 21 '20 at 03:16
  • @sk-md The question asked how to avoid an error when loading an xml document with invalid characters. If the document is very large, it would be more efficient to remove invalid characters while reading it, instead of doing a sanitization first. – Charlie Jul 24 '20 at 20:57
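
If the goal raised in the comments is to drop invalid characters while a large document is being read, rather than sanitising a full string first, one option is a TextReader wrapper that filters characters on the fly. The class XmlCharFilterReader below is a hypothetical sketch, not an existing .NET type. It relies on the base TextReader.Read(char[], int, int) calling the single-character Read(), so it favours simplicity over speed, and as a simplification it lets every surrogate through and leaves pair validation to the XML parser.

```csharp
using System.IO;
using System.Xml;

// Hypothetical wrapper: skips characters that are not valid in XML 1.0
// as they are pulled from the underlying reader.
public sealed class XmlCharFilterReader : TextReader
{
    private readonly TextReader _inner;

    public XmlCharFilterReader(TextReader inner)
    {
        _inner = inner;
    }

    // Simplification (assumption of this sketch): surrogates pass through unchecked.
    private static bool IsAllowed(char c)
    {
        return XmlConvert.IsXmlChar(c) || char.IsSurrogate(c);
    }

    public override int Peek()
    {
        int c;
        // Consume and discard invalid characters until a valid one (or end of input) is next.
        while ((c = _inner.Peek()) != -1 && !IsAllowed((char)c))
        {
            _inner.Read();
        }
        return c;
    }

    public override int Read()
    {
        int c;
        // Keep reading until a valid character or end of input is found.
        while ((c = _inner.Read()) != -1 && !IsAllowed((char)c))
        {
        }
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

A possible use, assuming the XML arrives in a string named xml as in the answer: wrap the source reader, as in new XmlCharFilterReader(new StringReader(xml)), and pass the wrapper to xmlDoc.Load. For a posted request body, a StreamReader over the request stream could be wrapped in the same way.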