100

I have a string that contains invalid XML characters. How can I escape (or remove) invalid XML characters before I parse the string?

Igor Kustov
  • 3,228
  • 2
  • 34
  • 31
Alireza Noori
  • 14,961
  • 30
  • 95
  • 179
  • 2
    Could you provide more context? A sample input and a sample expected output. Also what do you intend to do with the output. – Darin Dimitrov Nov 30 '11 at 18:41
  • 5
    Are you writing the XML? Or are you trying to read XML that actually isn't XML? – Marc Gravell Nov 30 '11 at 18:42
  • 3
    Use an XmlWriter, it will escape the invalid characters for you – Thomas Levesque Nov 30 '11 at 18:43
  • 2
    @alireza you'll get more useful answers if you answer the questions people are asking you (for more information) here in the comments... – Marc Gravell Nov 30 '11 at 19:05
  • I'm sorry. I was away for a few hours. Please read the question that led to this one: http://stackoverflow.com/questions/8330619/xmldocument-loadxml-throws-an-exception-of-type-comexception/833100 You'll get all the info you need there – Alireza Noori Nov 30 '11 at 22:20
  • I should say that I'm reading XML data from a web page and it's in German and it contains some illegal characters in it – Alireza Noori Nov 30 '11 at 22:24
  • 1
    There is an ambiguity in the question: where is the string in XML? This matters because character restrictions differ depending on whether it is an XML value, an XML name, or something else. It also matters which invalid characters you want to protect against. Is it just the 5 escaped characters (', ", &, < and >), or do you also have to deal with non-printable characters, for instance? – David Burg Jan 25 '18 at 19:52
  • @ThomasLevesque XmlWriter will throw an exception when it encounters illegal characters, unless you change the default CheckCharacters setting to false. Then it will escape illegal characters. – Suncat2000 Mar 18 '22 at 13:18

8 Answers

127

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is available in Silverlight too. Here is a small sample:

void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
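
The validity check above relies on catching an exception. A minimal sketch of a non-throwing alternative follows, assuming the same XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair methods; the class name XmlCharCheck is hypothetical, and returning false for null is a choice made here to mirror the try/catch version (VerifyXmlChars throws on null):

```csharp
using System.Xml;

public static class XmlCharCheck
{
    // Returns true when every character (or surrogate pair) is a legal XML 1.0 character.
    // No exception is thrown for invalid input.
    public static bool IsValidXmlString(string text)
    {
        if (text == null)
            return false; // assumption: treat null as invalid, like the try/catch version

        for (int i = 0; i < text.Length; i++)
        {
            if (XmlConvert.IsXmlChar(text[i]))
                continue;

            // IsXmlSurrogatePair takes the low surrogate first, then the high surrogate.
            if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
            {
                i++; // skip the low surrogate that was just validated
                continue;
            }

            return false;
        }

        return true;
    }
}
```

This stops at the first invalid character, so it should stay cheap even on large inputs.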

To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method. Here is a small sample:

void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True

    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}

Update: Note that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This can matter when you store the encoded string in a database column with a length limit and validate the source string's length in your app against that limit.
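
To make the length growth concrete, here is a small sketch of what the encoding looks like. The wrapper class EncodeNameDemo and its method names are hypothetical; only XmlConvert.EncodeName and XmlConvert.DecodeName are framework calls:

```csharp
using System.Xml;

public static class EncodeNameDemo
{
    // Characters that are not allowed in an XML name are replaced by _xHHHH_ escapes,
    // e.g. Encode("Order Detail") returns "Order_x0020_Detail".
    public static string Encode(string s)
    {
        return XmlConvert.EncodeName(s);
    }

    // Decoding the encoded form should give back the original string.
    public static bool RoundTrips(string s)
    {
        return XmlConvert.DecodeName(XmlConvert.EncodeName(s)) == s;
    }
}
```

Each escaped character grows from one character to seven, and even an ordinary space is escaped, because the rules applied are those for XML names rather than XML text.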

Igor Kustov
  • 3,228
  • 2
  • 34
  • 31
  • `XmlConvert.VerifyXmlChars` doesn't throw an exception if the argument contains invalid characters, it returns the null string (and returns the argument if all contained characters are valid). Try just `return XmlConvert.VerifyXmlChars (text) != null`. – Matt Enright Sep 25 '13 at 20:20
  • 3
    @Matt, no, it does - ["If any of the characters are not valid xml characters, an XmlException is thrown with information on the first invalid character encountered."](http://msdn.microsoft.com/en-us/library/system.xml.xmlconvert.verifyxmlchars.aspx) – Igor Kustov Oct 25 '13 at 06:36
  • 3
    @IgorKustov My bad! The return value documentation seems to contradict that, thanks for catching me out. – Matt Enright Oct 25 '13 at 16:10
  • 3
    Careful not to use XmlConvert.EncodeName if the string is meant for an XML value. The XML name restrictions are stricter than the XML value restrictions, and the name encoding will lead to unnecessary, unexpected escaping. – David Burg Jan 25 '18 at 19:56
  • I think this solution, which throws and catches exceptions, is going to cost performance when parsing large XML files – arik Feb 19 '18 at 07:45
  • 1
    @arik my code serves only a demonstration purpose: to show the state of an XML string before and after transformation. Obviously, in your code you don't need to validate it. – Igor Kustov Feb 19 '18 at 07:56
  • @IgorKustov, sure, I was just making a point about the XmlConvert.VerifyXmlChars(text) method, which throws exceptions on invalid text and therefore costs performance. I think the solution BLUEPIXY suggested below is better IMO – arik Feb 19 '18 at 08:50
73

Use SecurityElement.Escape

using System;
using System.Security;

class Sample {
  static void Main() {
    string text = "Escape characters : < > & \" \'";
    string xmlText = SecurityElement.Escape(text);
//output:
//Escape characters : &lt; &gt; &amp; &quot; &apos;
    Console.WriteLine(xmlText);
  }
}
gsamaras
  • 71,951
  • 46
  • 188
  • 305
BLUEPIXY
  • 39,699
  • 7
  • 33
  • 70
20

If you are writing XML, just use the classes provided by the framework to create it. You won't have to bother with escaping or anything.

Console.Write(new XElement("Data", "< > &"));

Will output

<Data>&lt; &gt; &amp;</Data>

If you need to read an XML file that is malformed, do not use a regular expression. Instead, use the Html Agility Pack.

Community
  • 1
  • 1
Pierre-Alain Vigeant
  • 22,635
  • 8
  • 65
  • 101
11

The RemoveInvalidXmlChars method provided by Irishman does not support surrogate characters. To test it, use the following example:

static void Main()
{
    const string content = "\v\U00010330";

    string newContent = RemoveInvalidXmlChars(content);

    Console.WriteLine(newContent);
}

This returns an empty string but it shouldn't! It should return "\U00010330" because the character U+10330 is a valid XML character.

To support surrogate characters, I suggest using the following method:

public static string RemoveInvalidXmlChars(string text)
{
    if (string.IsNullOrEmpty(text))
        return text;

    int length = text.Length;
    StringBuilder stringBuilder = new StringBuilder(length);

    for (int i = 0; i < length; ++i)
    {
        if (XmlConvert.IsXmlChar(text[i]))
        {
            stringBuilder.Append(text[i]);
        }
        else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
        {
            stringBuilder.Append(text[i]);
            stringBuilder.Append(text[i + 1]);
            ++i;
        }
    }

    return stringBuilder.ToString();
}
Francois C
  • 1,274
  • 2
  • 12
  • 14
9

Here is an optimized version of the RemoveInvalidXmlChars method above. It avoids creating a new array on every call, which would stress the GC unnecessarily:

public static string RemoveInvalidXmlChars(string text)
{
    if (text == null)
        return text;
    if (text.Length == 0)
        return text;

    // a bit complicated, but avoids memory usage if not necessary
    StringBuilder result = null;
    for (int i = 0; i < text.Length; i++)
    {
        var ch = text[i];
        if (XmlConvert.IsXmlChar(ch))
        {
            result?.Append(ch);
        }
        else if (result == null)
        {
            result = new StringBuilder();
            result.Append(text.Substring(0, i));
        }
    }

    if (result == null)
        return text; // no invalid xml chars detected - return original text
    else
        return result.ToString();

}
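
The optimized version above drops valid surrogate pairs, just like the LINQ-based original. The two ideas can be combined. The following is only a sketch, assuming the same XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair calls used in the earlier answers; the class name XmlSanitizer is hypothetical:

```csharp
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Removes characters that are not legal in XML 1.0, keeping valid surrogate pairs.
    // The StringBuilder is allocated only once the first invalid character is found.
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        StringBuilder result = null;
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                result?.Append(ch);
            }
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], ch))
            {
                // Valid pair: keep both halves and skip the low surrogate.
                result?.Append(ch).Append(text[i + 1]);
                i++;
            }
            else if (result == null)
            {
                // First invalid character: copy everything seen so far, then skip it.
                result = new StringBuilder(text.Length);
                result.Append(text, 0, i);
            }
            // Later invalid characters are skipped simply by not appending them.
        }

        return result == null ? text : result.ToString();
    }
}
```

When the input contains no invalid characters, the original string instance is returned and nothing is allocated.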
Akira Yamamoto
  • 4,685
  • 4
  • 42
  • 43
Urs Meili
  • 618
  • 7
  • 19
  • What is this `?.` syntax ? in line `result?.Append(ch);` ? – JB. Aug 28 '17 at 09:14
  • 2
    `?.` is the `Null-Conditional Operator`. https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/null-conditional-operators – Pure.Krome Sep 13 '17 at 03:13
1

If you are only escaping invalid XML characters for a string that is used inside an XML tag, you could do something simple like this.

This works when you aren't using an XML library.

public string EscapeXMLCharacters (string target)
{
    return
        target
            .Replace("&", "&amp;")
            .Replace("<", "&lt;")
            .Replace(">", "&gt;")
            .Replace("\"", "&quot;")
            .Replace("'", "&apos;");
}

You could then call it like so:

public string GetXMLBody(string content)
{
    return @"<input>" + EscapeXMLCharacters(content) + "</input>";
}
Alexander Ryan Baggett
  • 2,347
  • 4
  • 34
  • 61
1
// Replace invalid characters with empty strings.
   Regex.Replace(inputString, @"[^\w\.@-]", ""); 

The regular expression pattern [^\w\.@-] matches any character that is not a word character, a period, an @ symbol, or a hyphen. A word character is any letter, decimal digit, or punctuation connector such as an underscore. Any character that matches this pattern is replaced by String.Empty, which is the string defined by the replacement pattern. To allow additional characters in user input, add those characters to the character class in the regular expression pattern. For example, the regular expression pattern [^\w\.@-\\%] also allows a percentage symbol and a backslash in an input string.

Regex.Replace(inputString, @"[!@#$%_]", "");

See also:

Removing Invalid Characters from XML Name Tag - RegEx C#

Here is a function to remove the characters from a specified XML string:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace XMLUtils
{
    class Standards
    {
        /// <summary>
        /// Strips non-printable ascii characters 
        /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
        /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
        /// </summary>
        /// <param name="tmpContents">contents</param>
        /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
        /// <returns>The contents with every match removed</returns>
        private string StripIllegalXMLChars(string tmpContents, string XMLVersion)
        {    
            string pattern = String.Empty;
            switch (XMLVersion)
            {
                case "1.0":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
                    break;
                case "1.1":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                    break;
                default:
                    throw new Exception("Error: Invalid XML Version!");
            }

            // Note: both patterns match literal "#x..." hex text, not raw characters.
            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);

            // Return the cleaned string; strings are immutable, so the caller's
            // argument is never modified in place.
            return regex.Replace(tmpContents, String.Empty);
        }
    }
}
Community
  • 1
  • 1
Siva Charan
  • 17,940
  • 9
  • 60
  • 95
0
string XMLWriteStringWithoutIllegalCharacters(string UnfilteredString)
{
    if (UnfilteredString == null)
        return string.Empty;

    return XmlConvert.EncodeName(UnfilteredString);
}

string XMLReadStringWithoutIllegalCharacters(string FilteredString)
{
    if (FilteredString == null)
        return string.Empty;

    return XmlConvert.DecodeName(FilteredString);
}

These simple methods replace invalid characters with an encoded form (_xHHHH_) that is accepted in XML, and decode that form back to the original characters when reading.


To write string use XMLWriteStringWithoutIllegalCharacters(string UnfilteredString).
To read string use XMLReadStringWithoutIllegalCharacters(string FilteredString).

Marco Concas
  • 1,665
  • 20
  • 25