C# remove not UTF-8 supported values from XML for XslCompiledTransform.Transform

Question

Every time I run XslCompiledTransform.Transform, I get an exception caused by invalid characters. One such character is, for example, "&#xFFFE;".

How can I remove all invalid characters in C#?

XmlConvert.IsXmlChar doesn't help: it checks every single char of the string, and the individual characters of the reference "&#xFFFE;" ('&', '#', 'x', 'F', ...) are all valid on their own.

I always get the exception in XslCompiledTransform.Transform, but only if "&#xFFFE;" is in the XML document.

Here is the code:

string document = "<?xml version=\"1.0\" encoding=\"utf-8\"?><FirstTag><Second><Third>;&#xFFFE;</Third></Second></FirstTag>";

public static string Clean(string document)
{
    XmlWriterSettings writerSettings = new XmlWriterSettings();

    XsltArgumentList argsList;
    document = RemoveXmlNotSupportedSigns(document);

    string result = "<?xml version=\"1.0\" encoding=\"utf-8\"?>";
    try
    {
        using (StringReader sr = new StringReader(document))
        {
            using (StringWriter sw = new StringWriter())
            {
                using (XmlReader xmlR = XmlReader.Create(sr))
                {
                    using (XmlWriter xmlW = XmlWriter.Create(sw, writerSettings))
                    {
                        Uri uri = new Uri(string.Format(CultureInfo.InvariantCulture, "{0}clean.xsl", Uri), UriKind.Relative);
                        argsList = new XsltArgumentList();

                        using (Stream xslSheet = Application.GetResourceStream(uri).Stream)
                        {
                            // Init resolver with the URL of the resource path, without the file name
                            ResourceResolver resolver = new ResourceResolver(Uri);

                            using (XmlReader xmlReader = XmlReader.Create(xslSheet))
                            {
                                XsltSettings settings = new XsltSettings();
                                settings.EnableDocumentFunction = true;
                                // Transform
                                XslCompiledTransform.Load(xmlReader, settings, resolver);

                                XslCompiledTransform.Transform(xmlR, argsList, xmlW, resolver);
                            }
                        }
                    }
                }

                result = result + sw.ToString();
            }
        }
        return result;
    }
    catch (Exception Ex)
    {
        return result;
    }

}

If you're using an `XmlReader` to read the XML, see [How to stop XMLReader throwing Invalid XML Character Exception](https://stackoverflow.com/q/26357994/3744182). [XMLReader Invalid XML Character Exception](https://stackoverflow.com/q/55651676/3744182) might also work. If neither works, can you [edit] your question to share a [mcve]? See: [ask]. — dbc, Nov 04 '21 at 23:42
0xFFFE is a "Byte Order Mark" (or BOM) indicating that the file is encoded as UTF-16, little-endian. I do not believe that it's valid in a well-formed XML file. If you read (or stream) the file into your C# program with the standard Framework classes, using the correct encoding, the BOM will get swallowed. Then you can pass the string or the stream to whatever XML code you are using. — Flydog57, Nov 04 '21 at 23:45
While asking an XSLT question you need to provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example): (1) Input XML. (2) Your logic, and XSLT that tried to implement it. (3) Desired output, based on the sample XML in the #1 above. (4) XSLT processor and its compliance with the XSLT standards: 1.0, 2.0, or 3.0. — Yitzhak Khabinsky, Nov 05 '21 at 01:11

Martin Honnen · Accepted Answer · 2021-11-05T11:32:02.437

If you look at https://www.w3.org/TR/xml/#charsets, you will find that the allowed characters include the range [#xE000-#xFFFD], which clearly excludes #xFFFE. So this character cannot be part of a well-formed XML 1.0 document. In your code sample it is not XslCompiledTransform or XSLT rejecting it; it is simply the underlying parser, XmlReader.
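
As an illustration of the character ranges in that section of the spec, here is a minimal sketch of the XML 1.0 `Char` production as a predicate over code points. The class and method names (`XmlCharRanges`, `IsXml10Char`) are made up for this example and are not part of any library.

```csharp
public static class XmlCharRanges
{
    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsXml10Char(int codePoint) =>
        codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD ||
        (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
        (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
        (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}
```

With this predicate, `IsXml10Char(0xFFFD)` is true while `IsXml10Char(0xFFFE)` and `IsXml10Char(0xFFFF)` are false, which is why the parser rejects `&#xFFFE;`.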

If you want to process such malformed input with XmlReader, you can use XmlReaderSettings with CheckCharacters = false and then, I think, eliminate such characters by checking each one with e.g. XmlConvert.IsXmlChar.
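
A minimal sketch of such a per-character filter is shown below. The names `XmlTextCleaner` and `StripInvalidXmlChars` are hypothetical. One assumption worth flagging: as far as I know, `XmlConvert.IsXmlChar(char)` does not accept individual surrogate code units, so the sketch keeps well-formed surrogate pairs explicitly in order not to drop characters outside the Basic Multilingual Plane.

```csharp
using System.Text;
using System.Xml;

public static class XmlTextCleaner
{
    // Returns the input without characters that XML 1.0 does not allow.
    public static string StripInvalidXmlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed surrogate pair encodes U+10000..U+10FFFF: keep both units.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            // Everything else (e.g. U+FFFE, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

For example, `XmlTextCleaner.StripInvalidXmlChars("a\uFFFEb")` yields `"ab"`. This only helps on values the reader has already decoded; applied to the raw document string it would not touch a reference such as `&#xFFFE;`.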

With the help of XmlWrappingReader from the MvpXml library (https://github.com/keimpema/Mvp.Xml.NetStandard) you could implement a filtering XmlReader:

using System.Linq;
using System.Xml;
using Mvp.Xml.Common; // XmlWrappingReader (namespace assumed from the Mvp.Xml library)

public class MyWrappingReader : XmlWrappingReader
{
    public MyWrappingReader(XmlReader baseReader) : base(baseReader) { }

    // True for the node types whose values get cleaned.
    private bool IsCleanableNode =>
        base.NodeType == XmlNodeType.Text ||
        base.NodeType == XmlNodeType.CDATA ||
        base.NodeType == XmlNodeType.Attribute;

    public override string Value => IsCleanableNode ? CleanString(base.Value) : base.Value;

    public override string ReadString()
    {
        return IsCleanableNode ? CleanString(base.ReadString()) : base.ReadString();
    }

    public override string GetAttribute(int i)
    {
        return CleanString(base.GetAttribute(i));
    }

    public override string GetAttribute(string localName, string namespaceUri)
    {
        return CleanString(base.GetAttribute(localName, namespaceUri));
    }

    public override string GetAttribute(string name)
    {
        return CleanString(base.GetAttribute(name));
    }

    // Drops every char that XmlConvert.IsXmlChar rejects.
    private static string CleanString(string input)
    {
        // GetAttribute returns null when the attribute does not exist.
        if (input == null) return null;
        return string.Concat(input.Where(c => XmlConvert.IsXmlChar(c)));
    }
}

Then use that reader to filter your input, and XslCompiledTransform should work on the cleaned XML. For example, the following runs fine:

        string document = "<?xml version=\"1.0\" encoding=\"utf-8\"?><FirstTag><Second att1='value&#xFFFE;'><Third>a&#xFFFE;</Third></Second></FirstTag>";

        // Identity transformation: copies every node and attribute unchanged.
        string xsltIdentity = @"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'><xsl:template match='@* | node()'><xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy></xsl:template></xsl:stylesheet>";

        using (StringReader sr = new StringReader(document))
        {
            // CheckCharacters = false lets the inner reader accept &#xFFFE;, and the wrapper strips it.
            using (XmlReader xr = new MyWrappingReader(XmlReader.Create(sr, new XmlReaderSettings() { CheckCharacters = false })))
            {
                using (StringReader xsltSrReader = new StringReader(xsltIdentity))
                {
                    using (XmlReader xsltReader = XmlReader.Create(xsltSrReader))
                    {
                        XslCompiledTransform processor = new XslCompiledTransform();
                        processor.Load(xsltReader);
                        processor.Transform(xr, null, Console.Out);
                        Console.WriteLine();
                    }
                }
            }
        }
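
If the goal is to clean the document string before any XML API sees it, one alternative is to drop numeric character references that resolve to a disallowed code point. The following is only a sketch of that idea and is not taken from the approach above: the names `CharRefCleaner` and `RemoveInvalidCharRefs` are made up, and the regular expression is applied blindly, so it would also alter literal text that merely looks like a reference inside CDATA sections or comments.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class CharRefCleaner
{
    // Matches hexadecimal (&#xFFFE;) and decimal (&#65534;) character references.
    private static readonly Regex CharRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));", RegexOptions.CultureInvariant);

    // Removes references whose code point is outside the XML 1.0 Char production.
    public static string RemoveInvalidCharRefs(string xml)
    {
        return CharRef.Replace(xml, m =>
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;
            NumberStyles style = isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;

            // Values that overflow or fall outside the allowed ranges are dropped.
            bool parsed = int.TryParse(digits, style, CultureInfo.InvariantCulture, out int cp);
            return parsed && IsXml10Char(cp) ? m.Value : string.Empty;
        });
    }

    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static bool IsXml10Char(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

For instance, `CharRefCleaner.RemoveInvalidCharRefs("<a>x&#xFFFE;y&#65;</a>")` returns `"<a>xy&#65;</a>"`. Invalid characters that appear literally rather than as references are untouched by this step and would still need a per-character filter.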

Yes, I know. My question is how to remove all these chars from the XML document beforehand. — gacaba3982, Nov 05 '21 at 10:20
@gacaba3982, it is not an XML document if it contains such a character, so don't expect to be able to use XML APIs to do that. — Martin Honnen, Nov 05 '21 at 10:21
@gacaba3982, I have edited the answer and shown an example to preprocess the XML input using XmlWrappingReader wrapping an XmlReader with XmlReaderSettings not checking characters and then cleaning any text node, CDATA section or attribute value from any character not being accepted by XmlConvert.IsXmlChar. — Martin Honnen, Nov 05 '21 at 12:17
