4

Getting an error creating XML

Msg 6841, Level 16, State 1, Line 26 FOR XML could not serialize the data for node 'value' because it contains a character (0x000C) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

Figured out how to fix that with TSQL

My question is how to prevent it

This data is loaded via .NET C#
Already doing some clean up like:
- remove leading and trailing spaces
- concatenate multiple spaces to a single space

What characters will break FOR XML?

How to identify and remove those characters in .NET C#?
On input before the data even gets into SQL.

The XML is generated with a TSQL FOR XML (not via .NET).

Found this link Valid characters in XML

Unicode code points in the following code point ranges are always valid in XML 1.1 documents:[2] U+0001–U+D7FF, U+E000–U+FFFD: this includes most C0 and C1 control characters, but excludes some (not all) non-characters in the BMP (surrogates, U+FFFE and U+FFFF are forbidden); U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

I don't know how to test for U+0001–U+D7FF.

The answer is has more than the question.
As stated in the question I was already performing other input filtering.
I just wanted to add xml.
In the actual app will filter all out all control characters as this user data should not have any control characters.
The win1252 part is to align with data stored int SQL char (byte).

Went the 1.0 character set as what was breaking my FOR XML is allowed in 1.1.
Also only apply only up to Int16 as char is Int16 in .NET.

public static string RemoveDiatricsXMLsafe(string unicodeString, bool toLower, bool toWin1252)
{
    // cleary could just create the Regex and validXMLsingle once in the ctor
    unicodeString = Regex.Replace(unicodeString, @"\s{2,}", " ");
    //U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
    //U+0020–U+D7FF, U+E000–U+FFFD    
    Int16[] validXMLsingle = new Int16[4];
    validXMLsingle[0] = Int16.Parse("0020", System.Globalization.NumberStyles.HexNumber);
    validXMLsingle[1] = Int16.Parse("0009", System.Globalization.NumberStyles.HexNumber);
    validXMLsingle[2] = Int16.Parse("000A", System.Globalization.NumberStyles.HexNumber);
    validXMLsingle[3] = Int16.Parse("000D", System.Globalization.NumberStyles.HexNumber);

    unicodeString = unicodeString.Trim();
    Int16 u16;
    StringBuilder sb = new StringBuilder();
    bool validXML = false;
    if (toLower) unicodeString = unicodeString.ToLowerInvariant();
    foreach (char c in unicodeString.Normalize(NormalizationForm.FormD)) // : NormalizationForm.FormKD) breaks 
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                //do nothing
                break;
            default:
                u16 = (Int16)c;
                validXML = false; 
                if      (u16 >= validXMLsingle[0]) validXML = true;
                else if (u16 == validXMLsingle[1]) validXML = true;
                else if (u16 == validXMLsingle[2]) validXML = true;
                else if (u16 == validXMLsingle[3]) validXML = true;
                if (validXML) sb.Append(c);
                break;
        }
    }
    if (!toWin1252)
    {
        return sb.ToString();
    }
    else
    {
        Encoding win1252 = Encoding.GetEncoding("Windows-1252");
        Encoding unicode = Encoding.Unicode;

        // Convert the string into a byte array. 
        byte[] unicodeBytes = unicode.GetBytes(sb.ToString());

        // Perform the conversion from one encoding to the other. 
        byte[] win1252Bytes = Encoding.Convert(unicode, win1252, unicodeBytes);

        // Convert the new byte[] into a char[] and then into a string. 
        char[] win1252Chars = new char[win1252.GetCharCount(win1252Bytes, 0, win1252Bytes.Length)];
        win1252.GetChars(win1252Bytes, 0, win1252Bytes.Length, win1252Chars, 0);
        return new string(win1252Chars);
        //string win1252String = new string(win1252Chars);
        //return win1252String;
    }
}
paparazzo
  • 44,497
  • 23
  • 105
  • 176
  • possible duplicate of [XML Clean up (remove invalid characters from attribute value)](http://stackoverflow.com/questions/11241977/xml-clean-up-remove-invalid-characters-from-attribute-value) – Chris Gessler Apr 17 '13 at 12:25
  • @ChrisGessler That link shows a method to remove invalid character but it does not identify which characters are invalid. – paparazzo Apr 17 '13 at 12:37
  • 4 votes to close? It is not a duplicate. The other question does NOT identify WHICH characters to remove. And does not have an accepted answer. – paparazzo Apr 18 '13 at 00:03
  • If your data contain invalid characters, then removing them will damage your data. So don't remove them. Just base64-encode the whole string. – John Saunders Mar 11 '14 at 23:08

2 Answers2

1

I'm guessing that '0x000C' = '<' (http://www.wimpyplayer.com/docs/howto/special_characters.html). So don't you just need to XML escape the data you put into each node before you insert it?

This is answered here: String escape into XML

public static string XmlEscape(string unescaped)
{
    XmlDocument doc = new XmlDocument();
    var node = doc.CreateElement("root");
    node.InnerText = unescaped;
    return node.InnerXml;
}

public static string XmlUnescape(string escaped)
{
    XmlDocument doc = new XmlDocument();
    var node = doc.CreateElement("root");
    node.InnerXml = escaped;
    return node.InnerText;
}
Community
  • 1
  • 1
satnhak
  • 9,407
  • 5
  • 63
  • 81
1

On the .Net side, you should be able to use a regular expression to see whether you've got an odd bird:

var reg = new Regex("[^[\u0001-\ud7ff\ue000-\ufffd)]");
if(reg.IsMatch(...)
{
    // do what you want if you find something you don't want
}
JAQFrost
  • 1,431
  • 8
  • 8