0

As part of this answer, I determined that one of the things that can break a OLAP cube is feeding in values to it (in the dimension names/values/etc) that contain characters that are considered "InvalidXMLCharacters". Now I would like to filter out these values so that they never end up in the OLAP cubes I've building in SQL. Often I find myself importing this input data from one table into another. Something like the following:

INSERT INTO [dbo].[DestinationTableThatWillBeReferencedInMyOLAPCube]
SELECT TextDataColumn1, TextDataColumn2, etc... 
FROM [dbo].[SourceTableContainingColumnsWithValuesWithInvalidXMLCharacters]
WHERE XYZ...

Is there an efficient way to remove all "InvalidXMLCharacters" from my Columns in this query?

The obvious solution that comes to mind would be some sort of Regex, though from the previously stated linked posts, that might be quite complex, and I'm not sure of the performance implications around this.

Another idea I've had is to Convert the Columns to "XML" data type, but that will error if they contain invalid characters, not very helpful for removing them...

I've looked around and don't see many other cases where developers are trying to do exactly this, has thing been tackled any other way in another post that I haven't found?

David Rogers
  • 2,601
  • 4
  • 39
  • 84

1 Answers1

0

.NET CLR integration in SQL Server could be helpful. Here is a small c# example for you. You can use it as a starting point for your needs. Its most important line is using XmlConvert.IsXmlChar(ch) call to remove invalid XML characters.

c#

void Main()
{
    // https://www.w3.org/TR/xml/#charsets
    // ===================================
    // From xml spec valid chars:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]    
    // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.

    string content = "fafa\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content).Dump("Clean string");
    Console.WriteLine(IsValidXmlString(content)); // True   
}

// Define other methods and classes here
static string RemoveInvalidXmlChars(string text)
{
    return new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
}

static bool IsValidXmlString(string text)
{
    bool rc = true;
    try
    {
        XmlConvert.VerifyXmlChars(text);
    }
    catch
    {
        rc = false;
    }

    return rc;
}
Yitzhak Khabinsky
  • 18,471
  • 2
  • 15
  • 21