Need to remove illegal characters in XML string

Question

I have to process XML data in C#, but the input sometimes contains an illegal XML character. For example, this XML will not parse because it is invalid:

<xml>Another way to write a heart is <3</xml>

The XML parser throws an error because the input is not valid, which makes sense. However, I can't find a way to replace only that one "<" with "&lt;" so that the parser receives:

<xml>Another way to write a heart is &lt;3</xml>

Footnote: this can occur in any node of the XML, which can itself be pretty large, and as I said, it does not happen every time.

Is there a function that can handle this?

Difficult really since the whole point of escaping invalid characters in XML is to prevent the output being invalid... have you no control over the producer of the XML? Regex could help here, since you could check for valid tag names (tag names can't start with a number so the above example could be fixable), etc. — Charleh, Dec 09 '16 at 12:52
The problem is, you're not working with XML. You're working with strings of text that somewhat resemble XML but haven't been correctly constructed according to the rules for XML. As such, don't be looking at XML tools to solve this problem. As Charleh suggests, the best fix is to get whoever/whatever is providing this input to you to switch to providing genuine XML to you instead. — Damien_The_Unbeliever, Dec 09 '16 at 12:56

score 3 · Answer 1 · edited May 23 '17 at 11:52

I am copying this from a previous answer by @IgorKustov.

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:

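// Requires: using System; using System.Linq; using System.Xml;
void Main() {
    // Vertical tab, form feed and NUL: none of them are allowed in XML 1.0.
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

// Keeps only the characters that XmlConvert.IsXmlChar accepts.
static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

// VerifyXmlChars throws when the text contains a character that is not valid in XML.
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}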
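One caveat with the per-character filter above: as far as I can tell, XmlConvert.IsXmlChar judges each UTF-16 code unit on its own, so the two halves of a surrogate pair (any character above U+FFFF, such as an emoji) would be filtered out even though the pair is legal XML. The following is a minimal sketch of a variant that keeps well-formed pairs. The class name XmlSanitizer is hypothetical and not part of the original answer.

```csharp
using System.Text;
using System.Xml;

static class XmlSanitizer
{
    // Hypothetical helper: drops characters that XML 1.0 does not allow,
    // but copies well-formed surrogate pairs through unchanged.
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(ch, text[i + 1]))
            {
                // A valid high/low pair encodes a code point in U+10000..U+10FFFF.
                sb.Append(ch).Append(text[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if (XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }
}
```

Note that neither version touches "<": it is a perfectly legal XML character, so this approach does not repair the example from the question.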

To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method. Here is a small sample:

void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True

    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}

Update: Note that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This can matter when you store an encoded string in a database column with a length limit and validate the source string's length in your app against that limit.
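To make the length growth concrete, here is a small sketch that measures it. The class and method names (EncodeNameDemo, Measure) are made up for illustration; only XmlConvert.EncodeName and XmlConvert.DecodeName come from the framework.

```csharp
using System.Xml;

static class EncodeNameDemo
{
    // Hypothetical helper: reports the length before and after encoding,
    // and whether decoding restores the original string.
    public static (int SourceLength, int EncodedLength, bool RoundTrips) Measure(string source)
    {
        string encoded = XmlConvert.EncodeName(source);
        string decoded = XmlConvert.DecodeName(encoded);
        return (source.Length, encoded.Length, decoded == source);
    }
}
```

For example, the documented encoding of "Order Details" is "Order_x0020_Details": one space becomes seven characters, so a 13-character source grows to 19. A column sized for the source string would therefore need headroom for the encoded form.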

score 2 · Accepted Answer · answered Dec 09 '16 at 12:58

There is no general solution to this, because you have no way of determining whether:

<xml>You can use <b></b> to highlight stuff in HTML.</xml>.

is a "mistake" and should actually be encoded:

<xml>You can use &lt;b&gt;&lt;/b&gt; to highlight stuff in HTML.</xml>.

or not.

Thus, since there is no general solution, you can only use imperfect heuristics to detect such issues.

There is no built-in heuristic in the C# BCL, so you will have to roll your own or find an external library. A simple heuristic, for example, would be to find every < that is not followed by [/a-zA-Z0-9]+> and escape it.
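That simple heuristic could be sketched in C# roughly as follows. The names XmlHeuristics and EscapeStrayLessThan are invented for this illustration, and the pattern only recognises bare tags such as <b> or </b>. Tags with attributes, comments, CDATA sections and processing instructions would all be mangled, so treat it as a starting point rather than a general repair tool.

```csharp
using System.Text.RegularExpressions;

static class XmlHeuristics
{
    // Matches a '<' that is NOT followed by something shaped like "name>" or "/name>".
    // Assumption: real tags in the input never carry attributes or namespaces.
    private static readonly Regex StrayLessThan =
        new Regex(@"<(?![/a-zA-Z0-9]+>)", RegexOptions.Compiled);

    // Replaces every stray '<' with the entity "&lt;" and leaves tag-like '<' alone.
    public static string EscapeStrayLessThan(string text)
    {
        return StrayLessThan.Replace(text, "&lt;");
    }
}
```

Applied to the question's example, XmlHeuristics.EscapeStrayLessThan("<xml>Another way to write a heart is <3</xml>") leaves the two tags alone and turns "<3" into "&lt;3", after which the string parses. Text such as "<3>" would still be mistaken for a tag, which is exactly the kind of imperfection described above.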

Heuristics are intrinsically imperfect, so if you have the opportunity to fix the system creating those broken looks-like-XML-but-isn't files, this would be a much better solution.

Heinzi

This was the answer I used to solve it. At this point I can match the wrong XML characters with this regex: <(?![/a-zA-Z0-9]+>) I think I will just add more expressions when I encounter other situations. Thanks! – stijnpiron Dec 09 '16 at 13:10
Expanded the regex to match: <(?![/a-zA-Z0-9]*[_/a-zA-Z0-9]*>) – stijnpiron Dec 09 '16 at 13:52
@stijnpiron: `[/a-zA-Z0-9]*[_/a-zA-Z0-9]*` is semantically equivalent to `[_/a-zA-Z0-9]*`. – Heinzi Dec 09 '16 at 14:04
@Heinzi: No, it is not. The former restricts underscore from being in the first place, the latter doesn't. – Otto Abnormalverbraucher Nov 16 '17 at 12:24
@OttoAbnormalverbraucher: It would, if the first quantifier were a `+` (or missing) instead of a `*`. As it stands now, `_abc` is a valid match, consisting of 0 times the first character group and four times the second character group. – Heinzi Nov 16 '17 at 12:45
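
The equivalence argued in the last comment can be checked mechanically. A small sketch, with the two character-class sequences anchored so that the whole input must match; the class name RegexEquivalence and its members are hypothetical:

```csharp
using System.Text.RegularExpressions;

static class RegexEquivalence
{
    // Anchored copies of the two character-class sequences from the comments.
    private static readonly Regex TwoClasses = new Regex(@"^[/a-zA-Z0-9]*[_/a-zA-Z0-9]*$");
    private static readonly Regex OneClass = new Regex(@"^[_/a-zA-Z0-9]*$");

    public static bool TwoClassesMatches(string candidate)
    {
        return TwoClasses.IsMatch(candidate);
    }

    public static bool OneClassMatches(string candidate)
    {
        return OneClass.IsMatch(candidate);
    }
}
```

Both methods accept "_abc": the first class matches zero characters and the second class consumes all four, so the leading underscore is not excluded. Both reject "a-b", since "-" is in neither class.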

score 0 · Answer 3 · edited May 23 '17 at 12:24

Check this link: you could use a regex to repair the XML string. This is the (Java) code from the link:

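import java.util.regex.Matcher;
import java.util.regex.Pattern;
// StringEscapeUtils is from Apache Commons Lang; its import depends on the library version in use.

public static String repair(String xml) {
    // Capture the opening tag, the raw text content, and the closing tag separately.
    Pattern pattern = Pattern.compile("(<attribute name=\"[^\"]+\">)(.*?)(</attribute>)");
    Matcher m = pattern.matcher(xml);
    StringBuffer buf = new StringBuffer(xml.length() + xml.length() / 32);
    while (m.find()) {
        String escaped = StringEscapeUtils.escapeXml(m.group(2));
        // quoteReplacement keeps '$' and '\' in the content from being read as group references.
        m.appendReplacement(buf, Matcher.quoteReplacement(m.group(1) + escaped + m.group(3)));
    }
    m.appendTail(buf);
    return buf.toString();
}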

Depending on the size of your XML string, performance could be an issue. But, at least to my knowledge, there is no parser that can read XML with illegal characters and remove them.
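
Since the question is about C#, here is a rough C# counterpart of the same idea. It is a sketch, not a port endorsed by the linked source: the class name AttributeRepair is made up, and SecurityElement.Escape stands in for StringEscapeUtils.escapeXml. Like the Java version, it assumes the damaged text always sits inside <attribute name="..."> elements.

```csharp
using System.Security;
using System.Text.RegularExpressions;

static class AttributeRepair
{
    // Same pattern as the Java code: opening tag, raw content, closing tag.
    private static readonly Regex AttributeElement =
        new Regex("(<attribute name=\"[^\"]+\">)(.*?)(</attribute>)");

    // Escapes only the content between the tags and leaves the tags themselves untouched.
    public static string Repair(string xml)
    {
        return AttributeElement.Replace(xml, m =>
            m.Groups[1].Value + SecurityElement.Escape(m.Groups[2].Value) + m.Groups[3].Value);
    }
}
```

One thing to watch: content that is already escaped gets escaped again ("&lt;" becomes "&amp;lt;"), so this is only safe when the input is known to be consistently unescaped.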
