{"'\u0004', hexadecimal value 0x04, is an invalid character

Question

I am trying to convert a file that contains some special characters to XML, but the conversion fails because of those characters in the data. I already have the regex code below, but it is still not working. Please help.

The code I have tried:

string filedata = @"D:\readwrite\test11.txt";
string input = ReadForFile(filedata);
string re1 = @"[^\u0000-\u007F]+";
string re5 = @"\p{Cs}";
data = Regex.Replace(input, re1, "");   
data = Regex.Replace(input, re5, "");

XmlDocument xmlDocument = new XmlDocument();
try
{
   xmlDocument = (XmlDocument)JsonConvert.DeserializeXmlNode(data);
   var Xdoc = XDocument.Parse(xmlDocument.OuterXml);
}
catch (Exception ex)
{
   Console.WriteLine(ex);
}

https://stackoverflow.com/questions/4183766/replacing-all-non-ascii-characters-except-right-angle-character-in-c-sharp — Soundararajan, Sep 01 '18 at 06:25
Are these special characters only at the beginning of the file? If so, check whether the file uses an encoding other than UTF-8. It is most probably a byte order mark, in which case you should use string inputwithoutspecialchars = System.Text.Encoding.UTF8.GetBytes(input) and use inputwithoutspecialchars during deserialization. — Soundararajan, Sep 01 '18 at 06:31
What is the content of the `ReadForFile` method? As the question @Soundarajan linked to points out, the source of the problem could be an incorrectly specified input encoding causing the input file to be misinterpreted. — Tom W, Sep 01 '18 at 07:05
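
The encoding concern raised in these comments can be checked with a small sketch like the one below. The body of `ReadForFile` is not shown in the question, so `ReadAllTextDetectingBom` is a hypothetical stand-in, and UTF-8 as the fallback encoding is an assumption. It shows only how to let `StreamReader` honour a byte order mark instead of passing it through as stray characters.

```csharp
using System.IO;
using System.Text;

static class FileReading
{
    // Hypothetical replacement for the question's ReadForFile (its real body is unknown).
    // Assumes UTF-8 when the file has no byte order mark; if a BOM is present,
    // StreamReader switches to the encoding it indicates and does not return the BOM as text.
    public static string ReadAllTextDetectingBom(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A byte order mark would not explain a U+0004 character by itself, so this only rules out one possible cause.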

score 1 · Answer 1 · answered Sep 01 '18 at 06:43

0x04 is a transmission control character (EOT, End of Transmission) and cannot appear in XML text. XmlDocument is right to reject it if it really does appear in your data. This suggests that the regex you have doesn't do what you think it does. If I'm right, the regex will find the first instance of one or more of those invalid characters at the beginning of a line and replace it, but not all of them. The real question for me is why this non-text 'character' appears in data intended as XML in the first place.

I have other questions. I've never seen JsonConvert.DeserializeXmlNode before - I had to look up what it does. Why are you using a JSON function against the root of a document which presumably therefore contains no JSON? Why are you then taking that document, converting it back to a string, and then creating an XDocument from it? Why not just create an XDocument to start with?
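
If the control character really is in the data and has to be dropped, a minimal sketch of one way to do it follows. Note that the question's pattern `[^\u0000-\u007F]+` matches only characters outside the ASCII range, so U+0004 is never touched by it, and the second `Regex.Replace` call starts again from `input`, discarding the result of the first. The names `XmlTextCleaner` and `StripInvalidXmlChars` are hypothetical. The pattern covers only the C0 control characters that XML 1.0 forbids, not unpaired surrogates or U+FFFE/U+FFFF.

```csharp
using System;
using System.Text.RegularExpressions;

static class XmlTextCleaner
{
    // C0 control characters that XML 1.0 forbids: everything below U+0020
    // except tab (U+0009), line feed (U+000A) and carriage return (U+000D).
    private static readonly Regex InvalidXmlControlChars =
        new Regex(@"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", RegexOptions.Compiled);

    // Hypothetical helper: returns the input with every such character removed.
    public static string StripInvalidXmlChars(string input)
    {
        return InvalidXmlControlChars.Replace(input, string.Empty);
    }
}

static class Program
{
    static void Main()
    {
        // Sample JSON text with an embedded U+0004, made up for illustration.
        string dirty = "{\"name\":\"abc\u0004def\"}";
        string clean = XmlTextCleaner.StripInvalidXmlChars(dirty);
        Console.WriteLine(clean); // prints {"name":"abcdef"}
    }
}
```

The cleaned string could then be passed to JsonConvert.DeserializeXmlNode in place of `data`. Removing the character silently may hide a problem in whatever produced the file, so finding its source is still worthwhile.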
