C# remove special characters from string

Question

I have the following string which represents an xml:

string xmlStr7 = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Response xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\r\n  <Market>en-US</Market>\r\n  <AnswerSet ID=\"0\">\r\n    <Answers>\r\n      <Answer ID=\"0\">\r\n        <Choices>\r\n          <Choice ID=\"2\" />\r\n          <Choice ID=\"8\" />\r\n        </Choices>\r\n      </Answer>\r\n      <Answer ID=\"1\">\r\n        <Choices>\r\n          <Choice ID=\"1\" />\r\n          <Choice ID=\"4\" />\r\n        </Choices>\r\n      </Answer>\r\n      <Answer ID=\"2\">\r\n        <Choices>\r\n          <Choice ID=\"1\" />\r\n          <Choice ID=\"7\" />\r\n        </Choices>\r\n      </Answer>\r\n      <Answer ID=\"3\">\r\n        <Choices>\r\n          <Choice ID=\"4\" />\r\n        </Choices>\r\n      </Answer>\r\n    </Answers>\r\n  </AnswerSet>\r\n</Response>";

I want to parse it into an XDocument object and in order to do so I must get rid of all the newlines and unnecessary spaces (otherwise I get a parsing error). I've removed the special characters manually and saw that the parsing works when I use the following string:

string xmlStr2 = "<?xml version=\"1.0\" encoding=\"utf-8\"?><Response xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><Market>en-US</Market><AnswerSet ID=\"0\"><Answers><Answer ID=\"0\"><Choices><Choice ID=\"2\" /><Choice ID=\"8\" /></Choices></Answer><Answer ID=\"1\"><Choices><Choice ID=\"1\" /><Choice ID=\"4\" /></Choices></Answer><Answer ID=\"2\"><Choices><Choice ID=\"1\" /><Choice ID=\"7\" /></Choices></Answer><Answer ID=\"3\"><Choices><Choice ID=\"4\" /></Choices></Answer></Answers></AnswerSet></Response>";

I use the following code to achieve this programmatically:

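// Requires: using System.Text.RegularExpressions;
public static string replaceSubString(string st)
{
    // Collapse any whitespace between a closing '>' and the next '<'.
    string pattern = ">\\s+<";
    string replacement = "><";
    Regex rgx = new Regex(pattern);
    string result = rgx.Replace(st, replacement);
    return result;
}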

By calling this method I expect to get a string that I will be able to parse to an XDocument object:

string newStr = replaceSubString(xmlStr7);
XDocument xmlDoc7 = XDocument.Parse(newStr);

However, this does not work. In addition, there seems to be a difference between this string and xmlStr2, the string from which I removed all the special characters manually (string.Compare reports them as different, and newStr is one character longer than xmlStr2). I can't see this difference by printing both strings; they seem identical.

The *only* problem with `xmlStr7` is the very first character (which doesn't even show up properly here). Remove that, and all is well. How are you reading this string in the first place? You don't need to remove the line breaks. — Jon Skeet, Apr 07 '14 at 19:55
I don't understand. The first character is "<" and it is identical in xmlStr7 and in xmlStr2 — user429400, Apr 07 '14 at 19:58
This is not a duplicate. In the previous question I was wondering why the parsing didn't work; here I'm asking why my replaceSubString method returns a different result than xmlStr2 — user429400, Apr 07 '14 at 20:00
`XDocument xmlDoc7 = XDocument.Parse(xmlStr7.Substring(1));` As Jon Skeet said in the comment, there is a special character which is not visible. Hence the error. You can copy the string value from debugger and then paste it in notepad, you will be able to see the difference of font on ` — Habib, Apr 07 '14 at 20:00
@user429400 you have a BOM char at the start of that string. See http://stackoverflow.com/questions/6784799/what-is-this-char-65279 — Blorgbeard, Apr 07 '14 at 20:00
And no, the first character *isn't* `<`. Copy and paste the string into notepad, turn the status bar on, and look at the column number as you gradually press "right arrow" to move the cursor along the string... — Jon Skeet, Apr 07 '14 at 20:01
Try this: `foreach (char c in xmlStr7.Take(4)) Console.WriteLine(string.Format("\"{0}\" = {1}", c, (int)c));` — Blorgbeard, Apr 07 '14 at 20:02

score 2 · Accepted Answer · answered Apr 07 '14 at 20:04

Your string starts with a byte order mark (U+FEFF).

Ideally, you shouldn't get that into your string to start with, but if you do have it, you should just strip it:

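string text = ...;
// Use an ordinal comparison: a culture-sensitive StartsWith may treat
// U+FEFF as an ignorable character and report a match for any string.
if (text.StartsWith("\ufeff", StringComparison.Ordinal))
{
    text = text.Substring(1);
}
XDocument doc = XDocument.Parse(text);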

Interestingly, XDocument.Load(Stream) can handle a BOM at the start of the data, but XDocument.Load(TextReader) can't. Presumably the expectation is that a reader will strip the BOM when it reads it anyway.
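As a rough illustration of the TextReader side of that, here is a minimal sketch. The class name TextReaderBomDemo and the helper CanLoad are made up for this example, and the XML is a shortened stand-in for the one in the question. A StringReader passes the U+FEFF character straight through to the parser, whereas a StreamReader over BOM-prefixed UTF-8 bytes consumes the BOM itself, so the first call is expected to fail and the second to succeed.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class TextReaderBomDemo
{
    // Hypothetical helper: true if XDocument.Load(TextReader) accepts
    // whatever characters the given reader yields.
    public static bool CanLoad(TextReader reader)
    {
        try
        {
            XDocument.Load(reader);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Builds a stream holding the UTF-8 BOM bytes followed by the XML text.
    public static MemoryStream Utf8WithBom(string xml)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(xml);
        var stream = new MemoryStream();
        stream.Write(bom, 0, bom.Length);
        stream.Write(body, 0, body.Length);
        stream.Position = 0;
        return stream;
    }

    public static void Main()
    {
        const string xml = "<Response><Market>en-US</Market></Response>";

        // The BOM is an ordinary character in the string, so the parser sees it.
        Console.WriteLine(CanLoad(new StringReader("\ufeff" + xml)));

        // StreamReader detects the BOM and does not hand it on as a character.
        using (var reader = new StreamReader(Utf8WithBom(xml)))
        {
            Console.WriteLine(CanLoad(reader));
        }
    }
}
```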

It's not clear where your data is coming from, but if you have it in a binary format somewhere (e.g. as a byte[] or a Stream) then I suggest you just load that instead of converting it to a string and then parsing the string. That will remove this problem and save you from the possibility of applying the wrong encoding.
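A minimal sketch of that suggestion, assuming the data is available as a byte[]. The class BinaryXmlLoader and the method LoadFromBytes are hypothetical names, and the bytes are built by hand here only to simulate BOM-prefixed UTF-8 input with line breaks left in place.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class BinaryXmlLoader
{
    // Hypothetical helper: parse XML straight from raw bytes, leaving
    // BOM handling and encoding detection to XDocument.Load(Stream).
    public static XDocument LoadFromBytes(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            return XDocument.Load(stream);
        }
    }

    public static void Main()
    {
        // Simulated input: UTF-8 BOM followed by indented, multi-line XML.
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n" +
            "<Response>\r\n  <Market>en-US</Market>\r\n</Response>");
        byte[] data = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, data, 0, bom.Length);
        Buffer.BlockCopy(body, 0, data, bom.Length, body.Length);

        XDocument doc = LoadFromBytes(data);
        Console.WriteLine(doc.Root.Element("Market").Value);
    }
}
```

With this approach there is no intermediate string, so there is nothing to strip and no regex pass over the whitespace is needed.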
