1

I copied and pasted some source code into my program with a text editor. I need to confirm that the source code begins with "int main()", so I compared line with "int main()", but the comparison always returned false.

I decided to split the string into characters and found something weird.

So the string line holds "int main()", which is the text that was pasted into the text editor. You would think a and b would contain the same characters, but they don't.

I'm honestly not sure where that quotation mark at the beginning is coming from. The original string didn't contain it, and the debugger doesn't show it (it would display "\"int main()\"" otherwise). What is happening here?

Edit: I tried line = line.Trim(), but the character is still there. Apparently it's a special Unicode character, the zero width no-break space. How can I remove it from my string?
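The character-by-character inspection described above can be reproduced with a small sketch like the one below. The names `CharDump` and `Describe` are hypothetical, and the sample input in the usage note simply assumes the pasted text starts with U+FEFF, as reported here.

```csharp
using System;
using System.Linq;

public static class CharDump
{
    // Returns each UTF-16 code unit of the input as "U+XXXX (decimal)",
    // which makes invisible characters such as U+FEFF easy to spot.
    public static string[] Describe(string text)
    {
        return text
            .Select(c => string.Format("U+{0:X4} ({1})", (int)c, (int)c))
            .ToArray();
    }
}
```

Calling `CharDump.Describe("\uFEFFint main()")` should give an array whose first entry is `U+FEFF (65279)`, followed by the visible characters.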

TtT23
  • 1
    Include `line` assignment statement in your post. I think it should be - `line="\"int main()\"";` – KV Prajapati Oct 15 '12 at 04:17
  • @AVD that's just the thing, it's not. The debugger tooltip that you see above is exactly what is passed inside line. – TtT23 Oct 15 '12 at 04:18
  • 5
    Did you copy and paste from the internet? That Unicode character code (65279) corresponds to a zero-width no-break space (U+FEFF), which would be difficult to discern visually. See http://www.fileformat.info/info/unicode/char/feff/index.htm – Mike Zboray Oct 15 '12 at 04:19
  • You shouldn't pass in `"int main()"`, just type in `int main()` and it will use that as a string. If you pass in the quotation marks, they become part of the string itself. – SimpleVar Oct 15 '12 at 04:19
  • Check this for more explanation - http://stackoverflow.com/questions/6784799/what-is-this-char-65279 – rs. Oct 15 '12 at 04:23
  • Your problem starts where the zero-width char is inserted into the string, not when you want to pull it out. Check why it is there in the first place. – SimpleVar Oct 15 '12 at 04:24
  • @YoryeNathan It is copied and pasted from some C source code. There's nothing I can do to remove it in the first place. I need to handle it in my application. – TtT23 Oct 15 '12 at 04:25
  • Then you can just do `line = line.Replace("\uFEFF", "")`, and likewise remove every other character that you don't allow. – SimpleVar Oct 15 '12 at 14:39

4 Answers

2

65279 looks like the decimal representation of a UTF-16 BOM (U+FEFF). Is it possible that the way you're reading the data into "line" failed to remove it?

Eric
  • Ahh, now I see what it is. As said above, I'm reading a string that has been copied and pasted from some source code into my text editor, so I have no way to alter the original text. Is there a way to remove the BOM from my string? – TtT23 Oct 15 '12 at 04:28
  • The safest way to remove it is just something like `if (line.Length > 0 && line[0] == '\uFEFF') line = line.Substring(1);` – Eric Oct 15 '12 at 04:36
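The first-character check suggested in this thread can be wrapped in a small helper, sketched below. `BomHelper` and `StripLeadingBom` are hypothetical names, and the null/empty guard is an assumption added so the indexer never runs on an empty line.

```csharp
using System;

public static class BomHelper
{
    private const char Bom = '\uFEFF';

    // Removes a single leading U+FEFF if present.
    // Null and empty input are returned unchanged, so line[0] is never
    // evaluated on a string without characters.
    public static string StripLeadingBom(string line)
    {
        if (!string.IsNullOrEmpty(line) && line[0] == Bom)
            return line.Substring(1);
        return line;
    }
}
```

With this in place, `BomHelper.StripLeadingBom(line) == "int main()"` should hold for the pasted text, assuming the BOM is the only stray character.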
1

Could you set line to line.Trim()? It's hard to tell what might be going on without seeing how line is set.

Update based on the BOM character: try line = line.Trim(new char[]{'\uFEFF'}); assuming .NET 4. Note that Trim returns a new string, so the result has to be assigned back to line.
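As a rough sketch of that suggestion, the helper below names U+FEFF explicitly in the trim set. `BomTrim` and `TrimBom` are hypothetical names; the only library call is the `String.Trim(params char[])` overload.

```csharp
using System;

public static class BomTrim
{
    // Trims U+FEFF from both ends of the string by naming it explicitly.
    // Characters in the middle of the string are left untouched.
    public static string TrimBom(string line)
    {
        return line.Trim('\uFEFF');
    }
}
```

Unlike the parameterless `Trim()`, this does not depend on which characters the runtime classifies as white space, but it also leaves ordinary spaces alone unless they are added to the trim set.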

ethorn10
0

I've found the solution:

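private readonly string BYTE_ORDER_MARK_UTF8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

...

// Use an ordinal comparison: a culture-sensitive StartsWith treats U+FEFF as ignorable and matches every string.
if (line.StartsWith(BYTE_ORDER_MARK_UTF8, StringComparison.Ordinal))
    line = line.Remove(0, BYTE_ORDER_MARK_UTF8.Length);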

That was bizarre...
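To see why this approach works, note that the three preamble bytes (EF BB BF) decode to the single character U+FEFF. The sketch below checks for that character with an ordinal comparison, since culture-sensitive string comparisons in .NET treat U+FEFF as ignorable. `PreambleCheck` and `StartsWithBom` are hypothetical names.

```csharp
using System;
using System.Text;

public static class PreambleCheck
{
    // The UTF-8 preamble bytes decode to a one-character string containing U+FEFF.
    public static readonly string Bom =
        Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

    // Ordinal comparison looks at the raw UTF-16 code units,
    // so the zero-width character is not skipped during matching.
    public static bool StartsWithBom(string line)
    {
        return line.StartsWith(Bom, StringComparison.Ordinal);
    }
}
```

Only when `StartsWithBom(line)` is true does it make sense to call `line.Remove(0, PreambleCheck.Bom.Length)`.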

TtT23
-1

In that code you have posted, it seems like the line variable begins with a space character. Try line = line.Trim();

Edit:

The reason the string.Trim() method is not working as expected can be found on MSDN:

Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove.

U+FEFF seems to be the character at the beginning of line, which is why Trim isn't removing it.
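The MSDN passage ties the behavior of Trim to Char.IsWhiteSpace, so a quick way to check a suspicious character is to ask that method directly. The sketch below does this; `WhiteSpaceProbe`, `IsTrimmable` and `CategoryOf` are hypothetical names wrapping the real `char.IsWhiteSpace` and `char.GetUnicodeCategory` calls.

```csharp
using System;
using System.Globalization;

public static class WhiteSpaceProbe
{
    // True when the character counts as Unicode white space,
    // which is the test the quoted documentation says Trim() uses
    // on .NET Framework 4 and later.
    public static bool IsTrimmable(char c)
    {
        return char.IsWhiteSpace(c);
    }

    // Reports the Unicode general category of the character.
    public static UnicodeCategory CategoryOf(char c)
    {
        return char.GetUnicodeCategory(c);
    }
}
```

U+FEFF is a format character rather than white space, so `IsTrimmable('\uFEFF')` is false while `IsTrimmable(' ')` is true.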

nick_w
  • Not only isn't that true, even if it were, `b` would be the one that would contain a space, which it doesn't. On top of that, `a` contains an empty string, not a space. – jdotjdot Oct 15 '12 at 04:18
  • It looks like it does. And char(65279) is a space, isn't it? – Greg Oct 15 '12 at 04:20