12

I created a sample app that loads text containing special characters, copied and pasted from OpenOffice Writer into Notepad. The double quotes differ from plain ones, and the problem appears when I try to load the file like this:

var lines = File.ReadAllLines("..\\ter34.txt");

This produces the character 65533. The text file contains:

When loaded, it has been changed to the symbol:

Daniel A.A. Pelsmaeker
Aravind Srinivas
  • 2
    What encoding is the text file using? ANSI? ASCII? UTF8? UTF16? – Matthew Watson Feb 22 '13 at 10:43
  • The problem occurs only with ANSI; everything else works correctly. It changes it to -- “ -- – Aravind Srinivas Feb 22 '13 at 10:52
  • 2
    Just for those who might not know: `(char)65533` is also known as U+FFFD, the REPLACEMENT CHARACTER. It is often emitted when the data to be converted is corrupt, or when the encoding to convert into can't represent the correct character. See [Wikipedia](http://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character). – Jeppe Stig Nielsen Feb 22 '13 at 10:53

1 Answer

26

U+FFFD is the "Unicode replacement character", which is used if the data you try to read is invalid for the encoding which is being used to convert binary data to text.

For example, if you write a file out using ISO-8859-1, but then try to read it using UTF-8, then you could easily end up with some byte sequences which simply aren't valid UTF-8. Each invalid byte would be translated (by default) into U+FFFD.
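
To make that concrete, here is a minimal sketch of the mismatch, not taken from the original answer. The class and method names are made up for illustration; it encodes a string as ISO-8859-1 and then decodes the same bytes as UTF-8, so the byte for 'é' should come back as U+FFFD.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
static class ReplacementCharDemo
{
    // Encodes text as ISO-8859-1, then (wrongly) decodes the bytes as UTF-8.
    public static string RoundTripWithWrongEncoding(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = latin1.GetBytes(text);   // 'é' becomes the single byte 0xE9
        return Encoding.UTF8.GetString(bytes);  // 0xE9 followed by '!' is not valid UTF-8
    }

    static void Main()
    {
        string result = RoundTripWithWrongEncoding("caf\u00E9!");
        // Print each code point so the replacement character is easy to spot.
        foreach (char c in result)
            Console.WriteLine("U+{0:X4}", (int)c);
    }
}
```

The default decoder substitutes U+FFFD rather than throwing, which is why the problem tends to show up as a strange symbol instead of an error.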

Basically, you need to provide the right encoding to File.ReadAllLines, as a second argument. That means you need to know the encoding of the file first, of course.
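
As a rough sketch of what passing the encoding looks like, assuming for illustration that the file is ISO-8859-1 (your file may well use something else): the helper below writes a temporary file in that encoding, then reads it once without and once with the encoding argument. `ReadWithEncodingDemo` and `WriteLatin1File` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
static class ReadWithEncodingDemo
{
    // Writes one line to a temp file as ISO-8859-1 and returns the file path.
    public static string WriteLatin1File(string line)
    {
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { line }, Encoding.GetEncoding("ISO-8859-1"));
        return path;
    }

    static void Main()
    {
        string path = WriteLatin1File("caf\u00E9 au lait");
        try
        {
            // No encoding given: the file is decoded as UTF-8, which does not match.
            string wrong = File.ReadAllLines(path)[0];
            // Encoding given as the second argument: matches how the file was written.
            string right = File.ReadAllLines(path, Encoding.GetEncoding("ISO-8859-1"))[0];

            Console.WriteLine("Without encoding contains U+FFFD: " + (wrong.IndexOf('\uFFFD') >= 0));
            Console.WriteLine("With encoding round-trips: " + (right == "caf\u00E9 au lait"));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```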

Jon Skeet
  • Oddly enough, I always thought this was just a custom feature of some data streaming/transcoding library. And it is well-defined Unicode transcoding behavior? Great! – quetzalcoatl Feb 22 '13 at 10:52
  • When I save the text file in formats like UTF-8, Unicode, etc., it works correctly, but when I save it as ANSI, that symbol appears. – Aravind Srinivas Feb 22 '13 at 10:54
  • 1
    Unicode files can represent many different characters, while ANSI depends on the selected code page and usually covers far fewer. When you try to save some 'extended' character to an ANSI file, there is a chance that this character simply cannot be translated to the ANSI code page you have selected (or defaulted to). In such cases, three things could happen: an exception could be thrown and crash everything so you see there's a problem, OR those characters could be silently skipped (eeviill), OR some "replacement character" is written to the file instead so you see there's a problem. – quetzalcoatl Feb 22 '13 at 10:59
  • 1
    @user2046631: Right, so when you *read* the file you need to specify that encoding too. "ANSI" isn't a single encoding though - it's a broad term used for lots of encodings. You'll need to find out which one you actually mean. – Jon Skeet Feb 22 '13 at 11:00
  • @user2046631 You can possibly use `File.ReadAllLines(@"..\ter34.txt", Encoding.GetEncoding("Windows-1252"))` if the text file is in "Windows (Western European)" kind of ANSI. To rely on the ANSI of your own machine, use `File.ReadAllLines(@"..\ter34.txt", Encoding.Default)`. – Jeppe Stig Nielsen Feb 22 '13 at 11:08
  • I opened Notepad and saved the file with the above character, with the encoding left at the default, ANSI. If I change it to one of the others, it works. I tried only the four formats that Notepad offers by default in its Encoding list. – Aravind Srinivas Feb 22 '13 at 11:11
  • Is this an issue with the encoding format, or an illegal character encoding? – Aravind Srinivas Feb 22 '13 at 11:14
  • @user2046631: You need to work out *exactly* what Notepad means by "ANSI" and then specify that in the call to `ReadAllLines`, basically... if you want to support ANSI. (It's not clear what your actual use case is.) If you can just state that you only support UTF-8, that would make things a lot easier. – Jon Skeet Feb 22 '13 at 11:43
  • If I use `File.ReadAllLines(@"..\ter34.txt", Encoding.Default)` then it works correctly, including for ANSI. But I am using a third-party DLL named GEMBOXSPREADSHEET, which has a LoadCsv(filepath,char) method. That fails with ANSI encoding but works for UTF-8, Unicode, etc. – Aravind Srinivas Feb 22 '13 at 11:45
  • @user2046631: On your computer, yes - but quite possibly not if you send that file to someone else, who has a different default encoding. Anyway, I'm not sure these comments are productive any more, because we don't have clear requirements from you. I've explained why you're getting this problem - but without knowing more details about your situation, I can't offer a solution. – Jon Skeet Feb 22 '13 at 11:47
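
For readers who land here with the same symptom, the following is a hedged sketch of the curly-quote case discussed in the comments. It assumes that Notepad's "ANSI" means Windows-1252 on the machine in question, which may not hold for yours, and that the code runs on modern .NET (.NET Core 3.0 or later), where legacy code pages have to be registered first; on .NET Framework that registration should not be needed. `Windows1252Demo` and `GetWindows1252` are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
static class Windows1252Demo
{
    public static Encoding GetWindows1252()
    {
        // Assumption: modern .NET, where code page encodings are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("Windows-1252");
    }

    static void Main()
    {
        // In Windows-1252, 0x93 and 0x94 are the left and right curly double quotes.
        byte[] ansiBytes = { 0x93, 0x68, 0x69, 0x94 };

        string asUtf8 = Encoding.UTF8.GetString(ansiBytes);     // wrong encoding
        string as1252 = GetWindows1252().GetString(ansiBytes);  // matching encoding

        foreach (char c in asUtf8)
            Console.Write("U+{0:X4} ", (int)c);
        Console.WriteLine();
        foreach (char c in as1252)
            Console.Write("U+{0:X4} ", (int)c);
        Console.WriteLine();
    }
}
```

Bytes 0x93 and 0x94 are not valid on their own in UTF-8, so a UTF-8 read cannot recover the quotes; a read with the matching code page can.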