4

I have an issue on a customer site where lines containing words like "HabitaþÒo" get mangled on output. I'm processing a text file (pulling out selected lines and writing them to another file).

For diagnosis I've boiled the problem down to a file with just that bad word.

The original file contains no BOM, but .NET chooses to read it as UTF-8.

When read and written, the word ends up looking like this: "Habita��o".

A hex dump of the BadWord.txt file looks like this:

[hex dump of BadWord.txt]

Copying the file with this code

using (var reader = new StreamReader(@"C:\BadWord.txt"))
using (var writer = new StreamWriter(@"C:\BadWordReadAndWritten.txt"))
    writer.WriteLine(reader.ReadLine());

. . . gives . . .

[hex dump of BadWordReadAndWritten.txt]

Preserving the reader's encoding doesn't help either:

using (var reader = new StreamReader(@"C:\BadWord.txt"))
using (var writer = new StreamWriter(@"C:\BadWordReadAndWritten_PreseveEncoding.txt", false, reader.CurrentEncoding))
    writer.WriteLine(reader.ReadLine());

. . . gives . . . [hex dump of BadWordReadAndWritten_PreseveEncoding.txt]

Any ideas what's going on here? How can I process this file and preserve the original text?

Binary Worrier
  • Why don't you set the read Encoding? `System.IO.StreamReader(@"C:\BadWord.txt", System.Text.Encoding.Default)` – balexandre Jan 08 '13 at 11:40
  • Will your program ever be run on a system in a different code page? If so, is the text data created on that system, or does it always come from a system with codepage 1252? – Matthew Watson Jan 08 '13 at 12:28
  • In this instance the file is created on one server and consumed on another. This software has been running happily for over 10 years on multiple sites and has only recently given problems for one customer. Patching the code to take an optional code page for that file will probably get us over this. Don't worry, I'm not going to hard-code page numbers anywhere :) – Binary Worrier Jan 08 '13 at 12:35

2 Answers

8

The only way to do it is to read the file with the same encoding it was written in. In this case, that means Windows-1252:

Encoding enc = Encoding.GetEncoding(1252);
string correctText = File.ReadAllText(@"C:\BadWord.txt", enc);
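
For completeness, here's a minimal sketch of a copy that reads *and* writes with Windows-1252, so the text survives the round trip (the output path is illustrative; on .NET Core / .NET 5+ you would also need to register the code-pages provider via `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` before `GetEncoding(1252)` will work):

// On .NET Framework, GetEncoding(1252) works out of the box.
Encoding enc = Encoding.GetEncoding(1252);

// Decode and re-encode with the same single-byte code page, so bytes
// like 0xFE ("þ" in Windows-1252) round-trip unchanged.
string correctText = File.ReadAllText(@"C:\BadWord.txt", enc);
File.WriteAllText(@"C:\BadWordCopy.txt", correctText, enc);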
Esailija
  • Sometimes (more often than I'd like) `System.Text.Encoding.Default` is a better choice to use. – balexandre Jan 08 '13 at 11:42
  • 3
    @balexandre No it isn't; it has no relation to the file's encoding, is not portable, and is always a bug. It's completely random what you get out of that: if I set my Windows locale to Arabic, that code will attempt to decode the file as Windows-1256, but the file *must* be decoded as Windows-1252. It makes absolutely no sense. – Esailija Jan 08 '13 at 11:43
  • @balexandre: According to the [docs](http://msdn.microsoft.com/en-us/library/system.text.encoding.default.aspx), that will return the "encoding for the operating system's current ANSI code page." Why would that be better than explicitly specifying the encoding of the file the OP is trying to read? – O. R. Mapper Jan 08 '13 at 11:44
  • With Danish text, I keep getting errors on `Æ` `Ø` `Å` if I specify the encoding explicitly, but they stop when I use `Default`... I got `Default` from a Jon Skeet answer somewhere around here, and since then everything has worked fine... But yes, my system is set to a Danish locale as well. (There's plenty more, with more detail, but [here's one](http://stackoverflow.com/a/592938/28004) using `Default`.) – balexandre Jan 08 '13 at 11:46
  • Usually when you are not using Unicode, it's because you are supporting some legacy data. Normally, that legacy data will have been written using the current ANSI code page. If you fix the codepage via GetEncoding(), you'll find that it won't work on systems that use a different code page. In those cases (just for supporting legacy apps) you have to use Encoding.Default. This also assumes that the legacy app writing the data is using the local ANSI code page. If this is NOT for a legacy app, then of course you should be using Unicode. – Matthew Watson Jan 08 '13 at 11:48
  • 1
    @balexandre Well, you simply specified the wrong encoding then – Esailija Jan 08 '13 at 11:51
  • @Esailija: How do you know that codepage 1252 has been used? It could have been a different code page! As soon as you hard-code a codepage like that, the code won't work properly on a system with a different codepage. Hence, the use of Encoding.Default to work with legacy apps and data. – Matthew Watson Jan 08 '13 at 11:51
  • 1
    @MatthewWatson because the OP gave me the raw bytes and expected string, and when those bytes are decoded in 1252, it gives the expected string. 1252 is the only encoding where `0xFE` becomes `þ` AFAIK. – Esailija Jan 08 '13 at 11:52
  • @Esailija: Yes, for that particular data set. But how do you know that the OP's application isn't also deployed in other locales? If it is, then fixing the code page is wrong. – Matthew Watson Jan 08 '13 at 11:53
  • 1
    @MatthewWatson You'd have a point if this was a desktop application completely isolated from anything and only reads/writes files it itself has created. In the wild, you need to be told the encoding of input or guess it. – Esailija Jan 08 '13 at 11:55
  • Yes, but remember I'm specifically talking about supporting non-unicode legacy apps (probably written in C++/MFC). They almost always use the local code page. The main thing is that I know you are wrong when you say that using Encoding.Default is *always* a bug. For some legacy app support, it is the right thing to use (making the best of a bad situation, of course). – Matthew Watson Jan 08 '13 at 12:03
  • 1
    @MatthewWatson yeah I guess nothing is *always* when it comes to software :) – Esailija Jan 08 '13 at 12:36
0

You should call reader.Peek() before opening the StreamWriter. Peek() reads ahead in the file so the reader can detect the encoding, without changing the current position.
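
A minimal sketch of that approach, based on the question's code (the output path is illustrative, and since StreamReader's detection is BOM-based this only helps when the file actually starts with a BOM):

using (var reader = new StreamReader(@"C:\BadWord.txt"))
{
    // Peek() forces the reader to read ahead and detect the encoding
    // (from a BOM, if one is present) without consuming any characters,
    // so CurrentEncoding is populated before the writer is created.
    reader.Peek();

    using (var writer = new StreamWriter(@"C:\BadWordCopy.txt", false, reader.CurrentEncoding))
        writer.WriteLine(reader.ReadLine());
}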

bmolsbeck