Read text encoding issue

Question

I am reading a plain text file (CSV) and I have a problem with the byte 0xA0.

Visual Studio 2015:

Notepad++: (when setting char encoding to utf-8)

It seems to be a non-breaking space, so I tried this:

    temp = temp.Replace("\xA0", string.Empty);

But it did not work and gave me black squares similar to the first screenshot. I also changed

    System.IO.StreamReader sr = new System.IO.StreamReader(csvFile.FileContent);

to use an explicit UTF-8 encoding:

    System.IO.StreamReader sr = new System.IO.StreamReader(csvFile.FileContent, System.Text.Encoding.UTF8);

Both gave the same result. I really dislike char encoding and could use some help and an explanation of my mistake.

Edit: added the Notepad++ hex view (to confirm it is the non-breaking space character).

Edit 2: changing the StreamReader constructor call to this:

    System.IO.StreamReader sr = new System.IO.StreamReader(csvFile.FileContent, true);

results in UTF-8 being used to read the file. I tried to convert from Latin1 to UTF-8, following https://stackoverflow.com/a/13999801/169714, but that gave me ???:

    Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(temp))

_"I really dislike char encoding"_ - you can't dislike a fundamental concept like that. Read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). The problem you're seeing is that you're interpreting ANSI characters as Unicode. [The proper UTF-8 encoding for this character ("NO-BREAK SPACE") is the two-byte `0xC2 0xA0`](http://www.fileformat.info/info/unicode/char/00a0/index.htm). If that's not what's in the file, then the file is not UTF-8. — CodeCaster, Apr 12 '16 at 09:42
So open the file in a hex editor and take a look at the actual byte data. — CodeCaster, Apr 12 '16 at 09:46
Make sure the CSV is exported as UTF8 *with BOM*, or specify the *correct* encoding in the constructor. The reader doesn't try to guess which of the thousands of encodings matches the content. It will check for a BOM otherwise use the system locale's encoding. In fact, you *can't* guess the encoding without reading the entire file, or at least as much content as possible to eliminate the (thousands) of possibilities that fail mapping. And you'll still need a human, or a spell checker, to check the mapping results to find the most legible — Panagiotis Kanavos, Apr 12 '16 at 10:02
The "black squares" are actually a good thing - they are the Unicode replacement character used when an unknown character is encountered. It means that your text is *definitely* not UTF8. 0xA0 is the non-breaking space in Latin1. Try passing `Encoding.GetEncoding("iso-8859-1")` as the encoding — Panagiotis Kanavos, Apr 12 '16 at 10:10
Thank you @PanagiotisKanavos But I have no influence on how the end-user exports it. I just need to import it. CodeCaster I have read that article before and will read it again. Thank you for your suggestion. — JP Hellemons, Apr 12 '16 at 10:11

score 0 · Answer 1 · answered Apr 12 '16 at 09:38

Try reading the data into a string array and writing it back out, something like this:

    string[] data = File.ReadAllLines(yourSavePath);
    File.WriteAllLines(yourSavePath, data);

If I am right, that should fix it; it looks like a missing-characters issue.

Nonagon

`File.ReadAllLines()` tries to guess the used encoding, and in cases like this, often gets it wrong. – CodeCaster Apr 12 '16 at 09:44
That's no different from what the OP did - ReadAllLines uses a Reader internally. The reader will check for a BOM, otherwise assume this is an ASCII file and use the system locale. It can't guess what the proper encoding is – Panagiotis Kanavos Apr 12 '16 at 10:04
@CodeCaster oops, the Reader will [read UTF8 by default](http://referencesource.microsoft.com/#mscorlib/system/io/streamreader.cs,133). The system locale must be specified using `Encoding.Default`. In many countries Latin1 *is* the system locale. – Panagiotis Kanavos Apr 12 '16 at 10:27
Good to know, thanks for the clarification guys. Added to my knowledge base – Nonagon Apr 12 '16 at 10:51

Panagiotis Kanavos · Answer 2 · 2016-04-12T10:34:40.067

0xA0 is the non-breaking space in Latin1 (ISO-8859-1). You can read it by passing `Encoding.GetEncoding("iso-8859-1")` as the encoding:

    var latin1 = Encoding.GetEncoding("iso-8859-1");
    var sr = new System.IO.StreamReader(csvFile.FileContent, latin1);

For example, for the input array:

    byte[] values = {0x53,0x34,0x35,0x3b,0x35,0x31,0xa0,0xa0,0xa0,0xa0,0xa0};

UTF8 returns one replacement character for each invalid 0xA0 byte:

    var s1 = Encoding.UTF8.GetString(values);
    Console.WriteLine(s1);

    S45;51�����

While Latin1 returns a valid string (the five trailing non-breaking spaces are invisible):

    var s2 = latin1.GetString(values);
    Console.WriteLine(s2);

    S45;51
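Once the bytes are decoded with the right encoding, the replacement from the question starts working, because the string now really contains U+00A0 instead of the replacement character. A minimal sketch, assuming the file is Latin1; the class and method names (`NbspDemo`, `DecodeAndStrip`) are made up for illustration:

```csharp
using System;
using System.Text;

public static class NbspDemo
{
    // Hypothetical helper: decode the bytes as ISO-8859-1, then strip
    // every U+00A0 (NO-BREAK SPACE) from the resulting string.
    public static string DecodeAndStrip(byte[] bytes)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string text = latin1.GetString(bytes);
        return text.Replace("\u00A0", string.Empty);
    }

    public static void Main()
    {
        byte[] values = { 0x53, 0x34, 0x35, 0x3b, 0x35, 0x31, 0xa0, 0xa0, 0xa0, 0xa0, 0xa0 };
        // Brackets make it visible that no trailing characters remain.
        Console.WriteLine("[" + DecodeAndStrip(values) + "]"); // prints [S45;51]
    }
}
```

If the same bytes are decoded as UTF-8 first, each 0xA0 has already become U+FFFD and there is no U+00A0 left to replace.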

.NET strings are Unicode, and text files are read as UTF-8 by default. For example, StreamReader's constructors default to UTF-8:

    public StreamReader(Stream stream) 
        : this(stream, true) {
    }

    public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks) 
        : this(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks, DefaultBufferSize, false) {
    }
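The same constructor family also accepts an explicit encoding together with the BOM flag, so a reader can honor a BOM when one is present and otherwise use an encoding you choose. A small sketch of that combination; the names `BomFallbackDemo` and `ReadAll` are illustrative, and picking Latin1 as the fallback is an assumption about the input files:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomFallbackDemo
{
    // Hypothetical helper: read all text, honoring a BOM if present,
    // otherwise decoding with the supplied fallback encoding.
    public static string ReadAll(byte[] fileBytes, Encoding fallback)
    {
        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // No BOM: "51" followed by one Latin1 non-breaking space (0xA0).
        byte[] noBom = { 0x35, 0x31, 0xA0 };
        // UTF-8 BOM (EF BB BF), then "51" and U+00A0 encoded as C2 A0.
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x35, 0x31, 0xC2, 0xA0 };

        // Both inputs should decode to the same three-character string.
        Console.WriteLine(ReadAll(noBom, latin1) == "51\u00A0");
        Console.WriteLine(ReadAll(withBom, latin1) == "51\u00A0");
    }
}
```

This only helps for files that carry a BOM; a BOM-less file in some other encoding is still decoded with the fallback, whether or not that is correct.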

To use the system locale, `Encoding.Default` must be passed explicitly:

    var sr = new System.IO.StreamReader(csvFile.FileContent, Encoding.Default);

Many West European and English-speaking countries do use Latin1, so the system locale could be expected to match. This is a risky assumption to make in import jobs, though.
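When the exporter cannot be controlled, one possible heuristic is to try a strict UTF-8 decode first and fall back to a single-byte encoding only if the bytes are not valid UTF-8. The sketch below assumes Latin1 is the right fallback for these files; `StrictUtf8Demo` and `Decode` are hypothetical names:

```csharp
using System;
using System.Text;

public static class StrictUtf8Demo
{
    // Hypothetical helper: decode as UTF-8 if the bytes are valid UTF-8,
    // otherwise decode with the supplied fallback encoding.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true raises an exception on invalid input
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] latin1Bytes = { 0x35, 0x31, 0xA0 };     // lone 0xA0 is invalid UTF-8
        byte[] utf8Bytes = { 0x35, 0x31, 0xC2, 0xA0 }; // C2 A0 is valid UTF-8 for U+00A0

        Console.WriteLine(Decode(latin1Bytes, latin1) == "51\u00A0");
        Console.WriteLine(Decode(utf8Bytes, latin1) == "51\u00A0");
    }
}
```

This is still a guess: the fallback cannot tell Latin1 apart from other single-byte encodings, so knowing the actual encoding of each file remains the only reliable option.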

But can't this break other encoded files? I'd just like to remove that char. Or convert * to unicode or ansi — JP Hellemons, Apr 12 '16 at 10:19
First, .NET strings *are* Unicode. You got the *Unicode* replacement character � because the input couldn't be converted to Unicode. Second, if you don't want to handle individual encodings, ensure the files are exported as Unicode *with* BOM - that's why Unicode was created in the first place. Otherwise you *have* to specify the correct encoding, or ensure your system locale and the encodings match, and you pass [Encoding.Default](https://msdn.microsoft.com/en-us/library/system.text.encoding.default(v=vs.110).aspx) as the encoding — Panagiotis Kanavos, Apr 12 '16 at 10:24
@JPHellemons that is addressed by "there ain't no such thing as plain text" in the article I linked to. If you want to read a string into your program, the provider will have to specify the encoding of that string. Otherwise all bets are off. So while you can assume iso-8859-1 for _this_ file, that encoding may very well break the next file. — CodeCaster, Apr 12 '16 at 10:31
