Encoding detection for string data in a byte[] succeeds, but afterwards all string comparisons fail

Question

The setup:

I receive a byte[] which contains CSV data
I don't know the encoding (it should be Unicode / UTF-8)
I need to detect the encoding or fall back to a default (the text may contain umlauts, so the encoding is important)
I need to read the header line and compare it with predefined strings

After a short search for how to get a string out of the byte[], I found "How to convert byte[] to string?", which suggested using something like:

string result = System.Text.Encoding.UTF8.GetString(byteArray);

I now use this helper to detect the encoding and then the Encoding.GetString method to read the string, like so:

string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);

But when I now try to compare values from the resulting string with static strings in my code, all comparisons fail!

// header is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++) {
    switch (headers[i].Trim().ToLower()) {
        case "number":
            // do
            break;
        default:
            throw new Exception();
    }
}
// where (headers[i].Trim().ToLower()) => "number"

While this seems to be a problem with the encoding of both strings, my question is:

How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?

Edit

The code supplied above worked as long as the string data came from a file that was saved this way:

string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;

while ((line = reader.ReadLine()) != null)
{
    if (line.Length > 1)
    {
        tw.WriteLine(line);
    }
}
tw.Close();

and afterwards read out with

File.ReadAllText()

This approach

A. forces the file to be Unicode (the ANSI format kills all umlauts)

B. requires the written file to be accessible

Now I only have the inputStream and tried what I posted above. As I mentioned, this worked before, and the strings look identical. But they are not.

Note: If I use an ANSI-encoded file, which uses Encoding.Default, everything works fine.

Edit 2

While ANSI-encoded data works, the UTF-8 encoded data (Notepad++ only shows "UTF-8", not "UTF-8 w/o BOM") starts with char [0]: 65279

So where is my error? I assume System.Text.Encoding.UTF8.GetString(byteArray) is working correctly.

A few more details... What is the real encoding of the CSV? Try opening it with Notepad++ and look at the Format: is it UTF8? UTF8 without BOM? Ansi? Your code is only trying to find the BOM at the beginning of the file, but many UTF8/Unicode files are without BOM... And what does headers[i] contain in the end? You are a little "light" on details — xanatos, May 20 '15 at 11:32
What does the debugger show for the actual characters making up `headers[i]` for the `i` you're expecting to match `"number"` ? — AakashM, May 20 '15 at 11:33
@xanatos Notepad shows UTF8 while headers[i] contains "Number" and (headers[i].Trim().ToLower()) => "number", so visually they are the same — sra, May 20 '15 at 11:39
@sra So the problem isn't in the encoding... You'll have some bug around the code :-) Put a breakpoint in the `for` cycle and watch — xanatos, May 20 '15 at 11:40
(as a sidenote, using `ToLower()` in that way is wrong... At least use `ToLowerInvariant()`. See http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/, search for *To prove this isn’t just a problem for toy examples*) — xanatos, May 20 '15 at 11:43
@xanatos I just have to rewrite this and it was working as long as the data came from a file that was read with `File.ReadAllText`.... I will edit the question to add some more information. — sra, May 20 '15 at 11:51
"it shows "number" so optical the same" - set a watch on, or evaluate in the Immediate window, `.ToCharArray()` on your string that you think is `"header"`. The results may be interesting. — AakashM, May 20 '15 at 11:52
The original code you've edited in just treats the data as UTF-8. So if that's good enough for you, just use UTF-8. — Luaan, May 20 '15 at 12:02
@AakashM yes, that seems to be the problem: char [0]: 65279 '' but where exactly is my error — sra, May 20 '15 at 12:08
If the encoding is important you should not guess. Ask the people/company that produces the file which encoding they used. Don't guess. Most encodings can't be detected at all. — Lasse V. Karlsen, May 20 '15 at 12:38

score 3 · Accepted Answer · edited May 23 '17 at 12:30


Yes, Encoding.GetString doesn't strip the BOM (see https://stackoverflow.com/a/11701560/613130). You could use a StreamReader instead:

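string result;

// Wrap the bytes in a stream; StreamReader skips a leading BOM while decoding.
using (var memoryStream = new MemoryStream(byteArray))
using (var reader = new StreamReader(memoryStream))
{
    result = reader.ReadToEnd();
}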

The StreamReader will autodetect the encoding from the BOM and leave the BOM out of the returned string (your encoding detector is a copy of StreamReader.DetectEncoding()).
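To make the difference concrete, here is a minimal sketch that decodes the same bytes both ways. The class name `BomDemo`, the helper `DecodeWithReader`, and the sample bytes (a UTF-8 BOM followed by "Number") are assumptions made up for illustration; they are not part of the question's code.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: decodes a byte[] through a StreamReader,
    // which detects the encoding from the BOM and does not return the BOM.
    public static string DecodeWithReader(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumed sample input: UTF-8 BOM (EF BB BF) followed by "Number".
        byte[] data = { 0xEF, 0xBB, 0xBF, 0x4E, 0x75, 0x6D, 0x62, 0x65, 0x72 };

        // GetString keeps the BOM as an invisible first character (U+FEFF).
        string raw = Encoding.UTF8.GetString(data);
        Console.WriteLine((int)raw[0]);        // 65279
        Console.WriteLine(raw.Length);         // 7
        Console.WriteLine(raw == "Number");    // False

        // StreamReader consumes the BOM, so the comparison succeeds.
        string clean = DecodeWithReader(data);
        Console.WriteLine(clean.Length);       // 6
        Console.WriteLine(clean == "Number");  // True

        // Alternative if you must keep GetString: strip the BOM yourself.
        string trimmed = raw.TrimStart('\uFEFF');
        Console.WriteLine(trimmed == "Number"); // True
    }
}
```

If the data has no BOM, the reader in this sketch falls back to the encoding passed in (UTF-8 here), so ANSI-encoded input with umlauts would still need its encoding supplied explicitly.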


answered May 20 '15 at 12:31

xanatos

