How it is all set up:

  • I receive a byte[] which contains CSV data
  • I don't know the encoding (it should be Unicode / UTF-8)
  • I need to detect the encoding or fall back to a default (the text may contain umlauts, so the encoding is important)
  • I need to read the header line and compare it with defined strings

After a short search on how to get a string out of the byte[], I found "How to convert byte[] to string?", which suggested using something like

string result = System.Text.Encoding.UTF8.GetString(byteArray);

I now use this helper to detect the encoding and afterwards the Encoding.GetString method to read the string, like so:

string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);

But when I now try to compare values from this result string with static strings in my code, all comparisons fail!

// header is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++) {
    switch (headers[i].Trim().ToLower()) {
        case "number":
            // do
            break;
        default:
            throw new Exception();
    }
}
// where (headers[i].Trim().ToLower()) => "number"

While this seems to be a problem with the encoding of both strings my question is:

How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?


Edit

The code supplied above was working as long as the string data came from a file that was saved this way:

string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;

while ((line = reader.ReadLine()) != null)
{
    if (line.Length > 1)
    {
        tw.WriteLine(line);
    }
}
tw.Close();

and afterwards read out with

File.ReadAllText()

This

A. Forces the file to be unicode (ANSI format kills all umlauts)

B. requires the written file be accessible

Now I only have the inputStream and tried what I posted above. As I mentioned, this worked before, and the strings look identical. But they are not.

Note: If I use an ANSI-encoded file, which is read with Encoding.Default, everything works fine.


Edit 2

While ANSI-encoded data works, the UTF-8 encoded data (Notepad++ shows only "UTF-8", not "UTF-8 w/o BOM") starts with char [0]: 65279

So where is my error? I assume System.Text.Encoding.UTF8.GetString(byteArray) is working correctly.
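The symptom described above can be reproduced in isolation. The following is a minimal sketch, assuming a made-up input of a UTF-8 byte order mark (EF BB BF) followed by the text "Number"; the class name BomRepro and the byte values are illustrative and not taken from the original data.

```csharp
using System;
using System.Text;

class BomRepro
{
    static void Main()
    {
        // Hypothetical input: UTF-8 BOM (EF BB BF) followed by the ASCII text "Number".
        byte[] data = { 0xEF, 0xBB, 0xBF, 0x4E, 0x75, 0x6D, 0x62, 0x65, 0x72 };

        // GetString decodes the BOM bytes into the character U+FEFF instead of dropping them.
        string header = Encoding.UTF8.GetString(data);

        Console.WriteLine((int)header[0]);   // 65279 (U+FEFF)
        Console.WriteLine(header.Length);    // 7, although only six letters are visible

        // On .NET 4.0 and later, Trim() does not treat U+FEFF as white space,
        // so the ordinal comparison used by == and switch does not match.
        Console.WriteLine(header.Trim().ToLowerInvariant() == "number");   // False
    }
}
```

U+FEFF has no visible glyph, which is why the debugger shows two strings that look the same.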

sra
  • 1
    A few more details... What is the real encoding of the CSV? Try opening it with Notepad++ and look at the Format: is it UTF8? UTF8 without BOM? ANSI? Your code is only trying to find the BOM at the beginning of the file, but many UTF8/Unicode files are without BOM... And what does headers[i] contain in the end? You are a little "light" on details – xanatos May 20 '15 at 11:32
  • What does the debugger show for the actual characters making up `headers[i]` for the `i` you're expecting to match `"number"` ? – AakashM May 20 '15 at 11:33
  • @AakashM it shows "number" so optical the same. – sra May 20 '15 at 11:37
  • 2
    What's the value of (int)str[0]? Probably an UTF BOM. – usr May 20 '15 at 11:38
  • @xanatos Notepad shows UTF8 while headers[i] contains "Number" and (headers[i].Trim().ToLower()) => "number" so optical the same – sra May 20 '15 at 11:39
  • @sra So the problem isn't in the encoding... You'll have some bug around the code :-) Put a breakpoint in the `for` cycle and watch – xanatos May 20 '15 at 11:40
  • (as a sidenote, using `ToLower()` in that way is wrong... At least use `ToLowerInvariant()`. See http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/, search for *To prove this isn’t just a problem for toy examples*) – xanatos May 20 '15 at 11:43
  • @xanatos I just have to rewrite this and it was working as long as the data came from a file that was read with `File.ReadAllText`.... I will edit the question to add some more information. – sra May 20 '15 at 11:51
  • 2
    "it shows "number" so optical the same" - set a watch on, or evaluate in the Immediate window, `.ToCharArray()` on your string that you think is `"header"`. The results may be interesting. – AakashM May 20 '15 at 11:52
  • The original code you've edited in just treats the data as UTF-8. So if that's good enough for you, just use UTF-8. – Luaan May 20 '15 at 12:02
  • 1
    @AakashM yes, that seems to be the problem: char [0]: 65279 '' but where exactly is my error – sra May 20 '15 at 12:08
  • Maybe you forgot to skip the header, if it exists? – Mark Shevchenko May 20 '15 at 12:14
  • If the encoding is important you should not guess. Ask the people/company that produces the file which encoding they used. Don't guess. Most encodings can't be detected at all. – Lasse V. Karlsen May 20 '15 at 12:38

1 Answer

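Yes, Encoding.GetString doesn't strip the BOM (see https://stackoverflow.com/a/11701560/613130), so the BOM ends up in the string as the character U+FEFF (65279). You could read the bytes through a StreamReader instead: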

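string result;

// Wrap the bytes in a stream so that StreamReader can inspect the BOM.
using (var memoryStream = new MemoryStream(byteArray))
using (var reader = new StreamReader(memoryStream))
{
    // The reader consumes the BOM, so it does not appear in the result.
    result = reader.ReadToEnd();
}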

The StreamReader will autodetect the encoding from the BOM and leave the BOM out of the returned string (your encoding detector is a copy of StreamReader.DetectEncoding()).
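For the byte[] case in the question, this could be wrapped in a small helper. This is only a sketch under stated assumptions: the names CsvText and Decode are hypothetical, the sample bytes are made up, and the fallback encoding is applied only when no BOM is present, so data without a BOM is simply decoded with whatever fallback you pass in. The last lines also show the alternative of keeping GetString and trimming a leading U+FEFF by hand.

```csharp
using System;
using System.IO;
using System.Text;

static class CsvText
{
    // Hypothetical helper: decodes the bytes, letting StreamReader detect and skip a BOM.
    // The fallback encoding is used only if no BOM is found.
    public static string Decode(byte[] data, Encoding fallback)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}

class Demo
{
    static void Main()
    {
        // Hypothetical input: UTF-8 BOM (EF BB BF) followed by "Number".
        byte[] data = { 0xEF, 0xBB, 0xBF, 0x4E, 0x75, 0x6D, 0x62, 0x65, 0x72 };

        // Option 1: StreamReader skips the BOM while decoding.
        string viaReader = CsvText.Decode(data, Encoding.UTF8);

        // Option 2: decode with GetString and remove a leading U+FEFF manually.
        string viaTrim = Encoding.UTF8.GetString(data).TrimStart('\uFEFF');

        Console.WriteLine(viaReader == "Number");   // True
        Console.WriteLine(viaTrim == "Number");     // True
    }
}
```

Passing Encoding.Default as the fallback would roughly match the ANSI behaviour mentioned in the question, with the caveat that a UTF-8 file saved without a BOM would then be decoded with the ANSI code page.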

xanatos