How it is all set up:
- I receive a byte[] which contains CSV data. I don't know the encoding (it should be Unicode / UTF-8).
- I need to detect the encoding or fall back to a default (the text may contain umlauts, so the encoding is important)
- I need to read the header line and compare it with defined strings
After a short search for how to get a string out of the byte[], I found "How to convert byte[] to string?", which suggested using something like
string result = System.Text.Encoding.UTF8.GetString(byteArray);
I now use this helper to detect the encoding and afterwards the Encoding.GetString method to read the string, like so:
string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);
But when I now try to compare values from this result string with static strings in my code, all comparisons fail!
// header is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++) {
    switch (headers[i].Trim().ToLower()) {
        case "number":
            // do
            break;
        default:
            throw new Exception();
    }
}
// where (headers[i].Trim().ToLower()) => "number"
While this seems to be a problem with the encoding of both strings, my question is:
How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?
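To make the question concrete, here is a minimal sketch of the steps listed at the top: decode the byte[] using a byte order mark if one is present, otherwise fall back to a default encoding, then read the header line. The class `CsvDecoding`, its methods `Decode` and `ReadHeaders`, and the `;` separator used in the usage note are hypothetical names and assumptions, not part of my actual code or of the TextFileEncodingDetector helper. It relies on the `StreamReader(Stream, Encoding, bool)` constructor, which only recognises encodings that announce themselves with a BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class CsvDecoding
{
    // Hypothetical helper: decodes the bytes using a BOM if one is present,
    // otherwise with the supplied fallback encoding.
    public static string Decode(byte[] data, Encoding fallback)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, fallback, true))
        {
            // StreamReader consumes a detected BOM, so it is not part of the result.
            return reader.ReadToEnd();
        }
    }

    // Returns the trimmed, lower-cased column names of the first line.
    // The separator is an assumption; the original data format is not specified.
    public static string[] ReadHeaders(string csv, char separator)
    {
        using (var reader = new StringReader(csv))
        {
            string firstLine = reader.ReadLine() ?? string.Empty;
            string[] headers = firstLine.Split(separator);
            for (int i = 0; i < headers.Length; i++)
            {
                headers[i] = headers[i].Trim().ToLowerInvariant();
            }
            return headers;
        }
    }
}
```

Usage would look like `CsvDecoding.ReadHeaders(CsvDecoding.Decode(data, Encoding.Default), ';')`. Note that UTF-8 data without a BOM would be decoded with the fallback in this sketch, so umlauts could still be mangled in that case.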
Edit
The code supplied above worked as long as the string data came from a file that was saved this way:
string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;
while ((line = reader.ReadLine()) != null)
{
    if (line.Length > 1)
    {
        tw.WriteLine(line);
    }
}
tw.Close();
and afterwards read back with File.ReadAllText(). This:
A. forces the file to be Unicode (ANSI format kills all umlauts)
B. requires the written file to be accessible
Now I only have the inputStream and tried what I posted above. As I mentioned, this worked before and the strings look identical, but they are not.
Note: If I use an ANSI-encoded file, which is read with Encoding.Default, everything works fine.
Edit 2
While ANSI-encoded data works, the UTF-8 encoded data (Notepad++ only shows "UTF-8", not "UTF-8 w/o BOM") starts with char[0] = 65279.
So where is my error? I assume System.Text.Encoding.UTF8.GetString(byteArray) is working correctly.
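For reference, 65279 is U+FEFF, the code point of the byte order mark. `Encoding.UTF8.GetString` decodes the three BOM bytes (EF BB BF) like any other character and leaves U+FEFF at the start of the string. As far as I can tell, on .NET Framework 4 and later `Trim()` does not remove it, because `Char.IsWhiteSpace` returns false for U+FEFF, which would explain why the first header never equals "number". A minimal sketch of one possible workaround follows; the class `BomDemo` and its method `StripBom` are hypothetical names, not an existing API.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: removes a single leading U+FEFF (decimal 65279), if present.
    public static string StripBom(string text)
    {
        if (text.Length > 0 && text[0] == '\uFEFF')
        {
            return text.Substring(1);
        }
        return text;
    }

    // Decodes UTF-8 bytes and drops the BOM character that GetString leaves in place.
    public static string DecodeUtf8(byte[] data)
    {
        return StripBom(Encoding.UTF8.GetString(data));
    }
}
```

With this, `BomDemo.DecodeUtf8(data)` yields a string whose first header can be compared after `Trim().ToLower()` as in the loop above, assuming the data really is UTF-8.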