UTF-8 to C# string from a DataReader

Question

I have a MS SQL query that returns a UTF-8 encoded field (a varchar). How do I convert it to a .NET Unicode string? UTF8Encoding.GetString will do the job ... if you have an array of bytes - I don't.

If you don't have an array of bytes, then what do you have? If it's a string, then it is has probably already been decoded by your database or ADO.NET provider. — Branko Dimitrijevic, Feb 21 '12 at 16:53
And what do you have? I don't think you need to convert SQL strings to .NET strings. — the_joric, Feb 21 '12 at 16:53
It's a varchar ... so its a string as far as the datareader is concerned — Rob, Feb 21 '12 at 17:27
@Rob And so should be presented to your C# code as a string. Why can't you just use that string? Is there a specific problem? — Branko Dimitrijevic, Feb 21 '12 at 19:05
Yes its a string - i.e. an ANSI string converted to Unicode. It isn't. Its a UTF-8 string and I need to get at it before it is converted. — Rob, Feb 21 '12 at 21:37

score 4 · Answer 1 · edited Aug 06 '12 at 11:06

I adapted what I had learned from the MSDN forums. The query returns a varchar with UTF-8 encoded text, datetime and int values. I'm building a CSV file from the results.

using (var sw = new StreamWriter(txtOutputFile, true, Encoding.UTF8))
{
  while (r.Read())
  {
    var colCount = r.FieldCount;
    var curCol = 1;

    var utf8 = new UTF8Encoding();
    var row = "";

    if (r[0].GetType() == typeof(SqlString))
    {
      SqlString sqlString = r.GetSqlString(0);
      Byte[] encodedBytes = sqlString.GetNonUnicodeBytes();
      row = "\"" + utf8.GetString(encodedBytes) + "\"";
    }
    else
    {
      row = "\"" + r[0].ToString() + "\"";
    }

    while (curCol < colCount)
    {
      if (r[curCol].GetType() == typeof(SqlString))
      {
        SqlString sqlString = r.GetSqlString(curCol);
        Byte[] encodedBytes = sqlString.GetNonUnicodeBytes();
        row += ",\"" + utf8.GetString(encodedBytes) + "\"";
      }
      else
      {
        row += ",\"" + r[curCol].ToString() + "\"";
      }
      curCol += 1;
    }

    sw.WriteLine(row);
  }

}

score 3 · Answer 2 · edited May 23 '17 at 12:08

If you have a C# string whose content is interpreted incorrectly as being UTF-16 encoded, then you could reinterpret the underlying bytes:

public static string InterpretAsUTF8(string value)
{
  byte[] rawData = Encoding.Default.GetBytes(value);
  string reencoded = Encoding.UTF8.GetString(rawData);
  return reencoded;
}

However, if you have access to the SqlDataReader, then as suggested by @CarlSteffen, you could do this:

//Assuming the column index you're interested in is 1:
SqlString utf8EncodedString = reader.GetSqlString(1);
string expectedResult = Encoding.UTF8.GetString(utf8EncodedString.GetNonUnicodeBytes());

Adriano Repetti · Answer 3 · 2015-04-13T09:10:40.063

1

Good starting point: http://msdn.microsoft.com/en-us/library/kdcak6ye.aspx To get the bytes of the original string you may use the System.Text.Encoding.UTF8.GetBytes() method.

edited Apr 13 '15 at 09:10

answered Feb 21 '12 at 16:56

Adriano Repetti

65,416
20
137
208

Hmm... you are suggesting to encode the string back to byte array and then decode it back to string again (and even transcode it from byte array to byte array by `Encoding.Convert` in the middle of all that)? What possible purpose would that serve? – Branko Dimitrijevic Feb 21 '12 at 17:11
I tried taking each character from the string and using it to set a byte array. That didn't work because high-bit characters had already been mapped to their Unicode equivalents. – Rob Feb 21 '12 at 17:29
@Rob _Why_ did you try that? Why can't you just use the string as-is? Is there a specific problem you are not telling us about? – Branko Dimitrijevic Feb 21 '12 at 19:08
Because the string 'as-is' is not the one in the database, it has been converted. For example (and I am making up the numbers), the UTF8 encoding for a Chinese character might be 0x80 0x99 0x14. However when ADO.NET (or the CLR) sees those three characters it automatically converts them into their Unicode representation. 0x99, the Windows character for the TM symbol gets converted to Unicode 0x8482. I cannot pass that to the UTF decoder. – Rob Feb 22 '12 at 00:05

score 1 · Accepted Answer · edited May 23 '17 at 12:30

1

Adapted a possible solution from this question: C# Convert string from UTF-8 to ISO-8859-1 (Latin1) H

var dbEnc = Encoding.UTF8;
var uniEnc = Encoding.Unicode;
byte[] dbBytes = dbEnc.GetBytes(dbString);
byte[] uniBytes = Encoding.Convert(dbEnc, uniEnc, dbBytes);
string msg = uniEnc.GetString(uniBytes);

Does this help?

edited May 23 '17 at 12:30

Community

1
1

answered Feb 21 '12 at 17:08

raveturned

2,637
24
30

I think the problem is that the string fetched from the provider has already been messed with. – Rob Feb 21 '12 at 18:14
1

Hmm. How about reading the data into a byte array using DbDataReader.GetBytes(…)? Also, do you have control over what's going into the database? It may be worth updating your question with more detail and context, such as example DB data, current output from your current code and the end result you want to achieve. – raveturned Feb 22 '12 at 09:38
I'll give this a go when I get back into the office. – Rob Feb 23 '12 at 11:46
It seems it simply cannot be done with a varchar field. Pretty amazing really. However, it can be done with a TEXT field and just casting is enough (i.e. cast(description as text) as description ). Once it is TEXT DbDataReader can get the raw string. – Rob Feb 27 '12 at 12:09
Interesting. As a heads-up, TEXT/NTEXT are being deprecated in future versions of SQL Server: http://msdn.microsoft.com/en-us/library/ms187993.aspx I tend to use nvarchar rather than varchar, does that behave any better? – raveturned Feb 28 '12 at 17:10

UTF-8 to C# string from a DataReader

4 Answers4