
An ASP.NET handler (.ashx) receives a GET request containing a UTF-8 string. It reads from a SQL Server database that holds Windows-1255 data.

I can't seem to get them to work together. I've used information gathered on SO (mainly "Convert a string's character encoding from windows-1252 to utf-8") as well as MSDN on the subject.

When I run anything through the functions below - it always ends up the same as it started - not converted at all.

Is something done wrong?

EDIT

What I'm specifically trying to do (getData returns a Dictionary<int, string>):

    getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))

Result is empty, unless I send a "neutral" character such as "'" or ",".

CODE

    string windows1255FromUTF8(string p)
    {
        Encoding win = Encoding.GetEncoding(1255);
        Encoding utf8 = Encoding.UTF8;

        byte[] utfBytes = utf8.GetBytes(p);
        byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
        return win.GetString(winBytes);
    }

    string UTF8FromWindows1255(string p)
    {
        Encoding win = Encoding.GetEncoding(1255);
        Encoding utf8 = Encoding.UTF8;

        byte[] winBytes = win.GetBytes(p);
        byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
        return utf8.GetString(utfBytes);
    }
JNF

2 Answers


There is nothing wrong with the functions; they are simply useless.

Each function encodes the string into bytes, converts those bytes from one encoding to the other, then decodes the result back into a string. Unless the string contains a character that cannot be encoded in windows-1255, the returned value will be identical to the input.
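As a rough illustration of that point, here is a minimal sketch that repeats the steps of the question's `windows1255FromUTF8` and compares the result with the input. The class name `RoundTripDemo` and the sample strings are made up for the example, and the provider registration is an assumption about the runtime (see the comment).

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    static RoundTripDemo()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1255
        // must be registered before use. On .NET Framework (the question's
        // ASP.NET setup) code page 1255 is available without this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Same steps as the question's function: encode, convert, decode.
    public static string Windows1255FromUtf8(string p)
    {
        Encoding win = Encoding.GetEncoding(1255);
        Encoding utf8 = Encoding.UTF8;

        byte[] utfBytes = utf8.GetBytes(p);
        byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
        return win.GetString(winBytes);
    }

    public static void Main()
    {
        // Hebrew letters gimel and dalet, both present in Windows-1255.
        string hebrew = "\u05D2\u05D3";
        Console.WriteLine(Windows1255FromUtf8(hebrew) == hebrew);

        // A Cyrillic letter that Windows-1255 cannot represent; this is the
        // one case where the round trip would be expected to change the text.
        string cyrillic = "\u0416";
        Console.WriteLine(Windows1255FromUtf8(cyrillic) == cyrillic);
    }
}
```

For text that fits in Windows-1255 the function hands back the same string it was given, which is why calling it appears to do nothing.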

Strings in .NET don't have an encoding. If you get a string from a source where the text was encoded using, for example, UTF-8, once it's decoded into a string it doesn't have that encoding any more. You don't have to do anything to a string to use it when the destination has a specific encoding; whatever library you are using that takes the string will take care of the encoding.
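To make that concrete, here is a small sketch showing that an encoding only comes into play when a string is turned into bytes at an output boundary. The class and method names (`BoundaryDemo`, `Utf8Hex`, `Win1255Hex`) are hypothetical, and the provider registration is again an assumption about the runtime.

```csharp
using System;
using System.Text;

public static class BoundaryDemo
{
    static BoundaryDemo()
    {
        // Assumption: needed on .NET Core / .NET 5+ only; .NET Framework
        // already knows code page 1255.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Bytes the string becomes when written out as UTF-8.
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    // Bytes the same string becomes when written out as Windows-1255.
    public static string Win1255Hex(string s)
    {
        return BitConverter.ToString(Encoding.GetEncoding(1255).GetBytes(s));
    }

    public static void Main()
    {
        string gimel = "\u05D2"; // one string, no encoding attached to it

        Console.WriteLine(Utf8Hex(gimel));    // D7-92
        Console.WriteLine(Win1255Hex(gimel)); // E2
    }
}
```

The string itself is the same in both calls; only the byte representation chosen at the boundary differs.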

Guffa
  • When I `Response.Write` `context.Request.QueryString["q"]` and `getData()` (string from DB) they display differently. When I switch page encoding on the browser from UTF8 to windows-1255 and back one looks OK and the other is gibberish. If I understand you correctly - this shouldn't be true. – JNF Jan 13 '13 at 12:50
  • @JNF: Yes, that means that one of the strings is not decoded correctly. That needs to be fixed at the source, once it is decoded into a string it's too late to fix it, as the data is already lost. – Guffa Jan 13 '13 at 13:02
  • So how does it display correctly in the browser? If the right encoding is chosen, the browser shows the respective string fine. If data is lost it shouldn't be able to recover. – JNF Jan 13 '13 at 14:14
  • @JNF: There are some characters that can be decoded incorrectly and then recovered by encoding incorrectly also, so you seem to have been lucky with those specific characters, but it's not a reliable way to correct the problem. – Guffa Jan 13 '13 at 18:10
  • Thank you for the information in your answer. I posted the solution that worked at last. – JNF Jan 20 '13 at 08:03

For some reason this worked:

    byte[] fromBytes = Encoding.UTF8.GetBytes(myString);
    string finalString = Encoding.GetEncoding(1255).GetString(fromBytes);

Switching encoding without the conversion...
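For readers wondering what those two lines actually do, here is a hedged sketch. It does not convert anything; it reads the UTF-8 bytes back as if they were Windows-1255, which yields a different string. The names `ReinterpretDemo`, `Utf8BytesAsWin1255` and `Win1255BytesAsUtf8` are invented for the illustration, and the provider registration is an assumption about the runtime.

```csharp
using System;
using System.Text;

public static class ReinterpretDemo
{
    static ReinterpretDemo()
    {
        // Assumption: needed on .NET Core / .NET 5+ only; .NET Framework
        // already knows code page 1255.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // The two lines from the answer: UTF-8 bytes read back as Windows-1255.
    public static string Utf8BytesAsWin1255(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Encoding.GetEncoding(1255).GetString(bytes);
    }

    // The opposite reinterpretation: Windows-1255 bytes read back as UTF-8.
    public static string Win1255BytesAsUtf8(string s)
    {
        byte[] bytes = Encoding.GetEncoding(1255).GetBytes(s);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string gimel = "\u05D2";
        string reinterpreted = Utf8BytesAsWin1255(gimel);

        // One letter is two bytes in UTF-8, so it comes back as two characters.
        Console.WriteLine(reinterpreted.Length);
        Console.WriteLine(reinterpreted == gimel);

        // Undoing the reinterpretation recovers the letter for this input,
        // but that is not guaranteed for every character.
        Console.WriteLine(Win1255BytesAsUtf8(reinterpreted) == gimel);
    }
}
```

If this makes `Contains` start matching, a plausible reading is that the strings coming from the database had already been through the same mis-decoding, so the query text is being damaged in the same way rather than either side being repaired.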

JNF
  • What you are doing is wrong, and it will fail for some characters. If you seem to get the result by encoding and decoding using the wrong encoding, that means that the string was produced by decoding it using the wrong encoding to start with. To get it right you have to change how the string is created in the first place, not try to fix the string once it's already wrong. – Guffa Jan 20 '13 at 12:31
  • @Guffa I realize this is a hack, but so far it has worked and I found no other solution. I understood from you that .NET decodes the strings itself - how could I change the way it is created? – JNF Jan 21 '13 at 09:48
  • If .NET decodes the strings wrong, then it's because it has the wrong idea about how the content arrives. You would need to change the request encoding so that it corresponds with how the data actually is encoded. – Guffa Jan 21 '13 at 09:55
  • It comes from a `nvarchar` column in a table stored on a ms sqlserver. What might go wrong there? – JNF Jan 21 '13 at 12:12
  • If that value is wrong, I can't see any other reason than that it was already wrong when it was stored in the database. Have you checked that the value from the query string is correct? – Guffa Jan 21 '13 at 13:08
  • I simply type it into the address bar on the browser. How wrong can that go? – JNF Jan 22 '13 at 21:04
  • There is no guarantee that the browser handles that correctly. You should check what the value is when it arrives on the server. – Guffa Jan 22 '13 at 23:33
  • What do I check it against? As far as I can tell both are valid strings, they just aren't compatible. – JNF Jan 23 '13 at 08:01
  • Check it against what you sent from the browser. – Guffa Jan 23 '13 at 08:11
  • I send 'q=ג' from the browser address bar. I tell the page to print `Request["q"]`. It looks fine when page is displayed utf8 and messed up when page is displayed win-1255. – JNF Jan 23 '13 at 10:20
  • Then you have to check what the response encoding actually is. If it is win-1255, then the value is decoded wrong by the server, or encoded wrong by the browser. – Guffa Jan 23 '13 at 10:58
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/23217/discussion-between-jnf-and-guffa) – JNF Jan 23 '13 at 12:10
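
One way to carry out the check suggested in the comments, without depending on how the browser renders the page, is to print the code points of each string on the server. This is only a sketch; `CodePointDump` and `Describe` are hypothetical names, not part of ASP.NET.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Lists every UTF-16 code unit of the string as U+XXXX, so the result
    // looks the same whatever encoding the page is displayed in.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // A correctly decoded gimel is a single code point.
        Console.WriteLine(Describe("\u05D2")); // U+05D2
    }
}
```

In the handler, one could write out `Describe(context.Request.QueryString["q"])` next to `Describe` of a value returned by `getData()`. If a single typed letter shows up as two code points on either side, that side was probably decoded with the wrong encoding before it became a string.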