
I'm working with the ICQ protocol and ran into a problem with special letters (e.g. diacritics). I read that ICQ uses a different encoding (CP-1251, if I remember correctly).

How can I decode the message text using the correct encoding?

I've tried using the UTF8Encoding class, but without success.

I'm using the ICQ-sharp library.

    private void ParseMessage (string uin, byte[] data)
    {
        ushort capabilities_length = LittleEndianBitConverter.Big.ToUInt16 (data, 2);
        ushort msg_tlv_length = LittleEndianBitConverter.Big.ToUInt16 (data, 6 + capabilities_length);
        string message = Encoding.UTF8.GetString (data, 12 + capabilities_length, msg_tlv_length - 4);

        Debug.WriteLine(message);
    }

If the contact uses the same client, everything is OK. If not, incoming and outgoing messages with diacritics are unreadable.

I've determined (using this -> https://stackoverflow.com/a/12853721/846232) that the text is in the BigEndianUnicode encoding. But if the string contains no diacritics, decoding it that way gives unreadable output (Chinese characters), whereas decoding it as UTF8 works fine. I don't know how to make it decode correctly in every case.

sczdavos
  • Wait, are you saying that, using UTF-16, texts with diacritics work, but texts without diacritics don't work? Could it be that it uses US-ASCII if it fits (no diacritics) and UTF-16 if it contains diacritics? Trying to use as UTF-16 to decode text that is encoded as ASCII could certainly produce Chinese characters... – johv Oct 27 '12 at 20:12
  • I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". – John Saunders Oct 27 '12 at 20:28

1 Answer


If UTF-8 kinda works (i.e. it works for "english", or any US-ASCII characters), then you don't have UTF-16. Latin1 (or Windows-1252, Microsoft's variant), or e.g. Windows-1251 or Windows-1250 are perfectly possible though, since the first part of each of these, containing the Latin letters without diacritics, is the same.
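To make that concrete, here is a small sketch (independent of ICQ#, with a made-up sample string) showing that plain ASCII bytes decode identically under UTF-8 and Latin1, while reading the same bytes as big-endian UTF-16 pairs them up into unrelated code points, which is where the "Chinese letters" come from:

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    public static void Main()
    {
        // "Ahoj" contains only US-ASCII characters: bytes 41 68 6F 6A.
        byte[] ascii = Encoding.ASCII.GetBytes("Ahoj");

        // ASCII-compatible encodings agree on these bytes.
        string asUtf8 = Encoding.UTF8.GetString(ascii);
        string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(ascii);
        Console.WriteLine(asUtf8 == asLatin1); // both are "Ahoj"

        // UTF-16BE consumes two bytes per character: 0x4168 and 0x6F6A.
        string asUtf16 = Encoding.BigEndianUnicode.GetString(ascii);
        foreach (char c in asUtf16)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }
    }
}
```

Both resulting code points fall in the CJK ranges, so a message that looks Chinese after UTF-16 decoding is a strong hint that it was actually sent in a single-byte encoding.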

Decode like this:

    var encoding = Encoding.GetEncoding("Windows-1250");
    string message = encoding.GetString(data, 12 + capabilities_length, msg_tlv_length - 4);

johv
  • This also works only for letters without diacritics. If there is even one letter with a diacritic, the whole message is unreadable. I'm using the ICQ# library to work with the ICQ protocol. I don't know exactly how it works, but if both contacts use my client everything is all right, while other clients probably use a different encoding, and that's where I have a problem. – sczdavos Oct 27 '12 at 20:36
  • So, if you use "UTF-16" instead, does it then only work for messages with diacritics instead? (when communicating with other clients) – johv Oct 28 '12 at 20:25
  • Yes, UTF-16 with big-endian byte order works only for messages with diacritics (when communicating with other clients). For sending I'm using CP-1251. It works great for everything except diacritics (if the text contains diacritics, it just removes them). But for incoming messages this does not work. It's really strange, I know :D But I have this problem only with the ICQ protocol. I'm also working with Skype and XMPP and all is perfect. I'm using the ICQ# library because I haven't found any other working one that is easy to use. And I haven't found any library with documentation; ICQ# doesn't have it either. – sczdavos Oct 29 '12 at 17:23
  • Surely there must be some way to know what encoding an incoming message has. See this bug report, they're talking about ICQ encodings: https://developer.pidgin.im/ticket/10833 – johv Oct 30 '12 at 07:55
  • As I understand it, I should check whether the message contains only ASCII chars. I've already tried this: `msg.ToCharArray().Any(c => c > 255);` which should return whether the message contains any char with an ASCII code bigger than 255. But this is not working. I found that e.g. for `č`, which I think is 237 in ASCII, I get a different value, around 266. So I don't know how to check whether a message contains non-ASCII chars. For sending, I'm now just converting to ASCII ('č' -> 'c', 'š' -> 's' etc.). But incoming messages I need to decode first, and I don't know how to check whether the result is text or just unreadable chars. – sczdavos Oct 30 '12 at 15:16
  • There is probably some other data field which can tell you what encoding the message has. Note that ASCII means `c<128`, i.e. plain "english" letters with no diacritics. I don't see how checking for ASCII would help when decoding a message though. – johv Oct 30 '12 at 15:23
  • I mean checking for ASCII when sending a message. As they wrote, they check whether the message has only ASCII chars and then encode it as ASCII, otherwise as UTF-16. I thought that for an incoming message I would just try to decode it as ASCII, and if the result is just unreadable Chinese chars I would decode it as UTF-16. But I don't know how to check this. Checking for ASCII chars doesn't work. – sczdavos Oct 30 '12 at 15:32
  • Well, UTF-16 incorrectly decoded as Latin-1 (or CP-1252 or something) might indeed look like ASCII. Can't you use the "Block Character Set" field (in the packet) that they are talking about? – johv Oct 30 '12 at 16:14
  • Ahh I see, sorry, I just misread this. How can I access this field? I don't have much experience with decoding `byte` arrays. – sczdavos Oct 30 '12 at 16:59
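
A possible way to read such a field, as a rough sketch under stated assumptions rather than anything confirmed by ICQ#: the question's code skips 4 bytes between the length at `6 + capabilities_length` and the text at `12 + capabilities_length`, so the sketch assumes the first two of those bytes (offset `8 + capabilities_length`) hold a big-endian character-set code. The values `0x0000` (US-ASCII) and `0x0002` (UTF-16, big-endian) are also assumptions based on how other OSCAR clients appear to label them, and should be checked against a real packet capture. `IcqMessageDecoder`, `DecodeMessage` and `ReadUInt16BE` are hypothetical names.

```csharp
using System;
using System.Text;

static class IcqMessageDecoder
{
    // Assumed character-set codes; verify against real traffic.
    const ushort CharsetAscii = 0x0000;
    const ushort CharsetUnicode = 0x0002; // UTF-16, big-endian

    // Reads a 16-bit big-endian value without any helper library.
    static ushort ReadUInt16BE(byte[] data, int offset)
    {
        return (ushort)((data[offset] << 8) | data[offset + 1]);
    }

    // 'fallback' is used for any other code, e.g. a local code page.
    public static string DecodeMessage(byte[] data, Encoding fallback)
    {
        ushort capsLength = ReadUInt16BE(data, 2);
        ushort fragmentLength = ReadUInt16BE(data, 6 + capsLength);
        // Assumed position of the character-set field.
        ushort charset = ReadUInt16BE(data, 8 + capsLength);

        Encoding encoding;
        switch (charset)
        {
            case CharsetAscii: encoding = Encoding.ASCII; break;
            case CharsetUnicode: encoding = Encoding.BigEndianUnicode; break;
            default: encoding = fallback; break;
        }

        // Same offsets as in the question: text follows the 4 skipped bytes.
        return encoding.GetString(data, 12 + capsLength, fragmentLength - 4);
    }
}
```

It could be called as `IcqMessageDecoder.DecodeMessage(data, Encoding.GetEncoding("Windows-1250"))`, which would explain the observed behaviour: messages with diacritics arrive flagged as UTF-16, plain ones as ASCII, so no single fixed encoding works for both.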