
I've written an appointment scheduling system which (among other things) sends out a reminder SMS the day before an appointment is due. It asks the user to confirm their attendance at the appointment by replying "OK" to the text.

Where people do reply it generally works well and has cut out a huge manual workload. I'm now in the process of tidying up a couple of defects (thankfully they're few and of low impact) but occasionally I see responses of @u{some string}. I don't have rules to parse this so they go into an invalid responses bucket for manual follow-up.

Today I saw a response that looked as follows:

@u004f006b

I'm pretty sure at this stage that the @u denotes that what follows is Unicode (similar to the \u designator in C#) so making that assumption I get the following:

U+004F => decimal 79 => O (uppercase)

U+006B => decimal 107 => k (lowercase)

The SMS provider tells me that the message is hitting their servers like that, so it must be a client issue, right? I've looked in my SMS sending app (ChompSMS on Android 7.x) and can't see any setting that would make it explicitly send in Unicode rather than ASCII, so I'm wondering how this happens.

I pulled 10 random responses that began with this Unicode designator out of the database and had a go at writing something to deal with them. What follows is my naïve attempt at this:

using System;
using System.Text;

namespace CharConversion
{
    class Program
    {
        static void Main()
        {
            string[] unicodeResponses = new string[]
            {
                "@U00430061006e20190074002000620065002000610062006c006500200074006f002000620065002000740068006500720065",
                "@U004f006b002000bf00bf",
                "@U004f006b002000bf00bf",
                "@U004f004b002000bf00bf",
                "@U004f006b002000bf00bf",
                "@U00d2006b",
                "@U004f004b",
                "@U004f006b00610079002000bf00bf0020",
                "@U004f004b",
                "@U004f006b00bf00bf00bffffd"
            };

            foreach (string unicodeResponse in unicodeResponses)
            {
                string characters2 = UnicodeCodePointsToString(unicodeResponse);
                Console.WriteLine("'{0}' is '{1}' in plain text", unicodeResponse, characters2);
            }

            Console.Read();
        }

        private static string UnicodeCodePointsToString(string unicodeResponse)
        {
            string[] characterByteValues = SplitStringEveryN(unicodeResponse.Substring(2), 4);
            char[] characters = new char[characterByteValues.Length];

            for (int i = 0; i < characterByteValues.Length; i++)
            {
                int ordinal = Int32.Parse(characterByteValues[i], System.Globalization.NumberStyles.HexNumber);
                characters[i] = (char) ordinal;
            }

            return new string(characters);
        }

        private static string[] SplitStringEveryN(string input, int splitLength)
        {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < input.Length; i++)
            {
                if (i % splitLength == 0)
                {
                    sb.Append(' ');
                }
                sb.Append(input[i]);
            }

            string[] returnValue = sb.ToString().TrimStart().Split(' ');
            return returnValue;
        }
    }
}

My questions:

  1. Why is this happening in the first place?

  2. With the code - is there anything I'm missing here? E.g. is there something in the Framework that can already handle this for me, or is there some glaring shortcoming that People Who Know All About Unicode can see? Is there something I can do better?

  3. Some of the code points still render as upside-down question marks (I suspect these are emojis) - is there any way I can handle them?

EDIT 2018-04-26 A note for posterity

(I was going to put this in a comment but it looked awful no matter what I did with it)

I had a look at the link in the accepted answer, and while the code is more concise than mine, the output at the end is identical - including the inverted question marks (and the glyphs I suspect are emojis). Some more reading on the differences between Unicode and UCS2 can be found here and the Wikipedia article is worth a read as well:

TL;DR

  • UCS-2 is obsolete and has since been replaced with UTF-16
  • UCS-2 is a fixed width encoding scheme while UTF-16 is a variable width encoding scheme
  • UTF-16 capable applications can read UCS-2 files but not the other way around
  • UTF-16 supports right to left scripts while UCS-2 does not
  • UTF-16 supports normalization while UCS-2 does not
noonand
  • I suspect that the inverted question marks are indeed inverted question marks. For example, the second sample string ends with "0x00BF, 0x00BF", which are Unicode inverted question marks (¿). They are used in Spanish, for example, so there is a chance they are legitimate (like an "OK ¿¿" message). – Evk Apr 25 '18 at 10:07
  • And in the last string, the last character is `0xfffd`, which is the generic Unicode replacement character, used to replace unrecognized data. So I suspect that even if there were some emojis, they were lost _before_ this data reached you. – Evk Apr 25 '18 at 10:12
  • ^ This. Something upstream is doing this as the supplier insists that this is how the messages arrive to them. Anyway I think it's "good enough" now so that the people reading the report can infer enough meaning from it now. Thanks again! – noonand Apr 26 '18 at 07:29
  • Yes, as I understand all you need is to figure out whether reply is "OK" or not, and that's good enough for that. – Evk Apr 26 '18 at 07:31
  • Exactly, it still goes into the invalid response category but the end user can infer enough meaning from it so they can manually confirm the appointment without having to ring the customer. – noonand Apr 26 '18 at 07:33

2 Answers

An SMS message can be encoded in one of several encodings: 7-bit (GSM-7), 8-bit, and 16-bit (UCS-2). While most SMS programs pick the least wasteful encoding that fits the message, there is nothing invalid about using the 16-bit one even if every character falls within the range of the other encodings. I assume that is what happens in your case. Of course, SMS messages are transferred as bytes, not as strings like u004f006b, so why it is represented that way is down to the tools you use or the third parties you work with.
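As a rough illustration of the encoding choice, here is a minimal sketch that checks whether a reply would fit in GSM-7. It assumes (this is typical client behaviour, not something I have confirmed for ChompSMS) that a sender switches to UCS-2 as soon as one character falls outside the GSM-7 alphabet. The class name `Gsm7Check` is hypothetical, and the character tables are typed from memory of the GSM 03.38 default alphabet, so verify them against the spec before relying on this.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of any framework.
static class Gsm7Check
{
    // GSM 03.38 default alphabet (assumption: transcribed from memory, check against the spec).
    private const string BasicAlphabet =
        "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ" +
        " !\"#¤%&'()*+,-./0123456789:;<=>?" +
        "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§" +
        "¿abcdefghijklmnopqrstuvwxyzäöñüà";

    // Characters reachable through the GSM-7 escape (extension) table.
    private const string ExtensionTable = "^{}\\[~]|€\f";

    // True when every character can be sent as GSM-7; otherwise the
    // message needs a 16-bit encoding such as UCS-2.
    public static bool FitsGsm7(string text) =>
        text.All(c => BasicAlphabet.IndexOf(c) >= 0 || ExtensionTable.IndexOf(c) >= 0);
}
```

With these tables, the curly apostrophe (U+2019) in the first sample and the Ò (U+00D2) in the sixth are both outside GSM-7, which would be enough to force those replies into UCS-2. It does not explain the plain "OK" replies, which fit GSM-7 and were presumably sent as 16-bit for some other reason.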

As for your parsing code: it assumes that the string is UTF-16 (the internal representation of a C# string), but if the above is correct, the encoding is UCS-2. That is very similar to UTF-16 but not exactly the same. I'm not really qualified to discuss the differences, but you can look at, for example, this answer for some clues about how to work with it. That might also be the reason why some characters are decoded incorrectly.
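For what it's worth, here is a minimal sketch of letting the Framework do the decoding, assuming the payload after `@U` is a sequence of big-endian 16-bit code units (that matches the samples, but it is an assumption about the provider's format). The class name `SmsHexDecoder` is hypothetical. It turns the hex into bytes and hands them to `Encoding.BigEndianUnicode`, which is the UTF-16 big-endian decoder. For characters in the Basic Multilingual Plane, UCS-2 and UTF-16 use the same code units, so the result should match a character-by-character cast.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of any framework.
static class SmsHexDecoder
{
    // Decodes a "@U"-prefixed hex payload (assumed: big-endian 16-bit code units).
    public static string Decode(string response)
    {
        if (response == null || !response.StartsWith("@U", StringComparison.OrdinalIgnoreCase))
            throw new FormatException("Expected a payload starting with @U.");

        string hex = response.Substring(2);
        if (hex.Length % 4 != 0)
            throw new FormatException("Payload is not a whole number of 16-bit units.");

        // Two hex digits per byte.
        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = byte.Parse(hex.Substring(i * 2, 2),
                                  NumberStyles.HexNumber,
                                  CultureInfo.InvariantCulture);
        }

        // UTF-16 big-endian decoder from the Framework.
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

Calling `SmsHexDecoder.Decode("@U004f006b")` gives "Ok". Note that this cannot bring back anything that was already replaced upstream: a 00bf or fffd in the payload decodes to exactly that character.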

Evk

Here is a simpler method:

using System;
using System.Text;

namespace CharConversion
{
    class Program
    {
        static void Main()
        {
            string[] unicodeResponses = new string[]
            {
                "@U00430061006e20190074002000620065002000610062006c006500200074006f002000620065002000740068006500720065",
                "@U004f006b002000bf00bf",
                "@U004f006b002000bf00bf",
                "@U004f004b002000bf00bf",
                "@U004f006b002000bf00bf",
                "@U00d2006b",
                "@U004f004b",
                "@U004f006b00610079002000bf00bf0020",
                "@U004f004b",
                "@U004f006b00bf00bf00bffffd"
            };

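            foreach (string unicodeResponse in unicodeResponses)
            {
                // Build each message separately so the replies don't run together.
                StringBuilder message = new StringBuilder();

                // Skip the "@U" prefix, then read one 16-bit code unit per 4 hex digits.
                for (int i = 2; i + 4 <= unicodeResponse.Length; i += 4)
                {
                    // UInt16 avoids the negative values Int16 yields for units >= 0x8000 (e.g. fffd).
                    message.Append((char)UInt16.Parse(unicodeResponse.Substring(i, 4), System.Globalization.NumberStyles.HexNumber));
                }

                Console.WriteLine(message);
            }
            Console.Read();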
        }


    }
}
jdweng
  • Code dumps are not considered answers, especially when the OP asks questions that are not answered by code. Not sure how this dump helps anyone understand the issue – Camilo Terevinto Apr 24 '18 at 17:27
  • While this is somewhat useful (it makes the code more succinct) it's at the expense of readability and maintainability IMO. Also, it only attempts to answer a third of the question as asked – noonand Apr 24 '18 at 19:12
  • I made it better. When somebody creates code that is extremely complicated instead of something very simple, you do not try to fix the bad code. Instead, recommend a better solution. – jdweng Apr 24 '18 at 23:35