
I read this very important blog regarding string encodings.

After reading it I realized that Unicode is a standard that maps characters to code points, which are integers. How these integers are stored in memory is an entirely separate concern. This is where encodings such as .utf8 and .utf16 come into play: they define how these integers are stored in memory.
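To make the distinction concrete, here is a minimal sketch using only the Swift standard library. The sample string is an assumption chosen for illustration; the code prints each code point and then the code units that UTF-8 and UTF-16 use to store the same two characters.

```swift
// Sample text (an assumption for illustration): U+00E9 and U+20AC.
let text = "\u{E9}\u{20AC}"

// Code points: the integers that Unicode assigns to each character.
for scalar in text.unicodeScalars {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}

// The same code points stored as UTF-8 code units (8 bits each).
let utf8Units = Array(text.utf8)
print(utf8Units.map { String($0, radix: 16) })

// The same code points stored as UTF-16 code units (16 bits each).
let utf16Units = Array(text.utf16)
print(utf16Units.map { String($0, radix: 16) })
```

The code points stay the same in both cases; only the number and width of the stored units differ.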

In the Swift String API there is a method which gives us the data bytes used to represent the String in various encodings:

func data(using encoding: String.Encoding, allowLossyConversion: Bool = false) -> Data?

The first parameter to this method is of type String.Encoding. This struct declares an encoding as:

static let unicode: String.Encoding

Now suddenly the method can give me a data representation of the String using the encoding .unicode.

This is in fact the opposite of what I concluded after reading the mentioned blog. It is giving me a data representation of a string, even though Unicode itself does not specify how code points are stored.
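One way to see what .unicode actually produces is to dump the bytes next to the explicitly named encodings. This is only a sketch: `hexDump` is a hypothetical helper written for this example, and the sample string "A" is an assumption.

```swift
import Foundation

// Hypothetical helper: renders Data as space-separated hex bytes.
func hexDump(_ data: Data) -> String {
    return data.map { String(format: "%02x", $0) }.joined(separator: " ")
}

let text = "A"  // sample input, chosen for illustration

let encodings: [(String, String.Encoding)] = [
    (".utf8", .utf8),
    (".utf16LittleEndian", .utf16LittleEndian),
    (".utf16BigEndian", .utf16BigEndian),
    (".unicode", .unicode)
]

// Print the bytes each encoding produces for the same string.
for (name, encoding) in encodings {
    if let data = text.data(using: encoding) {
        print(name, hexDump(data))
    }
}
```

The exact bytes printed for .unicode may depend on the platform, so comparing them against the other rows is more informative than assuming a fixed result.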

Can anyone tell me what I am missing here? I am really confused now.

Rohan Bhale
  • I'm curious. Can you encode the string `` using that scheme and dump the resulting octets? – daxim Jul 31 '19 at 10:15
  • That is actually “UTF-16 little endian with byte order marker”  – compare https://stackoverflow.com/questions/4841295/what-does-cocoa-touchs-canonical-nsunicodestringencoding-mean. – Martin R Jul 31 '19 at 10:17
  • @daxim It is unicodeData: . – Rohan Bhale Jul 31 '19 at 10:28
  • @MartinR I confirmed it is UTF-16 with the byte order mark set to fffe, confirming it to be little endian. My next concern is that the answer you linked states "(the same guide goes on to say "That doesn’t necessarily imply anything about their internal storage mechanism", so they're fully reserving the right to change this in future)". So I assume the way the .unicode encoding is treated can change in the future, which would make it unsafe to rely on. – Rohan Bhale Jul 31 '19 at 10:31
  • @MartinR Also it does not make sense to have a unicode encoding. I fear it adds to confusion among learners. Does Apple have some use cases to have this encoding? – Rohan Bhale Jul 31 '19 at 10:34
  • I assume that there are historical reasons for the name, like `unichar` for UTF-16. – Martin R Jul 31 '19 at 10:38
  • .NET has the same issue. It's legacy. See https://stackoverflow.com/questions/52537306/encoding-utf8-or-encoding-unicode. – CodeCaster Jul 31 '19 at 10:42
  • I don't understand what you are actually asking. Data is a way to represent your type so that it can be stored on disk or in any physical memory. Regarding Swift, it is for legacy reasons that they keep a generic Unicode encoding rather than only a particular one like UTF-8. – Rahul Jul 31 '19 at 10:50
  • @Rahul My question was: if we have utf8 and utf16, which define the low-level storage of any Unicode code point, why is there a separate unicode encoding among the available options? Unicode itself does not define the low-level storage of its code points. Unicode-conforming encodings like UTF-8 and UTF-16 take care of defining the low-level storage. – Rohan Bhale Jul 31 '19 at 13:26

1 Answer


String.Encoding.unicode is the same as String.Encoding.utf16:

import Foundation  // String.Encoding is declared in Foundation
print(String.Encoding.unicode)
print(String.Encoding.utf16)

The above prints:

  • Unicode (UTF-16)
  • Unicode (UTF-16)
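Matching descriptions alone could in principle hide two distinct values, so a slightly stronger check is to compare the values themselves. A minimal sketch, relying only on String.Encoding being equatable and exposing a rawValue:

```swift
import Foundation

// Compare the two encodings directly rather than their printed descriptions.
let sameEncoding = String.Encoding.unicode == String.Encoding.utf16
print(sameEncoding)

// The underlying raw values can be compared as well.
let unicodeRaw = String.Encoding.unicode.rawValue
let utf16Raw = String.Encoding.utf16.rawValue
print(unicodeRaw, utf16Raw)
```

If both comparisons hold, .unicode is simply a second name for .utf16 rather than a separate storage scheme.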
Jean-Pierre