
I know that UTF-16 is a self-synchronizing encoding scheme. I also read the Wikipedia article below, but did not quite get it.

Self-synchronizing code

Can you please explain it to me with an example in UTF-16?


1 Answer

In UTF-16, characters outside of the BMP are represented using a surrogate pair, in which the first code unit (CU) lies in 0xD800–0xDBFF and the second in 0xDC00–0xDFFF. Each of the two CUs carries 10 bits of the code point (after 0x10000 has been subtracted from it). Characters in the BMP are encoded as themselves, in a single CU.
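
A minimal sketch of that split into two 10-bit halves (the code point U+1F600 is just a convenient example of a character outside the BMP):

```python
import struct

# Sketch: encode U+1F600 (outside the BMP) as a UTF-16 surrogate pair.
cp = 0x1F600
v = cp - 0x10000              # 20-bit value: 0x0F600
high = 0xD800 + (v >> 10)     # top 10 bits    -> first CU
low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> second CU
assert (high, low) == (0xD83D, 0xDE00)

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM):
assert '\U0001F600'.encode('utf-16-le') == struct.pack('<HH', high, low)
```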

Now the synchronization is easy. Given the position of any arbitrary code unit:

  • If the code unit is in the 0xD800–0xDBFF range, it's the first code unit of two; just read the next one and decode. Voilà, we have a full character outside of the BMP.
  • If the code unit is in the 0xDC00–0xDFFF range, it's the second code unit of two; just go back one unit to read the first part, or advance to the next unit to skip the current character.
  • If it's in neither of those ranges, then it's a character in the BMP. We don't need to do anything more.
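
The rules above fit in a few lines of code. A sketch (function name and the sample string are mine, purely illustrative): given an arbitrary code-unit index, find the start of the character containing it by looking only at the current unit's value:

```python
# Sketch: resynchronize from an arbitrary code-unit index in UTF-16.
def char_start(units, i):
    """units: sequence of 16-bit code units; i: arbitrary index.
    Returns the index where the character containing units[i] starts."""
    if 0xDC00 <= units[i] <= 0xDFFF:   # low (trailing) surrogate
        return i - 1                   # step back to the high surrogate
    return i                           # high surrogate or BMP unit: already a start

# "A" + U+1F600 + "B" as code units: 0x0041, 0xD83D, 0xDE00, 0x0042
units = [0x0041, 0xD83D, 0xDE00, 0x0042]
assert [char_start(units, i) for i in range(4)] == [0, 1, 1, 3]
```

No matter where we land, one comparison (and at most one step back) puts us on a character boundary.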

In UTF-16 the CU is the unit, i.e. the smallest element: we work at the CU level and read CUs one by one instead of byte by byte. Because of that, along with historical reasons, UTF-16 is only self-synchronizing at the CU level.

The point of self-synchronization is to know immediately whether we're in the middle of something, instead of having to read again from the start and check. UTF-16 allows us to do that.

Since the ranges for the high surrogates, low surrogates, and valid BMP characters are disjoint, it is not possible for a surrogate to match a BMP character, or for (parts of) two adjacent characters to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units. UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).

https://en.wikipedia.org/wiki/UTF-16#Description
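
The disjointness the quote mentions can be checked directly. A sketch (illustrative only): a unit-by-unit search for the BMP character 'B' can never accidentally match inside the surrogate pair of a non-BMP character, because neither half of the pair is a BMP value:

```python
# Sketch: surrogate ranges are disjoint from BMP values, so a naive
# code-unit search for a BMP character cannot match inside a surrogate pair.
data = 'A\U0001F600B'.encode('utf-16-le')   # A, a non-BMP char (pair), B
units = [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]
assert units == [0x0041, 0xD83D, 0xDE00, 0x0042]

# Searching for 'B' (0x0042) hits only the real character; 0xD83D and
# 0xDE00 can never equal a BMP code unit.
assert units.index(0x0042) == 3
```

This is exactly what Shift JIS lacks: there, the trail byte of one character can equal the lead byte (or an ASCII byte) of another, so a byte-level search can land mid-character.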

Of course that means UTF-16 may not be suitable for working over a medium without error correction/detection, like a bare network environment. However, in a proper local environment it's a lot better than working without self-synchronization. For example, in DOS/V for Japanese, every time you press Backspace the editor must iterate from the start to find out which character was deleted, because in the awful Shift JIS encoding there's no way to know how long the character before the cursor is without a length map.
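
A sketch of what that buys a hypothetical UTF-16 editor (function name and buffer are mine): Backspace needs at most one extra look-back, never a reparse from the start:

```python
# Sketch: constant-time Backspace on a UTF-16 code-unit buffer.
def backspace(units, pos):
    """Remove the character ending just before code-unit index pos."""
    start = pos - 1
    if 0xDC00 <= units[start] <= 0xDFFF:  # low surrogate: char is 2 units long
        start -= 1
    return units[:start] + units[pos:]

units = [0x0041, 0xD83D, 0xDE00]          # "A" + one non-BMP character
assert backspace(units, 3) == [0x0041]             # deletes the whole pair
assert backspace(units, 1) == [0xD83D, 0xDE00]     # deletes "A"
```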

  • Do you know of any practical use of this? I can't think of any use case for jumping into the middle of a sequence of code units. Splitting text based on positions should be at grapheme boundaries. – Tom Blodget Sep 09 '18 at 11:13
  • @TomBlodget the usage is already in the Wikipedia quote above. Random reads are more common than you think. For example, if we need to get the last few characters/words of a file, then self-synchronization allows us to do that quickly instead of parsing from the start. `grep` also works by searching the whole file instead of breaking it into sentences, and when it finds a match it'll [back off and find the newline character](https://stackoverflow.com/q/12629749/995714) to confirm and print the line. That wouldn't be possible without self-synchronization – phuclv Sep 09 '18 at 11:33
  • 1
    Text editors can also utilized self-synchronization to work more efficiently on huge files. For example if you delete a character it knows immediately what should be the output. In old DOS/V for Japanese, due to the lack of self-synchronization, everytime you press backspace then it'll have to go from the start and reparse the whole buffer, since it doesn't know how many bytes should it walk back to get the previous character – phuclv Sep 09 '18 at 11:38
  • What is the advantage of being self-synchronizing on 16-bit words and not at a random byte? We read the string byte by byte – dinesh ranawat Sep 21 '18 at 05:08
  • Why on earth do you need that? It's easy to check whether a byte is the first or the second one in a code unit, since one has an odd address and the other an even one, but then you'd have to combine those bytes to get a single 2-byte unit anyway. Therefore it's just worthless, because the unit is already 2 bytes long. That's the minimum you need to work with – phuclv Sep 21 '18 at 05:14
  • 1
    But where does the data in the buffer come from? Do you know whether or not a byte was dropped? Suppose one wasn't. Which byte is the LSB, the even or the odd byte? Can't rely on added BOM, since if you've ever worked on a NT-based OS, you'd know that many system files don't include the BOM. – Astara Jul 09 '19 at 08:03
  • @Astara if you communicate with others you have to agree on an encoding, either by convention or via a property during handshaking. The BOM is just used for random files without a clear source. And if a byte was dropped, the size would be odd and the parser would know it right away – phuclv Jul 09 '19 at 11:00
  • 1
    BOM's are not required and less used in network protocols, Unless you transfer a file as an object, you aren't likely to see one. If a byte was dropped out of your 1TB file, you would throw away the whole thing? You could easily lose 2 bytes spaced widely apart, and the bytes would be swapped between them. But the point wasn't on whether or not you could agree on a transfer format, but whether or not the protocol was self-synchronizing. At the basic level of communication (the byte) it is not. – Astara Jul 10 '19 at 15:30
  • 1
    @Astara so you're complaining with me why UTF-16 isn't self-synchronizable at byte level? I'm not the one who designed that encoding scheme and I don't care about that – phuclv Jul 10 '19 at 15:33
  • 1
    Well, where data comes in Bytes not words, it is something to care about. Original question just asked how it was self-synchronizing. In Byte oriented data, it isn't. – Astara Jul 10 '19 at 15:41
  • 1
    @Astara then read the question again. There's not a single "byte" word in it. The OP doesn't care about it and just asks how UTF-16 achieves self-synchronization – phuclv Jul 10 '19 at 15:53