3

I was asked this question during an interview with a famous IT company. They asked me to suggest how a character encoding will be implemented if we have lots of characters & 16 bits of Unicode are not enough. I answered we can implement 64 bit encoding for characters. They said, even it's not enough, to which I suggested to implement a encoding via java BigInteger.

Then they asked the encoding should be such that it only takes the bits that are needed. Like ASCII representation of A is 01000001 , we should not be using the leading 0 because we don't need it and we are wasting memory. I could not give an answer to it. If you could please tell me about how to approach this problem and how it is handled.

bpjoshi
  • 1,091
  • 16
  • 23
  • 1
    You could study how it's handled in the numerous ways of encoding Unicode. – Kayaman Aug 03 '17 at 10:19
  • 1
    Maybe interesting to you: https://stackoverflow.com/questions/5290182/how-many-bytes-does-one-unicode-character-take?rq=1 –  Aug 03 '17 at 10:31
  • How is a 64-bit encoding "not enough" when the highest codepoint defined in Unicode fits in 21 bits? Even 16 bits is enough, when used in pairs. – Remy Lebeau Aug 05 '17 at 22:03

1 Answers1

1

See the Unicode Standard, Chapter 3: "The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. The size of the code unit is specified for each encoding form. This section presents the formal definition of each of these encoding forms."

As regards the question on saving bits, this is meaningful only when the text is very large, in which case I would suggest using compression, such as zip. There are solutions in various languages that let you read from and write to a compressed file directly.

Jonathan Rosenne
  • 2,159
  • 17
  • 27
  • It was an interview question which he failed to answer. – Kayaman Aug 03 '17 at 10:41
  • I have met several experienced programmers who would find themselves in a similar position, so I thought it would be worthwhile to provide my views. – Jonathan Rosenne Aug 03 '17 at 10:44
  • Are you sure they were experienced programmers and not just saying that. Certain people have a tendency to embellish their experience and knowledge. – Kayaman Aug 03 '17 at 10:48
  • Yes. Character encoding is not a simple matter, especially when the language you wish to encode is not English. It has become simpler with the wide spread use of Unicode, but there are still quite a lot of legacy encodings in common use. – Jonathan Rosenne Aug 03 '17 at 10:52
  • Encoding is indeed not a simple thing (I keep answering questions about it every few weeks). However suggesting `BigInteger` when 64 bits isn't enough means he failed to understand what the question was all about. I would expect experienced developers to have *some* understanding of character encoding, even if we nowadays are pretty much on UTF-8 everywhere. – Kayaman Aug 03 '17 at 11:11