3

I haven't found much concise information about when exactly to use Unicode. I understand that many say the best practice is to always use Unicode, but Unicode strings do have a larger memory footprint. Am I correct to say that Unicode needs to be used only when:

  • Printing something to the screen for anything other than local use (for example, debugging).
  • Sending any type of text across a network when the two ends are in different locales or countries.
  • You're not sure which to use.

I think it would be beneficial if someone concisely explained the basics of what actually happens with Unicode. Am I correct to say that things get messy when:

  • the physical (byte) string gets sent to a machine whose representation of strings (code page or otherwise; this is already a detail, although an interesting one) differs from the sender's.

The context is using Unicode in a programming language (say C++), but I hope answers to this question can be used for any encoding situation.
Also, I'm aware that Unicode and NLS are not the same thing, but is it correct to say that NLS implies the use of Unicode?

P.S. awesome site

Kharski

3 Answers

5

Always use Unicode, it will save you and others a lot of pain.

What you may be confusing this with is the issue of encoding. Unicode strings do not necessarily take more memory than the equivalent ASCII (or other encoding) strings; that depends a lot on the encoding used.
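As a rough illustration of the memory point, here is a minimal sketch assuming Python 3 and its built-in codecs. The sample string is made up for the example, and the `-le` codec variants are used only so that no byte order mark is counted.

```python
# Sketch: the byte size of the same text depends on the encoding chosen,
# not on whether the text "is Unicode". The sample string is hypothetical.
text = "Hello, world"  # 12 characters, all within the ASCII range

sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}

for encoding, size in sizes.items():
    print(f"{encoding:10} {size} bytes")
```

For this ASCII-only text, UTF-8 needs the same 12 bytes as ASCII, while UTF-16 needs 24 and UTF-32 needs 48.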

Sometimes "Unicode" is used as a synonym for "UCS-2" or "UTF-16". Strictly speaking that use is wrong, because "Unicode" is the standard that defines the set of characters and their Unicode code points. It does not as such define a mapping to bytes (or words). UTF-16, UTF-8 and other encodings take over the job of mapping the characters to concrete bytes.
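The split between code point and encoding can be seen in a small sketch like the following, again assuming Python 3. The choice of the euro sign is arbitrary, and the big-endian codec variants are picked only to keep the byte order readable.

```python
# Sketch: Unicode assigns a character its code point; an encoding then
# maps that code point to concrete bytes.
ch = "\u20ac"  # EURO SIGN, written as an escape to sidestep source-encoding issues

code_point = ord(ch)
encoded = {
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
    "utf-32-be": ch.encode("utf-32-be"),
}

print(f"U+{code_point:04X}")
for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
```

The code point is U+20AC in every case; only the bytes differ (e282ac, 20ac and 000020ac respectively).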

Joachim Sauer
  • Absolutely right about needing no more space for ASCII strings encoded as UTF-8, which is how most Unicode text is transmitted or stored on disk. – andrewmu Oct 24 '11 at 10:18
  • @Joachim Sauer: So if I use Unicode-capable data types in databases, they will not take more space than normal strings? I asked a question similar to this one here: http://stackoverflow.com/questions/7860643/to-use-unicode-or-not-in-web-development-project-using-flask-and-sqlalchemy – codecool Oct 27 '11 at 11:56
  • @codecool: that depends on what encoding your database uses. If it uses UTF-8, then it *won't* need more space for text that can also be represented in ASCII (i.e. most English text). – Joachim Sauer Oct 27 '11 at 12:01
  • @JoachimSauer MySQL stores UTF-8 in a space that's large enough to hold the maximum size of that number of characters; that is, it takes 3 or 4 times as much space as ASCII or Latin-1 (3 for utf8 and 4 for utf8mb4). – prosfilaes Dec 30 '12 at 05:17
4

The beauty of Unicode is that it frees you from restrictions and a lot of headaches. Unicode is the largest character set available to date: it lets you encode and use virtually any character of any halfway mainstream language in use today. With any other character set you need to think about whether it can actually encode a given character. Latin-1 cannot encode the character "あ", Shift-JIS cannot encode the character "ڥ", and so on. Only if you're very sure you will never need anything other than basic Latin, Arabic, Japanese, or whatever other subset of characters should you choose a specialized encoding such as Latin-1, Big5, Shift-JIS or ASCII.
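To check the two examples above, a minimal sketch along these lines should do, assuming Python 3 and its bundled `latin-1` and `shift_jis` codecs. The helper name `can_encode` is made up for this illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


hiragana = "あ"  # the Japanese character from the paragraph above
arabic = "ڥ"     # the Arabic-script character from the paragraph above

# Each legacy character set covers only its own subset of characters.
for encoding in ("latin-1", "shift_jis", "utf-8"):
    print(encoding, can_encode(hiragana, encoding), can_encode(arabic, encoding))
```

Only the UTF-8 row reports that both characters can be encoded; Latin-1 handles neither, and Shift-JIS handles only the Japanese one.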

Unicode is the most versatile charset available and therefore a good standard to adhere to.

The Unicode encodings are nothing special; they're just a little more complex in their bit representation, since they have to encode many more characters while still trying to be space efficient. For a very detailed excursion into this topic, please see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

deceze
1

I have a little utility that is sometimes helpful for seeing the difference between character encodings: http://sodved.awardspace.info/unicode.pl. If you paste ö into the Raw (UTF-8) field, you will see that it is represented by different byte sequences in different encodings. And as the other two good answers describe, some non-Unicode encodings cannot represent it at all.
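If the page is not reachable, roughly the same comparison can be sketched locally. This assumes Python 3 and is not the code behind the utility; it also shows what happens when the UTF-8 bytes are read back with the wrong encoding.

```python
ch = "\u00f6"  # ö, LATIN SMALL LETTER O WITH DIAERESIS

as_latin1 = ch.encode("latin-1")
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")

# Decoding with the wrong encoding does not raise here; it silently
# produces different characters.
misread = as_utf8.decode("latin-1")

print(as_latin1.hex(), as_utf8.hex(), as_utf16.hex())
print([f"U+{ord(c):04X}" for c in misread])
```

The same character comes out as f6, c3b6 and 00f6, and the misread UTF-8 bytes turn into the two characters U+00C3 and U+00B6, which display as "Ã¶".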

Sodved
  • Seems nice but can't check from office unfortunately : Trend Micro OfficeScan Event URL Blocked The URL that you are attempting to access is a potential security risk. Trend Micro OfficeScan has blocked this URL in keeping with network security policy. URL: http://sodved.awardspace.info/unicode.pl Risk Level: Dangerous Details: Verified fraud page or threat source – Kharski Nov 12 '12 at 12:46
  • Awardspace is just a free hosting site. I guess someone else has done dodgy stuff there in the past. – Sodved Nov 14 '12 at 04:04