35

What exactly are Unicode character codes, and how are they different from ASCII characters?

CharlesB
Ghost

2 Answers

54

Unicode is a way to assign unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. There are many ways to encode Unicode strings as bytes, such as UTF-8 and UTF-16.
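To make the distinction between a code point and its encoded bytes concrete, here is a minimal Python sketch. The helper name `code_points` and the sample string are only illustrative, not part of any library.

```python
def code_points(text):
    """Return the code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Sample string: 'a', 'e' with acute accent, euro sign (written as escapes
# so the snippet does not depend on the source file's encoding).
sample = "a\u00e9\u20ac"

print(code_points(sample))               # ['U+0061', 'U+00E9', 'U+20AC']

# The same three code points become different byte sequences
# depending on which encoding is chosen.
print(sample.encode("utf-8").hex())      # 61c3a9e282ac
print(sample.encode("utf-16-be").hex())  # 006100e920ac
```

The code points stay the same in both cases; only the byte representation changes.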

ASCII assigns values only to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters).

For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
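As a quick check of that claim, the following Python sketch decodes each of the 128 ASCII values and compares it with the resulting character's code point. The function name `ascii_matches_unicode` is made up for this example.

```python
def ascii_matches_unicode():
    """Check that every ASCII value equals the code point of its character."""
    for value in range(128):
        ch = bytes([value]).decode("ascii")  # interpret the byte as ASCII
        if ord(ch) != value:                 # ord() gives the Unicode code point
            return False
    return True

print(ascii_matches_unicode())  # True
print(ord("a"))                 # 97, the same as the ASCII value of 'a'
```

Note that this compares code points, not encoded bytes, so it holds regardless of which encoding is later used to store the text.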

In most modern applications you should prefer Unicode strings to ASCII. This will, for example, let users have accented characters in their names or addresses, and let you localize your interface to languages other than English.

Mark Byers
  • But how do I use Unicode characters? I can use the ASCII characters simply by typecasting the chars into ints, but can I do the same with Unicode characters? – Ghost Apr 28 '12 at 07:53
  • @Ghost: It's not clear why you need to cast characters to ints. What are you trying to do? – Mark Byers Apr 28 '12 at 08:15
  • What I'm saying is that to get the ASCII value of a character I need to typecast it: char a = 'a'; int b = (int) a; // ASCII value of a – Ghost Apr 28 '12 at 08:37
  • @Ghost: That code you just posted gives you the Unicode code point of the character. This is the same as the ASCII value for those characters that have an ASCII value. It's rare that you actually need to care what exact code point value a specific character has. – Mark Byers Apr 28 '12 at 08:40
  • @MarkByers: Careful, it depends on the programming language. Several use UTF-16 as their internal representation, so casting to int might give you only half of a surrogate pair. – Joe Hildebrand May 29 '14 at 16:08
  • @MarkByers: "For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same." This only applies to UTF-8, right? – David Zheng Jul 08 '16 at 21:24
14

The first 128 Unicode code points are the same as ASCII. Beyond those, Unicode defines well over 100,000 more characters.

There are two common encodings for Unicode: UTF-8, which uses 1 to 4 bytes for each code point (so for the first 128 characters, UTF-8 is exactly the same as ASCII), and UTF-16, which uses 2 or 4 bytes.
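The byte counts above can be observed directly. This is a small Python sketch; the helper `encoded_sizes` is a hypothetical name, and `utf-16-be` is used instead of plain `utf-16` so that no byte order mark is added to the count.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# 'a', 'e' with acute accent, euro sign, and an emoji outside the
# Basic Multilingual Plane (written as escapes for portability).
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# U+0061 (1, 2)
# U+00E9 (2, 2)
# U+20AC (3, 2)
# U+1F600 (4, 4)
```

The last character needs 4 bytes in UTF-16 because it is stored as a surrogate pair of two 2-byte units. That is why, in languages whose strings are sequences of UTF-16 units, reading a single unit may give only half of such a character.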

CodeClown42