140

How many bits or bytes are there per "character"?

Louis Yang
RedKing

2 Answers

282

It depends on what the character is and what encoding it is in:

  • An ASCII character in 8-bit ASCII encoding is 8 bits (1 byte), though it can fit in 7 bits.

  • An ISO-8859-1 character in ISO-8859-1 encoding is 8 bits (1 byte).

  • A Unicode character in UTF-8 encoding is between 8 bits (1 byte) and 32 bits (4 bytes).

  • A Unicode character in UTF-16 encoding is either 16 bits (2 bytes) or 32 bits (4 bytes), though most of the common characters take 16 bits. This is the encoding used by Windows internally.

  • A Unicode character in UTF-32 encoding is always 32 bits (4 bytes).

  • An ASCII character is 8 bits (1 byte) in UTF-8 and 16 bits in UTF-16.

  • The additional (non-ASCII) characters in ISO-8859-1 (0xA0-0xFF) take 16 bits in both UTF-8 and UTF-16.

That means there are between 0.03125 (1/32) and 0.125 (1/8) characters per bit.
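The sizes listed above can be checked with a short Python 3 sketch. The helper name `encoded_bits` and the sample characters are illustrative choices, not part of the original answer; the little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that no byte-order mark is counted.

```python
# Sample characters: ASCII "A", Latin-1 "é", the euro sign, and an emoji
# outside the Basic Multilingual Plane (all chosen for illustration).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_bits(ch, encoding):
    """Return the number of bits one character occupies in `encoding`."""
    return len(ch.encode(encoding)) * 8


for ch in SAMPLES:
    sizes = {enc: encoded_bits(ch, enc) for enc in ENCODINGS}
    print(f"U+{ord(ch):04X}", sizes)
```

For the single-byte encodings, `encoded_bits("A", "ascii")` and `encoded_bits("\u00e9", "iso-8859-1")` both give 8, while the same `é` takes 16 bits in UTF-8.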

Rosh Oxymoron
20

There are 8 bits in a byte (normally speaking in Windows).

However, if you are dealing with characters, it will depend on the charset/encoding. A Unicode character in UTF-16 can be 2 or 4 bytes, so that would be 16 or 32 bits, whereas a character in Windows-1252 (sometimes incorrectly called ANSI) is only 1 byte, so 8 bits.

In Asian versions of Windows and some others, the entire system runs in double-byte, so a character is 16 bits.

EDITED

Per Matteo's comment, all contemporary versions of Windows use 16-bit units (UTF-16) per character internally, so a character takes one or two such units.
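As a rough illustration of the difference between a single-byte code page and UTF-16, the following Python 3 sketch counts bytes in Windows-1252 and 16-bit code units in UTF-16. The helper name `utf16_code_units` is hypothetical, and Python's `cp1252` codec is assumed here as a stand-in for the Windows-1252 code page.

```python
def utf16_code_units(ch):
    """Count the 16-bit code units one character needs in UTF-16."""
    # utf-16-le is used so that no byte-order mark is included.
    return len(ch.encode("utf-16-le")) // 2


euro = "\u20ac"       # euro sign, present in Windows-1252 as byte 0x80
emoji = "\U0001F600"  # outside the BMP, so UTF-16 needs a surrogate pair

cp1252_bytes = len(euro.encode("cp1252"))
print(cp1252_bytes, utf16_code_units(euro), utf16_code_units(emoji))
# prints: 1 1 2
```

The emoji cannot be encoded in `cp1252` at all (Python raises `UnicodeEncodeError`), which is one reason a fixed one-byte-per-character assumption breaks down.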

RichardTheKiwi
  • Some legacy apps still use 1-byte chars with local codepages, but all NT versions of Windows internally run with 2-byte characters (UCS-2 up to NT4, UTF-16 from Windows 2000 onwards, stored as `wchar_t`), not only Asian ones, and so should all newer applications. (On Linux, by contrast, it's a completely different story, since UTF-8 is usually used throughout the whole system.) – Matteo Italia Jan 31 '11 at 11:31
  • @Matteo: Note that in Windows, double-byte is not necessarily the same thing as Unicode. [Reference](http://msdn.microsoft.com/en-us/library/cc194788.aspx) – Cody Gray - on strike Jan 31 '11 at 11:36
  • @Cody Gray: yes, usually when you read "double-byte" encoding it's legacy Asian stuff, and those strings are stored as multiple `char`, while Unicode strings are stored using the `wchar_t` type. By the way, when NT was started a `wchar_t` was enough to avoid surrogate pairs, but now that it's UTF-16 even `wchar_t` strings can have variable-length characters, so on Windows a Unicode character can take from 2 to 4 bytes (1 or 2 `wchar_t`). – Matteo Italia Jan 31 '11 at 11:42
  • @Matteo: Yeah, I agree with you. I think I saw something that suggested differently before you edited your first comment, and that's when I wrote mine. UTF-16 Unicode strings are used internally now for all versions of Windows. – Cody Gray - on strike Jan 31 '11 at 11:44
  • @Cody Gray: I tend to edit my comments a bit too much, it leads to confusion `:)` – Matteo Italia Jan 31 '11 at 11:45