How many bits or bytes are there in a character?

Question

How many bits or bytes are there per "character"?

Rosh Oxymoron · Answer 1 · 2011-01-31T11:53:42.897

It depends on the character and on the encoding it is stored in:

An ASCII character in 8-bit ASCII encoding is 8 bits (1 byte), though it can fit in 7 bits.
An ISO-8859-1 character in ISO-8859-1 encoding is 8 bits (1 byte).
A Unicode character in UTF-8 encoding is between 8 bits (1 byte) and 32 bits (4 bytes).
A Unicode character in UTF-16 encoding is either 16 bits (2 bytes) or 32 bits (4 bytes), though most of the common characters take 16 bits. This is the encoding used by Windows internally.
A Unicode character in UTF-32 encoding is always 32 bits (4 bytes).
An ASCII character in UTF-8 is 8 bits (1 byte), and in UTF-16 it is 16 bits (2 bytes).
The additional (non-ASCII) characters in ISO-8859-1 (0xA0-0xFF) take 16 bits (2 bytes) in both UTF-8 and UTF-16.

That would mean that there are between 0.03125 (1/32) and 0.125 (1/8) characters in a bit.
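The sizes listed above can be checked empirically. The following is a minimal Python sketch, not part of the original answer: the helper name `encoded_size` and the four sample characters are chosen purely for illustration. It uses the `-le` codec variants so that Python does not prepend a byte order mark, which would otherwise inflate the UTF-16 and UTF-32 counts.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = {
    "A": "ASCII letter (U+0041)",
    "\u00e9": "ISO-8859-1 letter e-acute (U+00E9)",
    "\u20ac": "euro sign, outside ISO-8859-1 (U+20AC)",
    "\U0001F600": "emoji outside the Basic Multilingual Plane (U+1F600)",
}

# The "-le" variants fix the byte order, so no byte order mark is added.
encodings = ["ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_size(ch, encoding):
    """Return the byte length of ch in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


for ch, label in samples.items():
    sizes = {enc: encoded_size(ch, enc) for enc in encodings}
    print(label, sizes)
```

A value of `None` means the character simply does not exist in that encoding, which is why "bytes per character" only has an answer once both the character and the encoding are fixed.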

This answer helps a lot when dealing with sockets, encoding, text and so on. — Mário Meyrelles, Jun 28 '16 at 14:21

RichardTheKiwi · Answer 2 · 2011-01-31T11:36:32.037


There are normally 8 bits in a byte, including on Windows.

However, if you are dealing with characters, it depends on the charset/encoding. A Unicode character in UTF-16 can be 2 or 4 bytes, so 16 or 32 bits, whereas a character in Windows-1252 (sometimes incorrectly called ANSI) is only 1 byte, so 8 bits.

In Asian versions of Windows and some others, the entire system runs in double-byte, so a character is 16 bits.

EDITED

Per Matteo's comment, all contemporary versions of Windows use 16 bits per character internally.

edited Jan 31 '11 at 11:36

answered Jan 31 '11 at 11:19

RichardTheKiwi


Some legacy apps still use 1-byte chars with local codepages, but all NT versions of Windows, not only the Asian ones, internally run with 2-byte characters (UCS-2 up to NT4, UTF-16 from Windows 2000 onwards, stored as `wchar_t`), and so should all newer applications. (On Linux it's a completely different story, since UTF-8 is usually used throughout the whole system.) – Matteo Italia Jan 31 '11 at 11:31
@Matteo: Note that in Windows, double-byte is not necessarily the same thing as Unicode. [Reference](http://msdn.microsoft.com/en-us/library/cc194788.aspx) – Cody Gray - on strike Jan 31 '11 at 11:36
@Cody Gray: yes, usually when you read "double-byte" encoding it's legacy Asian stuff, and those strings are stored as multiple `char`, while Unicode strings are stored using the `wchar_t` type. By the way, when NT was started a `wchar_t` was enough to avoid surrogate pairs, but now that it's UTF-16 even `wchar_t` strings can have variable-length characters, so on Windows a Unicode character can take from 2 to 4 bytes (1 or 2 `wchar_t`). – Matteo Italia Jan 31 '11 at 11:42
@Matteo: Yeah, I agree with you. I think I saw something that suggested differently before you edited your first comment, and that's when I wrote mine. UTF-16 Unicode strings are used internally now for all versions of Windows. – Cody Gray - on strike Jan 31 '11 at 11:44
@Cody Gray: I tend to edit my comments a bit too much, it leads to confusion `:)` – Matteo Italia Jan 31 '11 at 11:45
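
The point made in the comments about surrogate pairs (a character taking 1 or 2 16-bit units in UTF-16) can be illustrated with a short Python sketch. This is an illustration only, not Windows code: the helper name `utf16_code_units` is hypothetical, and each 16-bit value it returns corresponds to what one `wchar_t` would hold on Windows.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    count = len(raw) // 2           # every code unit is exactly 2 bytes
    return list(struct.unpack("<%dH" % count, raw))


# U+20AC (euro sign) lies inside the Basic Multilingual Plane: one code unit.
print([hex(u) for u in utf16_code_units("\u20ac")])

# U+1F600 lies outside it: a high surrogate followed by a low surrogate.
print([hex(u) for u in utf16_code_units("\U0001F600")])
```

Counting code units therefore gives the string's length in `wchar_t` elements, which is not necessarily the number of characters it contains.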
