144

My teacher told me ASCII is an 8-bit character coding scheme. But it is defined only for codes 0-127, which means it fits into 7 bits. So can't it be argued that ASCII is actually a 7-bit code?

And what do we even mean when we say that ASCII is an 8-bit code?

Peter Mortensen
Anurag Kalia

7 Answers

135

ASCII was indeed originally conceived as a 7-bit code. This was done well before 8-bit bytes became ubiquitous, and even into the 1990s you could find software that assumed it could use the 8th bit of each byte of text for its own purposes ("not 8-bit clean"). Nowadays people think of it as an 8-bit coding in which bytes 0x80 through 0xFF have no defined meaning, but that's a retcon.

There are dozens of text encodings that make use of the 8th bit; they can be classified as ASCII-compatible or not, and fixed- or variable-width. ASCII-compatible means that regardless of context, single bytes with values from 0x00 through 0x7F encode the same characters that they would in ASCII. You don't want to have anything to do with a non-ASCII-compatible text encoding if you can possibly avoid it; naive programs expecting ASCII tend to misinterpret them in catastrophic, often security-breaking fashion. They are so deprecated nowadays that (for instance) HTML5 forbids their use on the public Web, with the unfortunate exception of UTF-16. I'm not going to talk about them any more.
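To make "ASCII-compatible" concrete, here is a small Python sketch (the particular codec names are just examples): bytes in the range 0x00-0x7F decode to the same text under any ASCII-compatible encoding, while a non-ASCII-compatible encoding such as UTF-16 treats them completely differently.

```python
# Bytes confined to 0x00-0x7F mean the same thing under every
# ASCII-compatible encoding, regardless of context.
data = b"plain ASCII text"
for codec in ("ascii", "utf-8", "cp1252", "iso8859-1", "koi8-r"):
    assert data.decode(codec) == "plain ASCII text"

# UTF-16 is not ASCII-compatible: "A" is not a lone 0x41 byte.
print("A".encode("utf-16-le"))   # b'A\x00' -- two bytes per character
```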

A fixed-width encoding means what it sounds like: all characters are encoded using the same number of bytes. To be ASCII-compatible, a fixed-width encoding must encode all its characters using only one byte, so it can have no more than 256 characters. The most common such encoding nowadays is Windows-1252, an extension of ISO 8859-1.
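For instance, a quick Python sketch (the sample characters are arbitrary) shows Windows-1252 behaving as a fixed-width, ASCII-compatible encoding: every character is exactly one byte, the ASCII range is untouched, and the additions sit in 0x80-0xFF.

```python
# Windows-1252 is fixed-width: one byte per character, no exceptions.
samples = "A é € ñ"                     # a mix of ASCII and non-ASCII characters
encoded = samples.encode("cp1252")
assert len(encoded) == len(samples)     # as many bytes as characters

# The ASCII range is unchanged; the additions sit above 0x7F.
assert "A".encode("cp1252") == b"\x41"
assert "€".encode("cp1252") == b"\x80"  # euro sign, one of the cp1252 extensions
assert "é".encode("cp1252") == b"\xe9"
```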

There's only one variable-width ASCII-compatible encoding worth knowing about nowadays, but it's very important: UTF-8, which packs all of Unicode into an ASCII-compatible encoding. You really want to be using this if you can manage it.
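A short Python sketch of what variable-width means in practice (the sample characters are arbitrary):

```python
# Pure ASCII text encodes to exactly the same bytes under UTF-8 and ASCII...
assert "hello".encode("utf-8") == "hello".encode("ascii")

# ...while characters outside ASCII take two, three, or four bytes each.
for ch in ("A", "é", "€", "🐍"):
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# A 1 byte(s)
# é 2 byte(s)
# € 3 byte(s)
# 🐍 4 byte(s)
```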

As a final note, "ASCII" nowadays takes its practical definition from Unicode, not its original standard (ANSI X3.4-1968), because historically there were several dozen variations on the ASCII 127-character repertoire -- for instance, some of the punctuation might be replaced with accented letters to facilitate the transmission of French text. All of those variations are obsolete, and when people say "ASCII" they mean that the bytes with value 0x00 through 0x7F encode Unicode codepoints U+0000 through U+007F. This will probably only matter to you if you ever find yourself writing a technical standard.
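In code, that practical definition reduces to a check like the following (a Python sketch; the is_ascii helper is just for illustration, not anything from a standard):

```python
# "Is this valid ASCII?" simply means: is every byte in the range 0x00-0x7F?
def is_ascii(data: bytes) -> bool:
    return all(b <= 0x7F for b in data)

# Each such byte value is, by definition, the Unicode code point
# U+0000 through U+007F with the same number.
assert is_ascii(b"Hello, world!")
assert not is_ascii("café".encode("utf-8"))   # contains bytes >= 0x80
assert ord("A") == 0x41 and "A".encode("ascii") == b"\x41"
```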

If you're interested in the history of ASCII and the encodings that preceded it, start with the paper "The Evolution of Character Codes, 1874-1968" (samizdat copy at http://falsedoor.com/doc/ascii_evolution-of-character-codes.pdf) and then chase its references (many of which are not available online and may be hard to find even with access to a university library, I regret to say).

Arlo
zwol
  • So is ASCII nowadays 7-bit or 8-bit? You say it uses 0x00-0x7F now, obviously. But do we count the leading 0? – Anurag Kalia Feb 04 '13 at 17:49
  • That depends on what kind of pedant you are. The specification that still officially defines ASCII (ANSI X3.4-1968) describes it as a 7-bit encoding, but nobody transmits 7-bit bytes anymore, and interoperability nowadays dictates that the eighth bit must be zero -- you can't use it for a parity bit or similar. So it is equally valid IMNSHO to describe ASCII as an eight-bit encoding that happens to leave the upper half of its number space as "reserved, do not use". Either way, if you transmit eight-bit bytes any of which have their high bit set, you are *not* transmitting valid ASCII. – zwol Feb 04 '13 at 19:12
  • (... but you might be transmitting valid something-else, like UTF-8 or ISO 8859-1 or KOI8-R.) – zwol Feb 04 '13 at 19:32
  • I couldn't understand this answer earlier, but now it makes perfect sense. The same word ASCII has had its meaning changed over time. Am I correct? (Sorry for the late reply; it just didn't occur to me earlier.) – Anurag Kalia Aug 04 '13 at 20:29
  • @AnuragKalia The ASCII *standard* only defines 7 bits (i.e. values 0-127). But because computers standardized on 8-bit bytes, there's a whole extra bit in there which effectively doubles the available values (and therefore characters) to 256. Because ASCII doesn't define those higher-order values, various competing standards (called code pages) arose with different higher-order characters, most of which include standard ASCII as the lower-order 7 bits. – devios1 Nov 21 '14 at 19:34
  • This is why you used to (not so much anymore, thanks to Unicode) get gibberish characters sometimes when viewing extended characters on different machines. – devios1 Nov 21 '14 at 19:35
  • To be really pedantic, the standard is now INCITS 4-1986[R2012] because ASC [formerly known as](https://en.wikipedia.org/wiki/Prince_%28musician%29) X3 mutated into NCITS then INCITS. But the 7-bit variants with about a dozen accented letters for French, German, Spanish, etc. are not ANSI/INCITS anything, rather **ISO/IEC 646** and ECMA-6. And it is 8-bit (ISO/IEC) 8859-1 that forms the first 256-char block of Unicode. – dave_thompson_085 Dec 27 '15 at 21:37
  • @dave_thompson_085 Not everyone is as pedantic as you -- which means you can find older technical documentation, and even standards, that reference "ASCII", or even "X3.4-1968", intending to *include* the national variants, or at least not clearly ruling it out, leading to arguments. Therefore, I personally would use Unicode as the normative reference for ASCII if I had to write a spec where it mattered. That's all I meant. – zwol Dec 28 '15 at 21:56
  • I was curious about the conception of ASCII as being 7-bit. I guess this has to do with the registers on the old computers? I'd like to see a reference to learn more about that if anyone has a good one. All I found was this link, where there aren't any details: http://edition.cnn.com/TECH/computing/9907/06/1963.idg/ –  Jan 06 '17 at 06:32
  • @JulianCienfuegos Sorry I didn't notice your question till now. I can't answer it definitively, but the right place to look for an answer is the early history of serial communications -- telegraph, teletype, etc. Starting points (courtesy of my friend Leonard): http://falsedoor.com/doc/ascii_evolution-of-character-codes.pdf and C.E. Mackenzie's book "Coded Character Sets, History and Development" (1979) – zwol Oct 22 '17 at 15:43
  • It is like if I invent a game you play with your fingers. The game requires 4 fingers, but you have 5. Just don't use one of the fingers, and you are all set to play. – Rafael Eyng Sep 21 '21 at 00:38
24

On Linux man ascii says:

ASCII is the American Standard Code for Information Interchange. It is a 7-bit code.

Peter Mortensen
BeniBela
15

The original ASCII table is encoded on 7 bits, and therefore it has 128 characters.

Nowadays, most readers/editors use an "extended" ASCII table (from ISO 8859-1), which is encoded on 8 bits and enjoys 256 characters (including Á, Ä, Œ, é, è and other characters useful for European languages as well as mathematical glyphs and other symbols).
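To make that concrete, here is a minimal Python sketch (the accented letter is just an example): an ISO 8859-1 character is still a single byte, it just uses a value above 127.

```python
# ISO 8859-1 ("latin-1") is a single-byte encoding: the first 128 values are
# plain ASCII, and the accented letters live in the range 128-255.
b = "é".encode("iso8859-1")
print(b, b[0])          # b'\xe9' 233  -- one byte, value above 127
assert len(b) == 1 and b[0] > 0x7F
```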

While UTF-8 uses the same encoding as the basic ASCII table (meaning 0x41 is A in both codes), it does not share the same encoding for the "Latin Extended-A" block, which sometimes causes weird characters to appear in words like à la carte or piñata.
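Here is roughly how that kind of gibberish arises, as a Python sketch assuming text written as UTF-8 is read back as a single-byte encoding such as ISO 8859-1:

```python
# "piñata" encoded as UTF-8 uses two bytes (0xC3 0xB1) for the single letter ñ.
raw = "piñata".encode("utf-8")

# A reader that wrongly assumes a single-byte encoding shows each of those
# bytes as its own character, producing the familiar gibberish.
print(raw.decode("iso8859-1"))   # piÃ±ata
```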

Peter Mortensen
Guillaume
  • There are several mistakes in the above. Œ is not part of ISO 8859-1, though it is in [CP-1252](https://en.wikipedia.org/wiki/Windows-1252). And the [Latin Extended-A](https://en.wikipedia.org/wiki/Latin_Extended-A) block is not the first 128 or 256 characters of Unicode: it is the block after those and contains letters like ğ, ł and ſ. – Richard Smith Oct 30 '17 at 22:21
  • Good point! I think I meant "Latin-1 Supplement". Standards, standards... – Guillaume Mar 21 '18 at 15:05
  • There are many "Extended ASCII" character sets and only one of them is ISO 8859-1. The term is almost meaningless, because when you are encoding and decoding text you have to know which specific character encoding is being used (and it might not even be an Extended ASCII character set). – Tom Blodget Jul 25 '18 at 00:33
8

ASCII encoding is 7-bit, but in practice, characters encoded in ASCII are not stored in groups of 7 bits. Instead, one ASCII character is stored in a byte, with the MSB usually set to 0 (yes, that bit is wasted in ASCII).

You can verify this by typing a string in the ASCII character set into a text editor, setting the encoding to ASCII, and viewing the bytes in a binary/hex viewer.
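The same check can be done in a couple of lines of Python (a sketch; the sample string is arbitrary):

```python
# Print each byte of an ASCII string in hex and binary.
# The leading (most significant) bit is always 0.
for b in "Hello".encode("ascii"):
    print(f"{chr(b)!r}: 0x{b:02X} = {b:08b}")
# 'H': 0x48 = 01001000
# 'e': 0x65 = 01100101
# 'l': 0x6C = 01101100
# 'l': 0x6C = 01101100
# 'o': 0x6F = 01101111
```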

Aside: the use of (strictly) ASCII encoding is now uncommon, in favor of UTF-8 (which does not waste the MSB mentioned above - in fact, an MSB of 1 indicates the code point is encoded with more than 1 byte).

flow2k
0

The original ASCII code provided 128 different characters numbered 0 to 127. ASCII and 7-bit are synonymous. Since the 8-bit byte is the common storage element, ASCII leaves room for 128 additional characters which are used for foreign languages and other symbols.

But the 7-bit code was originally made before the 8-bit byte became standard. ASCII stands for American Standard Code for Information Interchange. Early Internet mail systems only supported 7-bit ASCII.

To send programs and multimedia files over such systems, their 8-bit bytes first have to be turned into a 7-bit format using encoding methods such as MIME, uuencoding and BinHex. In other words, the 8-bit data is converted into 7-bit characters, which adds extra bytes to encode it.
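Base64, the transfer encoding most commonly used by MIME, gives a feel for that overhead; here is a small Python sketch (the payload is arbitrary):

```python
import base64

# Arbitrary 8-bit data that a 7-bit mail system cannot carry as-is.
payload = bytes(range(256))

# Base64 maps every 3 input bytes to 4 output characters, all of them
# plain 7-bit ASCII, so the encoded form is roughly a third larger.
encoded = base64.b64encode(payload)
print(len(payload), "->", len(encoded))      # 256 -> 344
assert all(b <= 0x7F for b in encoded)       # everything fits in 7 bits
```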

Peter Mortensen
brookey
0

Original ASCII contains unsigned (non-negative) values from 0 to 127 (128 characters). Extended ASCII uses 8 bits and therefore has 256 potential values. The working is below.

64 32 16 8 4 2 1   (place values of the 7 bits)
 1  1  1 1 1 1 1   = 127 (all ones), however
 0  0  0 0 0 0 0   = 0 also has to be accounted for, giving 127 + the first (the zeroth) value, and so 128 values in all.

-6

When we call ASCII a 7-bit code, the left-most bit is used as the sign bit, so with the remaining 7 bits we can write values up to 127.

That means a range of -128 to 127, whereas the full unsigned byte would run from 0 to 255. The 7-bit argument only holds if that last bit is considered the sign bit.

Peter Mortensen
aju