8

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.

Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.

What is the correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?

Josh Lee
Marek Jedliński
  • 1
    I was trying to be concise, but perhaps I should have explained why I asked. I am a translator, my job is software localization. Often (still!) I encounter bugs where only those "national", "extended" characters in my language are garbled on display, usually because a wrong codepage was applied at some point. Therefore I need a term to refer to those specific characters, so that I don't always have to resort to a descriptive sentence, if possible. My audience are programmers, engineers and managers, for whom English isn't always their native tongue. – Marek Jedliński Oct 02 '09 at 21:05
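The wrong-codepage garbling described in this comment can be reproduced in a few lines. This is a minimal Python sketch, assuming the text was stored as UTF-8 and then mistakenly read as Windows-1252; the sample word and the helper name `garble` are illustrative and not taken from any real bug report.

```python
def garble(text, stored_as="utf-8", read_as="cp1252"):
    """Encode text with one encoding, then decode the bytes with another."""
    return text.encode(stored_as).decode(read_as)

# The ASCII letters survive the round trip; only the non-ASCII "ń" is damaged.
print(garble("Jedliński"))
```

Only the characters outside 0-127 change, which is why such bugs tend to go unnoticed with English-only test data.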

8 Answers

18

"Non-ASCII characters"
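For programmers, "non-ASCII" also has a simple operational meaning: any character whose code point is above 127. A small Python sketch of that check (the function name `non_ascii_chars` is hypothetical; `str.isascii()` requires Python 3.7 or later):

```python
def non_ascii_chars(text):
    """Return the characters of text whose code point is above 127."""
    return [ch for ch in text if ord(ch) > 127]

print(non_ascii_chars("Jedliński"))  # lists only the non-ASCII characters
print("plain text".isascii())        # built-in check for a whole string
```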

Aardvark
  • 1
    It seems definition by negation is the best we can do. As soon as we add "Unicode", the term won't be applicable in non-Unicode contexts, etc. I liked sgm's idea of "trans-ascii", but a fresh coinage won't cut it, especially when communicating across languages. – Marek Jedliński Oct 02 '09 at 20:55
2

ASCII does not define character codes above 127. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, and others chose other characters.

Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional western alphabets, but also Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages, both modern and ancient.

There are several implementations of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same in both ASCII and UTF-8.

That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be a custom implementation by a hardware or software supplier.
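The backwards compatibility can be checked directly. In this Python sketch (the helper name `utf8_bytes` and the sample characters are illustrative), an ASCII character encodes to the same single byte in UTF-8, while anything beyond 127 becomes a multi-byte sequence in which every byte is 128 or higher:

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a character as a list of byte values."""
    return list(ch.encode("utf-8"))

# ASCII text produces identical bytes under both encodings.
assert "hello".encode("ascii") == "hello".encode("utf-8")

for ch in ["A", "é", "€"]:
    print(ch, utf8_bytes(ch))
# A [65]
# é [195, 169]
# € [226, 130, 172]
```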

Jim C
  • 5
    The UTFs are not "implementations" of Unicode. They are encodings of Unicode text into bytestrings. Unicode text is represented as a sequence of numbers (*not* `int`s or `long`s, *numbers*), and the UTFs are ways of translating each number into a sequence of one or more bytes. – yfeldblum Oct 02 '09 at 19:56
  • Jim, thank you, but I am more or less aware of what those are :) I was only looking for a precise name. – Marek Jedliński Oct 02 '09 at 20:50
1

You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.

Nietzche-jou
  • 1
    I like "trans-ascii" and I think it correctly expresses the idea, but I am primarily looking for a good term to communicate the concept. Using a self-coined term may not do that :) – Marek Jedliński Oct 02 '09 at 20:53
0

"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".

Unicode is one possible set of Extended ASCII characters, and is quite, quite large.

UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.

Dean J
  • 2
    My thought was "extended ascii" would only refer to 128-255. Anything that cannot be expressed in that range isn't really ascii any more :) – Marek Jedliński Oct 02 '09 at 17:53
  • 2
    Note also (from wikipedia) that the use of the term 'extended ASCII' has been criticized, because it can be mistaken for an extension of the ASCII standard. – thomasrutter May 27 '10 at 05:53
  • @thomasrutter; if you're going to alter my answer that much in an edit, please just post a different answer, and/or leave a comment here at least? – Dean J May 27 '10 at 13:43
  • Gee, I was just trying to be helpful. I've rolled everything back. – thomasrutter May 29 '10 at 06:52
0

A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.

Depending on the character encoding you're using, it could be either:

  • an invalid bit sequence
  • a Unicode character
  • an ISO-8859-x character
  • a Microsoft 1252 character
  • a character in some other character encoding
  • a bug, binary data, etc

The one definition that would fit all of these situations is:

  • Not an ASCII character

To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
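To make the list above concrete, here is a rough Python sketch that interprets the same single byte under several encodings. The helper name `interpret` is hypothetical, and the codec names are those of Python's standard library; `None` stands for "not a valid sequence in this encoding".

```python
def interpret(raw, encoding):
    """Decode raw bytes with the given encoding, or return None if invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

raw = b"\xe9"  # one byte with the high bit set
for encoding in ["ascii", "utf-8", "latin-1", "cp1252", "cp437"]:
    print(encoding, repr(interpret(raw, encoding)))
```

Under ASCII and UTF-8 the byte is invalid on its own, under ISO-8859-1 and Windows-1252 it is "é", and under code page 437 it is a different character altogether.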

thomasrutter
0

These words are taken from an online resource (a cool website, though) because I found them useful and appropriate for writing an answer.

At first, ASCII included only capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127. This set of only 128 characters was published as a standard in 1967, containing everything you need to write in the English language.

In 1981, IBM developed an 8-bit extension of the ASCII code, called "code page 437". In this version, some obsolete control characters were replaced with graphic characters. Another 128 characters were also added, with new symbols, signs, graphics and Latin letters, plus the punctuation signs and characters needed to write texts in other languages, such as Spanish. In this way, the characters ranging from 128 to 255 were added.

IBM included support for this code page in the hardware of its model 5150, known as the "IBM PC" and considered the first personal computer. The operating system of this model, MS-DOS, also used this extended ASCII code.

Iqra.
-1

Non-ASCII Unicode characters.

Amok
  • 1
    This is incorrect. Unicode has nothing to do with ASCII, except for being backwards compatible for the first 128 code points. – Dervin Thunk Oct 02 '09 at 18:04
  • That's the point. All of the Unicode characters that don't have ASCII equivalents. – Amok Oct 02 '09 at 18:10
  • 2
    @Dervin: just as values over 127 have nothing to do with ASCII. – Joachim Sauer Mar 09 '10 at 13:50
  • A character outside of the ASCII range is not a Unicode character. It's a character outside of the ASCII range. Depending on the character encoding you're using, it's either an invalid bit sequence, a Unicode character, an ISO-8859-x character, a Microsoft 1252 character, or a character in some other character encoding. – thomasrutter May 27 '10 at 05:55
-1

If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the code pages that defined particular characters represented by particular values. Any multibyte value (greater than 255 decimal) is not ASCII.
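Since ASCII uses only 7 bits, the "high" range is exactly the set of byte values with the eighth bit set. A minimal Python sketch of that test (the function name `high_bytes` and the sample bytes are illustrative):

```python
def high_bytes(raw):
    """Return the byte values in raw that have the high (eighth) bit set."""
    return [b for b in raw if b & 0x80]

print(high_bytes(b"caf\xe9"))  # only the last byte is outside 7-bit ASCII
print(high_bytes(b"cafe"))     # pure ASCII, so the list is empty
```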

DaveE