How can I check if char encoding is ASCII?

Question

I would like to write the following function:

int char_index(char c) 
{
  if (is_ascii<char>)
    return c - 'A';
  else 
    return c == 'A' ? 0 :
           c == 'B' ? 1 :
           // ...
}

Is there a function like is_ascii in std? I'm imagining something like std::numeric_limits<T>::is_iec559, which says whether some floating-point type T satisfies the requirements of the IEEE 754 standard.

I think I can implement is_ascii myself with something like if (65 == 'A' && ...) that enumerates the entire ASCII charset, and compares them to the int representation, but that's annoying. Also, I'm not sure how to check non-printable characters like SOH (Start Of Heading), etc.

Is it even possible to write this function in user code, or do I have to rely on the implementation to provide such a function?

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/221789/discussion-on-question-by-cigien-how-can-i-check-if-char-encoding-is-ascii). — Bhargav Rao, Sep 20 '20 at 19:09
I would just require the program to be compiled with an ASCII or UTF-8 compliant compiler. Beware though, that your function is not UTF-8 safe: It won't handle multibyte characters like `ä`. — cmaster - reinstate monica, Sep 21 '20 at 08:03
If you already have `return c == 'A' ? 0 : c == 'B' ? 1 :` etc, then you don't need `is_ascii`, it will work in either case. — n. m. could be an AI, Sep 22 '20 at 12:33
@n.'pronouns'm. Right, of course, I don't want to have to write that though. I'd be fine asserting that non-ascii encodings are not handled in that case. (The example just shows one way to handle it). — cigien, Sep 22 '20 at 12:43
Write in the documentation that non-ascii encodings are not handled and be done with it. — n. m. could be an AI, Sep 22 '20 at 12:50
@n.'pronouns'm. Documentation is fine, but the compiler doesn't care about it. I'd like the program to *know* when this happens. — cigien, Sep 22 '20 at 12:51
"I'd like the program to know when this happens". Why? Do you realistically expect that it will happen enough times to warrant spending your time on this before the Yellowstone erupts? — n. m. could be an AI, Sep 22 '20 at 13:40
Honestly, no. But it's still interesting to know if/how this can be done :) — cigien, Sep 22 '20 at 13:58

KamilCuk · Answer 1 · 2020-09-21T07:52:40.663

I assume that you want to check whether your compiler uses ASCII encoding when it translates the string literals and character literals in your source code to machine code.

Is there a function like is_ascii in std?

Not that I know of.

I can implement is_ascii myself with something like if (65 == 'A' && ...) that enumerates the entire ASCII charset

So do that. Check the characters that can appear as a c-char, that is, everything in the basic source character set:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

and escape sequences:

\a  \b  \f  \n  \r  \t  \v

There's no way to check the "entire" ASCII charset, because the compiler doesn't translate the program to the entire ASCII charset. It only maps the basic character set characters and the escape sequences to their machine representation, not the whole charset (though there may be compiler extensions).

but that's annoying.

But that's the only way. To verify that your implementation uses a particular character set, you have to check every character it uses. So check them. The check is going to be evaluated at compile time (consteval) anyway.
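The enumeration can be shortened somewhat by checking runs of consecutive code points with a small constexpr helper. The following is only a sketch: run_matches and literals_look_ascii are hypothetical names, not part of any library, and the loop inside a constexpr function assumes C++14 or later. It covers the printable characters only; the escape sequences would still need individual comparisons.

```cpp
// Hypothetical helper: true if the characters of `s` have the consecutive
// values first, first + 1, ... in the encoding the compiler uses for literals.
constexpr bool run_matches(const char* s, int first)
{
    for (int i = 0; s[i] != '\0'; ++i)
        if (s[i] != first + i)
            return false;
    return true;
}

// The four runs skip '$', '@' and '`', which are not in the basic source
// character set (as of C++20).
constexpr bool literals_look_ascii()
{
    return run_matches(" !\"#", 0x20)
        && run_matches("%&'()*+,-./0123456789:;<=>?", 0x25)
        && run_matches("ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_", 0x41)
        && run_matches("abcdefghijklmnopqrstuvwxyz{|}~", 0x61);
}
```

A static_assert(literals_look_ascii(), "...") would then turn a mismatch into a compile error instead of a silent runtime difference.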

how to check non-printable characters like SOH (Start Of Heading), etc.

Don't. The SOH character cannot appear inside a character literal, so you don't have to check it: there is no way to express it in the C language. There is no \SOH escape sequence, and the 0x01 byte is not in the basic character set. Your compiler never translates a sequence of characters to the SOH character. A valid program is composed only of characters from the basic source character set. The interpretation of the SOH character is up to whatever receives it, and if I write '\001' it is going to be a byte equal to 1 irrespective of the encoding.
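As a small illustration of the last point (a sketch, not taken from any particular implementation's documentation): octal and hexadecimal escapes denote a numeric value directly, so these assertions are expected to hold whichever execution character set is selected.

```cpp
// '\001' and '\x01' name the value 1 itself, not "the SOH character",
// so no encoding table is consulted when they are translated.
static_assert('\001' == 1, "octal escape is a plain number");
static_assert('\x01' == 1, "hex escape is a plain number");
```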

Meh, let's write it! The following program:

#include <type_traits>
#include <algorithm>
constexpr bool compiler_uses_ascii() {
    return 
        '\a'==0x07  &&  '\b'==0x08  &&  '\t'==0x09  &&  '\n'==0x0a  &&  '\v'==0x0b  &&  '\f'==0x0c  &&
        '\r'==0x0d  &&  '!'==0x21   &&  '#'==0x23   &&  '%'==0x25   &&  '&'==0x26   &&  '\''==0x27  &&
        '('==0x28   &&  ')'==0x29   &&  '*'==0x2a   &&  '+'==0x2b   &&  ','==0x2c   &&  '-'==0x2d   &&
        '.'==0x2e   &&  '/'==0x2f   &&  '0'==0x30   &&  '1'==0x31   &&  '2'==0x32   &&  '3'==0x33   &&
        '4'==0x34   &&  '5'==0x35   &&  '6'==0x36   &&  '7'==0x37   &&  '8'==0x38   &&  '9'==0x39   &&
        ':'==0x3a   &&  ';'==0x3b   &&  '<'==0x3c   &&  '='==0x3d   &&  '>'==0x3e   &&  '?'==0x3f   &&
        'A'==0x41   &&  'B'==0x42   &&  'C'==0x43   &&  'D'==0x44   &&  'E'==0x45   &&  'F'==0x46   &&
        'G'==0x47   &&  'H'==0x48   &&  'I'==0x49   &&  'J'==0x4a   &&  'K'==0x4b   &&  'L'==0x4c   &&
        'M'==0x4d   &&  'N'==0x4e   &&  'O'==0x4f   &&  'P'==0x50   &&  'Q'==0x51   &&  'R'==0x52   &&
        'S'==0x53   &&  'T'==0x54   &&  'U'==0x55   &&  'V'==0x56   &&  'W'==0x57   &&  'X'==0x58   &&
        'Y'==0x59   &&  'Z'==0x5a   &&  '['==0x5b   &&  '\\'==0x5c  &&  ']'==0x5d   &&  '^'==0x5e   &&
        '_'==0x5f   &&  'a'==0x61   &&  'b'==0x62   &&  'c'==0x63   &&  'd'==0x64   &&  'e'==0x65   &&
        'f'==0x66   &&  'g'==0x67   &&  'h'==0x68   &&  'i'==0x69   &&  'j'==0x6a   &&  'k'==0x6b   &&
        'l'==0x6c   &&  'm'==0x6d   &&  'n'==0x6e   &&  'o'==0x6f   &&  'p'==0x70   &&  'q'==0x71   &&
        'r'==0x72   &&  's'==0x73   &&  't'==0x74   &&  'u'==0x75   &&  'v'==0x76   &&  'w'==0x77   &&
        'x'==0x78   &&  'y'==0x79   &&  'z'==0x7a   &&  '{'==0x7b   &&  '|'==0x7c   &&  '}'==0x7d   &&
        '~'==0x7e   &&  ' '==0x20   &&  '"'==0x22;
}
constexpr int char_index(char c)
{
    if constexpr (compiler_uses_ascii()) {
        return c - 'A';
    } else {
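        // Look c up in the alphabet; the literal is encoded the same way as c.
        const char a[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        const char* const end = a + sizeof(a) - 1;  // exclude the terminating '\0'
        const char* const pos = std::find(a, end, c);
        return pos == end ? -1 : static_cast<int>(pos - a);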
#if 0
        return
            c == 'A' ? 0 :  c == 'B' ? 1 :  c == 'C' ? 2 :  c == 'D' ? 3 :
            c == 'E' ? 4 :  c == 'F' ? 5 :  c == 'G' ? 6 :  c == 'H' ? 7 :
            c == 'I' ? 8 :  c == 'J' ? 9 :  c == 'K' ? 10 : c == 'L' ? 11 :
            c == 'M' ? 12 : c == 'N' ? 13 : c == 'O' ? 14 : c == 'P' ? 15 :
            c == 'Q' ? 16 : c == 'R' ? 17 : c == 'S' ? 18 : c == 'T' ? 19 :
            c == 'U' ? 20 : c == 'V' ? 21 : c == 'W' ? 22 : c == 'X' ? 23 :
            c == 'Y' ? 24 : c == 'Z' ? 25 : -1;
#endif
    }
}
#include <iostream>
int main() {
    std::cout << compiler_uses_ascii() << " " << char_index('B') << "\n";
}

when executed outputs:

$ g++ 1.cpp -std=c++20 && ./a.out
1 1
$ g++ 1.cpp -fexec-charset=IBM-1047 -std=c++20 && ./a.out
0@1%
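The fallback branch can also be written without std::find, which makes the "not found" result explicit. This is a sketch only: char_index_portable is a hypothetical name, and it assumes C++14 or later. Because the alphabet is itself a string literal, it is translated with the same encoding as the argument, so the lookup should not depend on the letters being contiguous.

```cpp
// Hypothetical encoding-independent variant: returns 0..25 for 'A'..'Z'
// and -1 for anything else.
constexpr int char_index_portable(char c)
{
    constexpr char alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (int i = 0; alphabet[i] != '\0'; ++i)
        if (alphabet[i] == c)
            return i;
    return -1;
}
```

The price is a linear scan over at most 26 characters instead of a single subtraction, which is why the ASCII branch is still worth keeping where the check succeeds.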

So if I understand what you're saying, it's not possible in user code to check the entire character set? — cigien, Sep 21 '20 at 00:31
It's not possible to check the entire character set because not all characters are used. You can only check the characters that are used. "Encoding" is generally mapping between one byte to another. You want to check if specific encoding is used - well, you can check only characters that are in the mapping, it's not possible to check other bytes, because.. they are not in the map. — KamilCuk, Sep 21 '20 at 06:57

score 0 · Answer 2 · answered Sep 21 '20 at 11:20

You can often get encoding information from std::locale("").name(), although it will almost never be ASCII but rather some superset of it such as UTF-8 or CP1252 (unless the base encoding is a non-ASCII one such as EBCDIC). No one uses pure ASCII nowadays.

If Boost is allowed, then boost::locale::info::encoding() gives you more reliable encoding information. You still need to check the encoding to see if it covers the ASCII set, though.

To see if an encoding is a superset of ASCII you can check the list here

answered Sep 21 '20 at 11:20 by phuclv

`No one uses pure ASCII nowadays` Well, hm.. With the continuous rise of embedded and IOT devices, those tend to use `newlib` in `-nano` version. They all use pure `ascii`. The `wcs*` functions are just stubs that like copy/truncate bytes and do nothing. Yes, people use pure ascii environments. – KamilCuk Sep 22 '20 at 12:47

IlCapitano · Answer 3 · 2020-09-22T12:24:44.187

To check whether the encoding in use is ASCII-compatible, you could use C++11's (or C11's) Unicode escape sequences and check that all Unicode code points in the range 0x00 to 0x7f resolve to the same integer value.

constexpr bool is_ascii_compatible()
{
    return '\u0000' == 0x00 && '\u0001' == 0x01 && '\u0002' == 0x02 && '\u0003' == 0x03 &&
           '\u0004' == 0x04 && '\u0005' == 0x05 && '\u0006' == 0x06 && '\u0007' == 0x07 &&
           '\u0008' == 0x08 && '\u0009' == 0x09 && '\u000a' == 0x0a && '\u000b' == 0x0b &&
           '\u000c' == 0x0c && '\u000d' == 0x0d && '\u000e' == 0x0e && '\u000f' == 0x0f &&
           '\u0010' == 0x10 && '\u0011' == 0x11 && '\u0012' == 0x12 && '\u0013' == 0x13 &&
           '\u0014' == 0x14 && '\u0015' == 0x15 && '\u0016' == 0x16 && '\u0017' == 0x17 &&
           '\u0018' == 0x18 && '\u0019' == 0x19 && '\u001a' == 0x1a && '\u001b' == 0x1b &&
           '\u001c' == 0x1c && '\u001d' == 0x1d && '\u001e' == 0x1e && '\u001f' == 0x1f &&
           '\u0020' == 0x20 && '\u0021' == 0x21 && '\u0022' == 0x22 && '\u0023' == 0x23 &&
           '\u0024' == 0x24 && '\u0025' == 0x25 && '\u0026' == 0x26 && '\u0027' == 0x27 &&
           '\u0028' == 0x28 && '\u0029' == 0x29 && '\u002a' == 0x2a && '\u002b' == 0x2b &&
           '\u002c' == 0x2c && '\u002d' == 0x2d && '\u002e' == 0x2e && '\u002f' == 0x2f &&
           '\u0030' == 0x30 && '\u0031' == 0x31 && '\u0032' == 0x32 && '\u0033' == 0x33 &&
           '\u0034' == 0x34 && '\u0035' == 0x35 && '\u0036' == 0x36 && '\u0037' == 0x37 &&
           '\u0038' == 0x38 && '\u0039' == 0x39 && '\u003a' == 0x3a && '\u003b' == 0x3b &&
           '\u003c' == 0x3c && '\u003d' == 0x3d && '\u003e' == 0x3e && '\u003f' == 0x3f &&
           '\u0040' == 0x40 && '\u0041' == 0x41 && '\u0042' == 0x42 && '\u0043' == 0x43 &&
           '\u0044' == 0x44 && '\u0045' == 0x45 && '\u0046' == 0x46 && '\u0047' == 0x47 &&
           '\u0048' == 0x48 && '\u0049' == 0x49 && '\u004a' == 0x4a && '\u004b' == 0x4b &&
           '\u004c' == 0x4c && '\u004d' == 0x4d && '\u004e' == 0x4e && '\u004f' == 0x4f &&
           '\u0050' == 0x50 && '\u0051' == 0x51 && '\u0052' == 0x52 && '\u0053' == 0x53 &&
           '\u0054' == 0x54 && '\u0055' == 0x55 && '\u0056' == 0x56 && '\u0057' == 0x57 &&
           '\u0058' == 0x58 && '\u0059' == 0x59 && '\u005a' == 0x5a && '\u005b' == 0x5b &&
           '\u005c' == 0x5c && '\u005d' == 0x5d && '\u005e' == 0x5e && '\u005f' == 0x5f &&
           '\u0060' == 0x60 && '\u0061' == 0x61 && '\u0062' == 0x62 && '\u0063' == 0x63 &&
           '\u0064' == 0x64 && '\u0065' == 0x65 && '\u0066' == 0x66 && '\u0067' == 0x67 &&
           '\u0068' == 0x68 && '\u0069' == 0x69 && '\u006a' == 0x6a && '\u006b' == 0x6b &&
           '\u006c' == 0x6c && '\u006d' == 0x6d && '\u006e' == 0x6e && '\u006f' == 0x6f &&
           '\u0070' == 0x70 && '\u0071' == 0x71 && '\u0072' == 0x72 && '\u0073' == 0x73 &&
           '\u0074' == 0x74 && '\u0075' == 0x75 && '\u0076' == 0x76 && '\u0077' == 0x77 &&
           '\u0078' == 0x78 && '\u0079' == 0x79 && '\u007a' == 0x7a && '\u007b' == 0x7b &&
           '\u007c' == 0x7c && '\u007d' == 0x7d && '\u007e' == 0x7e && '\u007f' == 0x7f;
}

Demo here.

Edit:

I'll expand a bit on the solution, as it's a bit vague by itself. As KamilCuk's answer already said, in C and C++ only a limited number of characters can appear in a character literal. Something like start of heading (0x01 in ASCII) cannot be represented by a character literal without using the Unicode escape sequences \u or \U, because there is no dedicated escape sequence for it as there is for some special characters (e.g. \n for new line). Using \x01 or \001 wouldn't work either, because they only represent a byte of data and not the corresponding ASCII character.

By using Unicode escape sequences we can represent any ASCII character in all supported encodings, because the Unicode code points in the range 0x00 to 0x7f correspond to the same characters as in ASCII. This means that the expression '\u0041' == 'A' must evaluate to true with any execution character set. With this we can test whether every ASCII character resolves to the same integer value as in ASCII, which is what the code above does.
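A short spot check of this claim, as a sketch rather than a replacement for the full function above: each assertion compares a universal character name with the corresponding ordinary literal. It assumes a C++11 or later compiler. Note that these assertions do not test for ASCII; they only check that both spellings denote the same character.

```cpp
// Both sides of each comparison go through the same literal encoding,
// so they should agree whether or not that encoding is ASCII.
static_assert('\u0041' == 'A', "U+0041 is LATIN CAPITAL LETTER A");
static_assert('\u0030' == '0', "U+0030 is DIGIT ZERO");
static_assert('\u007e' == '~', "U+007E is TILDE");
```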

score -2 · Answer 4 · answered Sep 20 '20 at 19:39

There is no need to check for ASCII specifically. All you are really interested in is getting an index within a sequential alphabet of letters, and you can let the language handle that for you:

int char_index(char c)
{
    if (c >= 'A' && c <= 'Z')
        return c - 'A';
    return -1;
}

answered Sep 20 '20 at 19:39 by Remy Lebeau

[EBCDIC](https://en.wikipedia.org/wiki/EBCDIC) does not have A-Z contiguous. – Boann Sep 20 '20 at 19:49
@Boann perhaps, but who would ever write their source code in EBCDIC? If you write your source code in a given charset, and configure your compiler to process the source code as that charset, then you can write code that uses character literals in that charset, and assume certain characteristics of that charset, like character order. – Remy Lebeau Sep 21 '20 at 03:13
