0

I created a function in C++ to filter some characters, but it doesn't work with characters between 128 and 256 in ASCII.

string parseString(string str) {
    string result = "";
    string temp = "";

    for (int i = 0; i < str.size(); ++i) {
        if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z') || (str[i] >= 'á' && str[i] <= 'û') || (str[i] >= 160 && str[i] <= 165) || (str[i] >= 198 && str[i] <= 199) || str[i] == 39) {
            result += tolower(str[i]);
        }
    }

    return result;
}

Some examples:

parseString('word@#$%¨$%#$@#%$'); // returns word
// however
parseString('Fréderic'); // returns Frederic, however the function doesn't filter character 130

How can I use ASCII 256 in C++?

Charles Braga
  • Start with [How do I properly use std::string on UTF-8 in C++? - Stack Overflow](https://stackoverflow.com/questions/50403342/how-do-i-properly-use-stdstring-on-utf-8-in-c). – user202729 Jun 19 '22 at 11:25
  • `'word@#$%¨$%#$@#%$'` and `'Fréderic'` are not string literals. – fabian Jun 19 '22 at 11:26
  • `'word@#$%¨$%#$@#%$'` is not a string, it's a multi-character constant, which actually has type `int`. And modern code typically uses UTF-8, so you have to handle Unicode properly. – phuclv Jun 19 '22 at 11:26
  • Or [c++ - set encoding of string literals to latin1 with gcc - Stack Overflow](https://stackoverflow.com/questions/25107831/set-encoding-of-string-literals-to-latin1-with-gcc) / [How to convert a String from UTF8 to Latin1 in C/C++? - Stack Overflow](https://stackoverflow.com/questions/12855643/how-to-convert-a-string-from-utf8-to-latin1-in-c-c). – user202729 Jun 19 '22 at 11:29
  • *it doesn't work with character between 128 to 256 in ASCII* Nor should it, since ASCII is a 7-bit encoding going from 0 to 127. So anything in the 128 to 256 range isn't ASCII. – Eljay Jun 19 '22 at 11:31
  • Also `char` doesn't have any characters between 128 and 256, not everywhere. `char` can be signed or unsigned. – Goswin von Brederlow Jun 19 '22 at 14:47
  • @user202729 please please don't use Latin1. For example, printing Latin1-encoded strings to your UTF-8 terminal might print the incorrect characters. Let us embrace the Unicode standard as the one standard to rule them all (the languages). – Jakob Stark Jun 19 '22 at 15:38

3 Answers

2

The type `char` is used as the element type of `std::string`. Whether `char` is signed or unsigned depends on the environment. You should cast the value to `unsigned char` before the comparison.

#include <cctype>
#include <cstddef>
#include <string>
using std::string;

string parseString(string str) {
    string result = "";

    for (std::size_t i = 0; i < str.size(); ++i) {
        // char may be signed; convert to unsigned char so bytes above 127
        // compare as 128..255 instead of as negative values
        unsigned char c = static_cast<unsigned char>(str[i]);
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= 'á' && c <= 'û') || (c >= 160 && c <= 165) || (c >= 198 && c <= 199) || c == 39) {
            result += tolower(c);
        }
    }

    return result;
}

Also, single quotes '' are for character constants in C++. You should use double quotes "" to express strings.

parseString("word@#$%¨$%#$@#%$");
parseString("Fréderic");

Even after this change, your code (especially the part `(c >= 'á' && c <= 'û')`) may not work if you are using a character set that uses multiple bytes to express á and û (for example, UTF-8).
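For example, with a UTF-8 source file and UTF-8 input, "é" is stored as two bytes, so no single `unsigned char` value can ever compare equal to it. A minimal sketch (assuming UTF-8; the printed bytes will differ with other encodings) that dumps the bytes of a string:

#include <cstdio>
#include <string>

int main() {
    std::string s = "é";  // two bytes under UTF-8
    for (unsigned char c : s) {
        // print each byte of the encoded string as hex
        std::printf("0x%02x ", static_cast<unsigned>(c));
    }
    std::printf("\n");    // typical UTF-8 output: 0xc3 0xa9
}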

MikeCAT
1

Comparing non-ASCII characters is a challenging problem. You have to account for the following issues:

  1. There are different encodings (ASCII is not one of them) that can encode the character 'é'. Possible encodings are, for example:

    • in ISO/IEC 8859 as 1 byte 0xe9
    • in Unicode UTF-16 as 2 bytes 0x00 0xe9
    • in Unicode UTF-8 as 2 bytes 0xc3 0xa9
  2. In Unicode, which is most likely what you are using, there is even more than one possible code point sequence for é:

    • From the Latin-1 Supplement block:
      U+00E9 (0x00 0xe9 in UTF-16 or 0xc3 0xa9 in UTF-8)
    • A normal e combined with a diacritical mark:
      U+0065 U+0301 (0x00 0x65 0x03 0x01 in UTF-16 or 0x65 0xcc 0x81 in UTF-8)

The first problem is easy to take care of. You need to find out what character encoding your text editor uses when it writes the é to the source file, and you have to adjust your algorithm to work with multibyte characters. In UTF-8, for example, each multibyte character starts with a 0b11... lead byte followed by one or more 0b10... continuation bytes, while single-byte characters start with 0b0....
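As a minimal sketch of that idea (assuming valid UTF-8 input and no error handling), you can split a string into per-code-point chunks by starting a new chunk at every byte that is not a 0b10... continuation byte:

#include <iostream>
#include <string>
#include <vector>

// Split a UTF-8 string into one std::string per code point (no validation).
std::vector<std::string> splitUtf8(const std::string &s) {
    std::vector<std::string> chars;
    for (unsigned char c : s) {
        // Continuation bytes look like 0b10xxxxxx; everything else starts a new character.
        if ((c & 0xC0) == 0x80 && !chars.empty())
            chars.back() += static_cast<char>(c);
        else
            chars.emplace_back(1, static_cast<char>(c));
    }
    return chars;
}

int main() {
    for (const std::string &ch : splitUtf8("Fréderic"))
        std::cout << ch << ' ';   // prints: F r é d e r i c
    std::cout << '\n';
}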

The second problem is only a problem if you work with user input. Unicode defines a procedure known as normalization, which transforms equivalent code point sequences (like the two ways to represent 'é' above) into a canonical form that can be used for comparison.

If you now think (like I would) that this is way too much complicated stuff, I would recommend using a string library that is able to deal with these kinds of things properly. A starting point could be the answers to this question.
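As one possible illustration (assuming the ICU library is available; this is not part of the standard C++ library), NFC normalization could look roughly like this sketch:

#include <iostream>
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // Real code should check U_FAILURE(status) after each ICU call.
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);

    // "e" followed by U+0301 COMBINING ACUTE ACCENT (the decomposed form of é, in UTF-8)
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("e\xcc\x81");
    icu::UnicodeString composed = nfc->normalize(decomposed, status);

    std::string out;
    composed.toUTF8String(out);
    std::cout << out << '\n';  // prints é as the single code point U+00E9
}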

Jakob Stark
-1

There is no é in ASCII. ASCII stops at 127.

But in Unicode é is 233. That might explain why your filter fails. But to really understand what is happening you need to know what encoding your compiler is using.

Try this code

cout << (int)'é' << '\n';

and see what number it prints.
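Note that the exact number is implementation-defined, because with a multibyte source encoding 'é' becomes a multi-character constant. A slightly extended sketch (the commented values are only what you would typically see, depending on your compiler and source encoding) that also checks how many bytes the string "é" occupies:

#include <iostream>
#include <string>

int main() {
    std::cout << (int)'é' << '\n';                 // e.g. 233 with a Latin-1 source, a larger value with UTF-8
    std::cout << std::string("é").size() << '\n';  // 1 byte with Latin-1, 2 bytes with UTF-8
}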

john