0

I created a function in C++ to filter some characters, but it doesn't work with characters between 128 and 256 in ASCII.

string parseString(string str) {
    string result = "";
    string temp = "";

    for (int i = 0; i < str.size(); ++i) {
        if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z') || (str[i] >= 'á' && str[i] <= 'û') || (str[i] >= 160 && str[i] <= 165) || (str[i] >= 198 && str[i] <= 199) || str[i] == 39) {
            result += tolower(str[i]);
        }
    }

    return result;
}

Some examples:

parseString('word@#$%¨$%#$@#%$'); // returns word
// however
parseString('Fréderic'); // returns Frederic, however the function doesn't filter character 130

How can I use ASCII 256 in C++?

Charles Braga
  • Start with [How do I properly use std::string on UTF-8 in C++? - Stack Overflow](https://stackoverflow.com/questions/50403342/how-do-i-properly-use-stdstring-on-utf-8-in-c). – user202729 Jun 19 '22 at 11:25
  • `'word@#$%¨$%#$@#%$'` and `'Fréderic'` are not string literals. – fabian Jun 19 '22 at 11:26
  • `'word@#$%¨$%#$@#%$'` is not a string, it's a multi-character constant, which actually has type `int`. And modern code typically uses UTF-8, so you have to handle Unicode properly. – phuclv Jun 19 '22 at 11:26
  • Or [c++ - set encoding of string literals to latin1 with gcc - Stack Overflow](https://stackoverflow.com/questions/25107831/set-encoding-of-string-literals-to-latin1-with-gcc) / [How to convert a String from UTF8 to Latin1 in C/C++? - Stack Overflow](https://stackoverflow.com/questions/12855643/how-to-convert-a-string-from-utf8-to-latin1-in-c-c). – user202729 Jun 19 '22 at 11:29
  • *it doesn't work with character between 128 to 256 in ASCII* Nor should it, since ASCII is a 7-bit encoding going from 0 to 127. So anything in the 128 to 256 range isn't ASCII. – Eljay Jun 19 '22 at 11:31
  • Also `char` doesn't have any characters between 128 and 256, not everywhere. `char` can be signed or unsigned. – Goswin von Brederlow Jun 19 '22 at 14:47
  • @user202729 please please don't use Latin1. For example, printing Latin1-encoded strings to your UTF-8 terminal might print the incorrect characters. Let us embrace the Unicode standard as the one standard to rule them all (the languages). – Jakob Stark Jun 19 '22 at 15:38

3 Answers

2

The type `char` is used as the element type of `std::string`. Whether `char` is signed or unsigned depends on the environment. You should cast the value to `unsigned char` before the comparison.

#include <cctype>
#include <cstddef>
#include <string>
using std::string;

string parseString(string str) {
    string result = "";

    for (std::size_t i = 0; i < str.size(); ++i) {
        // char may be signed; convert to unsigned char so bytes above 127
        // compare as 128..255 instead of as negative values
        unsigned char c = static_cast<unsigned char>(str[i]);
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= 'á' && c <= 'û') || (c >= 160 && c <= 165) || (c >= 198 && c <= 199) || c == 39) {
            result += tolower(c);
        }
    }

    return result;
}

Also, single quotes '' are for character constants in C++. You should use double quotes "" to express strings.

parseString("word@#$%¨$%#$@#%$");
parseString("Fréderic");

Even after this change, your code (especially the part `(c >= 'á' && c <= 'û')`) may not work if you are using a character set that uses multiple bytes to express á and û (for example, UTF-8).
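For example, with a UTF-8 source file and UTF-8 input, "é" is stored as two bytes, so no single `unsigned char` value can ever compare equal to it. A minimal sketch (assuming UTF-8; the printed bytes will differ with other encodings) that dumps the bytes of a string:

#include <cstdio>
#include <string>

int main() {
    std::string s = "é";  // two bytes under UTF-8
    for (unsigned char c : s) {
        // print each byte of the encoded string as hex
        std::printf("0x%02x ", static_cast<unsigned>(c));
    }
    std::printf("\n");    // typical UTF-8 output: 0xc3 0xa9
}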

MikeCAT
1

Comparing non-ASCII characters is a challenging problem. You have to account for the following issues:

  1. There are different encodings (ASCII is not one of them) that can encode the character 'é'. Possible encodings are, for example:

    • in ISO/IEC 8859 as 1 byte 0xe9
    • in Unicode UTF-16 as 2 bytes 0x00 0xe9
    • in Unicode UTF-8 as 2 bytes 0xc3 0xa9
  2. In Unicode, which is most likely what you are using, there is even more than one possible code point sequence for é:

    • From the Latin-1 Supplement block:
      U+00E9 (0x00 0xe9 in UTF-16 or 0xc3 0xa9 in UTF-8)
    • A normal e combined with a diacritical mark:
      U+0065 U+0301 (0x00 0x65 0x03 0x01 in UTF-16 or 0x65 0xcc 0x81 in UTF-8)

The first problem is easy to take care of. You need to find out what character encoding your text editor uses when it writes the é to the source file, and you have to adjust your algorithm to work with multibyte characters. In UTF-8, for example, each multibyte character starts with a 0b11... lead byte followed by one or more 0b10... continuation bytes, while single-byte characters start with 0b0....
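As a minimal sketch of that idea (assuming valid UTF-8 input and no error handling), you can split a string into per-code-point chunks by starting a new chunk at every byte that is not a 0b10... continuation byte:

#include <iostream>
#include <string>
#include <vector>

// Split a UTF-8 string into one std::string per code point (no validation).
std::vector<std::string> splitUtf8(const std::string &s) {
    std::vector<std::string> chars;
    for (unsigned char c : s) {
        // Continuation bytes look like 0b10xxxxxx; everything else starts a new character.
        if ((c & 0xC0) == 0x80 && !chars.empty())
            chars.back() += static_cast<char>(c);
        else
            chars.emplace_back(1, static_cast<char>(c));
    }
    return chars;
}

int main() {
    for (const std::string &ch : splitUtf8("Fréderic"))
        std::cout << ch << ' ';   // prints: F r é d e r i c
    std::cout << '\n';
}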

The second problem is only a problem if you work with user input. Unicode defines a procedure known as normalization, which transforms equivalent code point sequences (like the two ways to represent 'é' above) into a canonical form that can be used for comparison.

If you now think (like I would) that this is way too much complicated stuff, I would recommend using a string library that is able to deal with these kinds of things properly. A starting point could be the answers to this question.
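As one possible illustration (assuming the ICU library is available; this is not part of the standard C++ library), NFC normalization could look roughly like this sketch:

#include <iostream>
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // Real code should check U_FAILURE(status) after each ICU call.
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);

    // "e" followed by U+0301 COMBINING ACUTE ACCENT (the decomposed form of é, in UTF-8)
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("e\xcc\x81");
    icu::UnicodeString composed = nfc->normalize(decomposed, status);

    std::string out;
    composed.toUTF8String(out);
    std::cout << out << '\n';  // prints é as the single code point U+00E9
}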

Jakob Stark
-1

There is no é in ASCII. ASCII stops at 127.

But in Unicode é is 233. That might explain why your filter fails. But to really understand what is happening you need to know what encoding your compiler is using.

Try this code

cout << (int)'é' << '\n';

and see what number it prints.
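Note that the exact number is implementation-defined, because with a multibyte source encoding 'é' becomes a multi-character constant. A slightly extended sketch (the commented values are only what you would typically see, depending on your compiler and source encoding) that also checks how many bytes the string "é" occupies:

#include <iostream>
#include <string>

int main() {
    std::cout << (int)'é' << '\n';                 // e.g. 233 with a Latin-1 source, a larger value with UTF-8
    std::cout << std::string("é").size() << '\n';  // 1 byte with Latin-1, 2 bytes with UTF-8
}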

john