Clean a string for non alphabetical characters

Question

I am trying to clean a string in C++. I would like to strip all non-alphabetical characters and leave all kinds of English AND non-English letters untouched. One of my test programs looks like this:

#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
string test = "Danish letters: Æ Ø Å !!!!!!??||~";
cout << "Test = " << test << endl;

for(int l = 0;l<test.size();l++)
{
    if(!isalpha(test.at(l)) && test.at(l) != ' ')
    {
        test.replace(l,1," nope");  
    }
}

cout << "Test = " << test << endl;

return 0;

}

Which gives me the output:

Test = Danish letters: Æ Ø Å !!!!!!??||~
Test = Danish letters nope  nope nope  nope nope  nope nope  nope nope nope nope nope nope nope nope nope nope nope"

So my question is, how do I remove the "!!!!!!??||~" and not the "Æ Ø Å"?

I've also tried tests like

test.at(l)!='Å'

but I can't compile if I declare 'Å' as a char.

I've read about unicode and utf8, but I don't really understand it.

Please help me out :)

Well, you need to keep reading about unicode and utf8 until you do understand it, and then everything should be crystal clear. — Sam Varshavchik, Oct 01 '16 at 20:40
You might want to look at the SO question titled [How to strip all non alphanumeric characters from a string](http://stackoverflow.com/questions/6319872/how-to-strip-all-non-alphanumeric-characters-from-a-string-in-c). I am also interested to see if [std::isalnum](http://en.cppreference.com/w/cpp/string/byte/isalnum) is of use in your case. — RawN, Oct 01 '16 at 20:49
@RawN: Both of those links are for ASCII only, and this question is (implicitly) about non-ASCII. — Mooing Duck, Oct 01 '16 at 22:16
@TomBlodget: Technically, you're correct. Technically they only work for a legacy subset of character encodings. They don't work for Unicode characters, which this code is probably dealing with. — Mooing Duck, Oct 03 '16 at 01:47
"can't compile, if I declare 'Å' as a char"—Make sure your compiler is reading your source file with the encoding you are saving it with. Then, if the problem still occurs, you'll know it's because [Å](http://www.fileformat.info/info/unicode/char/00c5/index.htm) is not one `char` in the target character encoding. — Tom Blodget, Oct 03 '16 at 02:26
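
To make the last comment concrete, here is a minimal sketch, assuming the string is encoded as UTF-8 (the question does not say which encoding is in use). The helper names `a_ring_utf8`, `byte_count` and `is_non_ascii_byte` are hypothetical and exist only for illustration. The bytes are written as escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns "Å" (U+00C5) as its two UTF-8 bytes.
std::string a_ring_utf8() {
    return std::string("\xC3\x85");
}

// Counts char elements (bytes), not user-visible characters.
std::size_t byte_count(const std::string& s) {
    return s.size();
}

// True if the byte has its high bit set, meaning it is outside ASCII
// and, in UTF-8, belongs to a multi-byte sequence.
bool is_non_ascii_byte(char c) {
    return (static_cast<unsigned char>(c) & 0x80) != 0;
}
```

Under that assumption `byte_count(a_ring_utf8())` is 2, which is why a loop that calls `isalpha` on one `char` at a time sees two unrelated bytes instead of one letter, and why `'Å'` does not fit in a single `char`.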

Mo Abdul-Hameed · Accepted Answer · 2016-10-01T22:38:58.390

1

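A char holds a single byte, which is enough for the ASCII character set, but your string contains non-ASCII characters. In a multi-byte encoding such as UTF-8, each of those characters spans more than one char, so testing one char at a time with isalpha fails.

Since you are working with Unicode characters, one option is to use wide strings and the wide character functions: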

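#include <clocale>
#include <cwctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // iswalpha follows the current C locale. In the default "C" locale it may
    // recognise only ASCII letters, so pick up the user's locale first.
    setlocale(LC_ALL, "");

    wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~";
    wcout << L"Test = " << test << endl;

    for(wstring::size_type i = 0; i < test.size(); i++) {

        // Replace anything that is neither a letter nor a space.
        if(!iswalpha(test.at(i)) && test.at(i) != L' ') {

            test.replace(i, 1, L" nope");
        }
    }

    wcout << L"Test = " << test << endl;

    return 0;
}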
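
The loop above replaces each unwanted character with " nope", while the question asks how to remove them. A minimal sketch of the removal variant follows, using the erase-remove idiom. The function name `strip_non_letters` is hypothetical, and the result for non-ASCII letters is an assumption that depends on the locale selected with `setlocale` and on how wide `wchar_t` is on the platform.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Returns a copy of `text` without the characters that are neither a letter
// (as judged by std::iswalpha under the current C locale) nor a space.
std::wstring strip_non_letters(std::wstring text) {
    text.erase(
        std::remove_if(text.begin(), text.end(),
                       [](wchar_t c) {
                           return !std::iswalpha(static_cast<std::wint_t>(c)) && c != L' ';
                       }),
        text.end());
    return text;
}
```

Calling `strip_non_letters(L"abc !!??||~")` yields `L"abc "`. Erasing in one pass also avoids re-scanning text that the loop itself inserted.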

You can also make use of Qt and use QString, so the same piece of code will become:

QString test = "Danish letters: Æ Ø Å !!!!!!??||~";
qDebug() << "Test =" << test;

for(int i = 0; i < test.size(); i++) {

    // isLetter() keeps letters only; isLetterOrNumber() would also keep digits.
    if(!test.at(i).isLetter() && test.at(i) != ' ') {

        test.replace(i, 1, " nope");
    }
}

qDebug() << "Test = " << test;

edited Oct 01 '16 at 22:38

answered Oct 01 '16 at 22:13


Yes, this code leaves only English and non-English letters (and spaces) because we are using iswalpha. – Mo Abdul-Hameed Oct 01 '16 at 22:24
Wow, my example of emoji was very badly thought out. Starting over: C++ wide functions and classes only work on the basic multilingual plane, and fail when given characters in supplementary planes, which currently contain 73000 characters, some of which are bound to be alphabetical characters. `iswalpha` is _broken_. https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane – Mooing Duck Oct 01 '16 at 22:30
@MooingDuck The wide character API works with an *implementation-defined* fixed-width encoding that might have nothing to do with Unicode. It can be based on UTF-16 like on Windows with the effect that characters beyond the BMP aren't handled properly, or it can use UTF-32 like on Linux which makes full Unicode support possible. Or it can use a completely different character set. – nwellnhof Oct 03 '16 at 11:35
@nwellnhof: I forgot how implementation defined wide characters are. You're right, for 4byte wides, then yes, they can cleanly handle all of Unicode. But for 2 byte wides, there's no possible implementation to handle all of Unicode. – Mooing Duck Oct 03 '16 at 16:24
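
The point about 2-byte versus 4-byte wide characters can be sketched with the fixed-width string types from C++11, which behave the same on every platform. The helper names below are hypothetical. U+10400 (DESERET CAPITAL LETTER LONG I) is used as an example of a letter outside the basic multilingual plane.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units needed for U+10400: a surrogate pair.
std::size_t utf16_units() {
    return std::u16string(u"\U00010400").size();
}

// Number of UTF-32 code units needed for U+10400: exactly one.
std::size_t utf32_units() {
    return std::u32string(U"\U00010400").size();
}

// True if a UTF-16 code unit is one half of a surrogate pair and therefore
// not a complete character on its own.
bool is_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDFFF;
}
```

Here `utf16_units()` is 2 and `utf32_units()` is 1. A function such as `iswalpha` receives one `wchar_t` at a time, so with a 16-bit `wchar_t` it only ever sees a lone surrogate for such a letter and cannot classify it.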

score 1 · Answer 2 · answered Oct 01 '16 at 23:31

Here is a code example. You can play with different locales and experiment until you get what you want. You may also experiment with u16string, u32string, etc. Working with locales is a bit confusing at the beginning, because most people program in ASCII.

Call the function I wrote from your main function, as shown at the bottom:

#include <iostream>
#include <string>
#include <codecvt>
#include <sstream>
#include <locale>

using namespace std;

wstring removeNonAlpha(const wstring &input) {
   // Convert wide output to UTF-8 with the codecvt facet of the named locale.
   locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t> ("en_US.UTF-8"));
   wcout.imbue(utf8locale);
   wcout << input << endl;
   wstring res;
   // Locale used for classification; the constructor throws runtime_error
   // if en_US.UTF-8 is not installed on the system.
   std::locale loc2("en_US.UTF-8");
   for(wstring::size_type l = 0; l<input.size(); l++) {
      // Keep letters and whitespace, drop everything else.
      if(isalpha(input[l], loc2) || isspace(input[l], loc2)) {
         wcout << L"is char\n";
         res += input[l];
      }
      else {
         wcout << L"is not char\n";
      }
   }
   wcout << L"Hello, wide to multibyte world!" << endl;
   wcout << res << endl;
   // Only wcout is used here, because mixing cout and wcout on the same
   // output stream is not reliable.
   wcout << std::isalpha(L'Я', loc2) << endl;
   return res;
}

int main() {
   wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~ Πυθαγόρας ὁ Σάμιος";
   removeNonAlpha(test);
   return 0;
}

`wchar_t` is not guaranteed to be wide enough to represent a Unicode code point. On Windows it is 16 bit and represents a UTF-16 code unit, not a code point. — roeland, Oct 03 '16 at 20:58
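
Given the caveats about `wchar_t`, one pragmatic alternative is to stay with a UTF-8 `std::string` and only filter the ASCII range. The following is a sketch under a strong assumption that none of the answers makes: the input is valid UTF-8 and every non-ASCII character should be kept, whether or not it is really a letter. The function name `strip_ascii_non_letters` is hypothetical.

```cpp
#include <string>

// Keeps ASCII letters, spaces, and every byte >= 0x80 (all bytes of a
// multi-byte UTF-8 sequence have the high bit set). Other ASCII bytes
// such as digits and punctuation are dropped.
// Assumption: all non-ASCII characters count as letters, which is not true
// in general (for example, non-ASCII punctuation would be kept as well).
std::string strip_ascii_non_letters(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char c : utf8) {
        const bool ascii_letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        const bool non_ascii = c >= 0x80;
        if (ascii_letter || non_ascii || c == ' ') {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Applied to the UTF-8 bytes of the question's test string, this drops the colon and the trailing "!!!!!!??||~" and leaves Æ, Ø and Å intact. It needs no locale, at the cost of not classifying non-ASCII characters at all.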
