9

Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.

bool invalidChar (char c)
{ 
    return !isprint((unsigned)c); 
}
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end()); 
}

I tested this method on "Prusæus, Ægyptians," and it did nothing. I also tried substituting isalnum for isprint.

The real problem occurs when, in another section of my program, I convert string->wstring->string. The conversion balks if there are Unicode chars in the string during the string->wstring step.

Ref:

How can you strip non-ASCII characters from a string? (in C#)

How to strip all non alphanumeric characters from a string in c++?

Edit:

I would still like to remove all non-ASCII chars regardless, but if it helps, here is where I am crashing:

// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH

Error Dialog

MSVC++ Debug Library

Debug Assertion Failed!

Program: //myproject

File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c

Line: //Above

Expression:(unsigned)(c+1)<=256

Edit:

Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

If someone else would like to copy/paste this, I can check this question off.

EDIT:

For future reference: try the __isascii and iswascii functions.

AnthonyW
  • What happens if you change invalidChar to always return true, and what happens when it always returns false? Additionally, log what invalidChar gets and its output. – Daniel Apr 16 '12 at 17:26
  • @Dani On it... (more chars to post) – AnthonyW Apr 16 '12 at 17:28
  • 1
    Make sure you call `setlocale(LC_ALL, "");` before you do the conversion. There's no point in a conversion if it can't handle non-ASCII characters, is there! – Kerrek SB Apr 16 '12 at 17:29
  • @Dani setting invalidChar to `return true` kicks out a blank string while `false` does nothing. I too suspected that to be the problem, yet I am unsure what method to use other than `isprint` and `isalnum`, as they do not seem to be getting the job done. – AnthonyW Apr 16 '12 at 17:33
  • @KerrekSB I have this: `setlocale(LC_ALL, ""); ` a few lines further down than the line that throws an error. I use it for converting wstring->string. Are you saying I should move that up a few lines? – AnthonyW Apr 16 '12 at 17:35
  • Yes, it must be the first thing in your program! – Kerrek SB Apr 16 '12 at 17:37
  • Is your environment's locale set to something useful? Try a few of the popular ones (`ISO-8859-15`, `UTF-8`). – Kerrek SB Apr 16 '12 at 17:51
  • @KerrekSB To be frank, I am not that familiar with what `setlocale` actually does. I'll try putting `setlocale(LC_ALL, "ISO-8859-15");` into the first line of `main` – AnthonyW Apr 16 '12 at 18:01
  • @KerrekSB I may be doing it wrong, but neither the above nor `setlocale(LC_ALL, "UTF-8");` had any effect. – AnthonyW Apr 16 '12 at 18:03
  • If you leave the `""` in, you can just set the locale in your shell: `LC_ALL=en_GB.utf8 ./myprog` – Kerrek SB Apr 16 '12 at 18:04

4 Answers

13

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

EDIT:

For future reference: try the __isascii and iswascii functions.
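
For anyone who wants to compile the range check directly, here is a minimal self-contained sketch. It is a variation on the code above, not a replacement for it: the names `isNonAscii` and `stripNonAscii` are made up for illustration, and the cast to `unsigned char` is my addition so that the test reads the same whether plain `char` is signed or unsigned.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// True for any byte outside the 7-bit ASCII range 0..127.
// Going through unsigned char keeps the comparison independent of
// whether plain char is signed on the target platform.
bool isNonAscii(char c)
{
    return static_cast<unsigned char>(c) > 127;
}

// Hypothetical helper: returns a copy of str with every non-ASCII byte
// removed, using the same erase/remove_if idiom as the answer.
std::string stripNonAscii(std::string str)
{
    str.erase(std::remove_if(str.begin(), str.end(), isNonAscii), str.end());
    return str;
}
```

For example, `stripNonAscii("Prus\xE6us")` (æ as the single Latin-1 byte 0xE6) yields `"Prusus"`. With UTF-8 input, æ is the two bytes 0xC3 0xA6; both are above 127, so both are removed and the result is the same.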

AnthonyW
  • 1
    Shouldn't this be `unsigned char` instead of `char`? A regular `char` will always be < 128... ``` #include <iostream> int main(){ char c = 129; std::cout << (c<128) << "\n"; return 0; }``` – CIsForCookies May 10 '23 at 14:48
2

At least one problem is in your invalidChar function. It should be:

return !isprint( static_cast<unsigned char>( c ) );

Casting a char to unsigned (that is, unsigned int) is likely to give a very, very big value if the char is negative (UINT_MAX+1 + c). Passing such a value to `isprint` is undefined behavior.
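
To make the difference between the two casts concrete, here is a small sketch. The function names are illustrative only, and the "negative char" case assumes a platform where plain `char` is signed (as with MSVC on x86 by default).

```cpp
#include <cassert>
#include <cctype>
#include <climits>

// The question's cast: char -> unsigned int. Where char is signed, a
// negative value wraps to UINT_MAX + 1 + c, far outside the 0..UCHAR_MAX
// range that the <cctype> functions accept.
unsigned int castToUnsigned(char c)
{
    return (unsigned)c;
}

// The recommended cast: char -> unsigned char, always in 0..UCHAR_MAX.
unsigned char castToUnsignedChar(char c)
{
    return static_cast<unsigned char>(c);
}

// isprint wrapper that is well defined for every possible char value.
bool isPrintable(char c)
{
    return std::isprint(static_cast<unsigned char>(c)) != 0;
}
```

With the Latin-1 byte 0xE6 (æ), `castToUnsignedChar` gives 230, which `isprint` may legally be asked about. On a signed-char platform, `castToUnsigned` gives a value near `UINT_MAX` for the same byte, which is what triggers the debug assertion `(unsigned)(c+1) <= 256`.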

James Kanze
  • switching the method as prescribed fixes `Prusæus` but not `Ægyptians` which still causes a crash. – AnthonyW Apr 16 '12 at 17:39
  • @AnthonyW If `c` has type `char`, and you're on an Intel platform, then casting it to `unsigned char` before calling `isprint` should make that part of the code work. Of course, there is still the problem as to what you mean by ASCII; the definition I'd use is `c >= 0 && c < 128` (but this includes non-printable ASCII like EOT or DEL). – James Kanze Apr 16 '12 at 18:00
  • Yes, that is the char set I am looking for. Unless I am mistaken, `Æ` is not a member, yet it refuses to be removed. Of course, I could be mistaken, in which case I need a different approach. – AnthonyW Apr 16 '12 at 18:10
  • Switched statement to `return !(c>=0 && c <128); ` <-- this removes it. Apparently `Æ` is Extended ASCII Character 146 and falls into the system's check for `<256`. However, even if that is the case, it does not explain the Error Dialog above, which claims `Æ` is outside the range. – AnthonyW Apr 16 '12 at 18:17
  • There must be 2 versions of that char as even with a check for `<256` it is removed. – AnthonyW Apr 16 '12 at 18:59
  • @AnthonyW For the 95 printable ASCII, and the 33 control characters, the encoding is more or less universal (except for EBCDIC on the mainframes); all of the usual encodings use the same codes, in the range 0...127. For anything else (and thus for `æ` and `Æ`), the actual value will depend on the encoding; the value won't be the same in latin 1 as in UTF-8, for example (and in UTF-8, they will use a multibyte encoding). What `isprint` does with them will depend on the locale. – James Kanze Apr 17 '12 at 07:48
  • @AnthonyW With regards to a possible error: if `char` is signed, then it can't contain 146; if you convert it to `int`, the results will be -110. Calling `isprint` with a negative number (other than `EOF`, probably -1) is undefined behavior. Casting it to `unsigned char` results in the -110 being converted to 146, and the following conversion to `int` should preserve this value. What `isprint` returns when passed 146 will depend on the locale, but it should **not** crash. – James Kanze Apr 17 '12 at 07:53
1

isprint depends on the locale, so the character in question must be printable in the current locale.

If you want strictly ASCII, check the range for [0..127]. If you want printable ASCII, check the range and isprint.
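
A minimal sketch of the two checks, assuming the names `isAscii` and `isPrintableAscii` (they are not from any library):

```cpp
#include <cassert>
#include <cctype>

// Strictly ASCII: any of the 128 codes 0..127, control characters included.
bool isAscii(char c)
{
    return static_cast<unsigned char>(c) < 128;
}

// Printable ASCII: in range AND printable. The range check runs first, so
// the locale-dependent isprint is only consulted for codes 0..127.
bool isPrintableAscii(char c)
{
    return isAscii(c) && std::isprint(static_cast<unsigned char>(c)) != 0;
}
```

A tab, for instance, passes `isAscii` but not `isPrintableAscii`, while a byte such as 0xC6 fails both regardless of the current locale.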

Adrian McCarthy
1

Another solution that doesn't require defining two functions; it uses a lambda (anonymous function) instead, which is available in C++11 and later:

void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());  
}

I think it looks cleaner.

Fnr