
So I have this function in a large codebase that checks for invalid characters, and it looks something like this:

// Returns NOT_VALID_STRING if the string contains a forbidden character, VALID_STRING otherwise.
int validateMe(const std::string& myString)
{
  for (std::size_t i = 0; i < myString.length(); i++)
  {
    if ((myString[i] == 0x7E) || ...)  // 0x7E is '~'; the other forbidden characters are elided here
    {
      return NOT_VALID_STRING;
    }
  }
  return VALID_STRING;
}

Before calling validateMe, the string was converted to UTF-8.

Now, this worked fine until it was tested with Chinese characters.

I'm going through http://utf8everywhere.org/, trying to understand everything better, but it's a pretty deep rabbit hole I'm getting into.

I guess I have to somehow find the code points, test whether each one falls in the range where the invalid characters are, and only then look for the invalid characters. But how do I find the code points?
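For reference, a minimal sketch of stepping through a UTF-8 string code point by code point (nextCodePoint is a made-up name; it assumes well-formed UTF-8 and does no validation):

#include <cstdint>
#include <string>

// Decode the code point that starts at byte position i and advance i past it.
// Assumes s is well-formed UTF-8 and i points at a lead byte; no error handling.
std::uint32_t nextCodePoint(const std::string& s, std::size_t& i)
{
  unsigned char lead = static_cast<unsigned char>(s[i]);
  std::uint32_t cp;
  int extra;                                             // continuation bytes to read
  if      (lead < 0x80) { cp = lead;        extra = 0; } // 1-byte sequence (ASCII)
  else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; } // 2-byte sequence
  else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; } // 3-byte sequence
  else                  { cp = lead & 0x07; extra = 3; } // 4-byte sequence
  ++i;
  while (extra-- > 0)
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  return cp;
}

With that, a loop like `for (std::size_t i = 0; i < myString.size(); ) { if (nextCodePoint(myString, i) == 0x7E) return NOT_VALID_STRING; }` compares whole code points rather than single bytes.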

I've read that std::string should be able to handle this, but

myString.find("~") != std::string::npos

fails with Chinese characters, I guess because the first byte of the Chinese character is 0x7E, at least for the ones I've tried.

So, how do I check for invalid characters in a string that could be written in Chinese? Let's assume that by Chinese I mean EUC-CN.

EDIT:

validateMe("testme") should pass

validateMe("test~me") should NOT pass

When the user puts the characters "啊是的发" (that is, the first character offered for each letter of "asdf" in Chinese EUC-CN) through the GUI, the function fails. In fact, it finds "~", i.e. 0x7E. The VS debugger indeed displays the input as å•Šæ˜¯çš„å‘, which has a '~'.
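As a sanity check, here is a minimal sketch (assuming the GUI really hands validateMe a UTF-8 encoded std::string) that dumps the raw bytes and looks for 0x7E directly:

#include <cstdio>
#include <string>

// Print each byte of s in hex and report whether it contains the byte 0x7E ('~').
static void dump(const std::string& s)
{
  for (unsigned char c : s)
    std::printf("%02X ", c);
  std::printf("-> contains '~': %s\n",
              s.find('~') != std::string::npos ? "yes" : "no");
}

int main()
{
  dump("test~me");                                          // 74 65 73 74 7E 6D 65 -> yes
  dump("\xe5\x95\x8a\xe6\x98\xaf\xe7\x9a\x84\xe5\x8f\x91"); // UTF-8 bytes of 啊是的发 -> no
}

If the Chinese input still trips the 0x7E check in the real program, the likely explanation is that the bytes reaching validateMe are not actually UTF-8 (some legacy multi-byte encodings do use 0x7E as a trail byte), so the conversion step before validateMe would be the place to look.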

cauchi
  • "*I've read that std::string should be able to handle this*" Where did you read that `std::string` had any idea what Unicode is, let alone was able to do codepoint manipulation? `string` stores an array of `char`s; that's it. – Nicol Bolas Apr 19 '20 at 16:56
  • here https://stackoverflow.com/questions/50403342/how-do-i-properly-use-stdstring-on-utf-8-in-c – cauchi Apr 19 '20 at 17:08
  • I see nothing in the answers to that question which claims *specifically* that `std::string` automatically handles anything related to Unicode. The answer claims only that `wstring` isn't appropriate/portable and that you can use `string` to store and manipulate UTF-8-encoded data. It never says that it handles it for you. Your search pattern should find any usage of the exact byte sequence you gave it; I don't think Chinese uses the "~" character much, so I'm not sure why you were expecting to find it. – Nicol Bolas Apr 19 '20 at 17:42
  • I don't understand what you want to validate. Can you please demonstrate for what kind of strings your validation function is supposed to fail? `<=0x7F` would match every ASCII character from what I can tell. – walnut Apr 19 '20 at 17:50
  • I did a minor update in the code. Any valid string should not have ~ for example. – cauchi Apr 19 '20 at 19:39
  • You should give a specific example that you think fails with your code. Given that 0x7E does not have the high bit set, it should always be part of a single-byte character in UTF-8 as far as I know. **As written, your question is very poor because we do not have any idea of what you consider to be a failure and it does not even have an example of a case that fails.** – Phil1970 Apr 19 '20 at 20:13
  • @cauchi Both of your code examples look fine now. Looking for the code point `~` in a UTF-8 encoded string bytewise should work fine without any false positives. Multi-byte code points (such as those used for Chinese characters) always have the highest bit in each byte set in UTF-8, which `~` (being an ASCII character) does not have. – walnut Apr 19 '20 at 20:23
  • I updated the question with a case that fails. – cauchi Apr 19 '20 at 21:07
  • @cauchi The debugger is showing the UTF-8 encoded string decoded as codepage 1252 (I guess). The incorrect decoding in the debugger leads to the `˜` character, which is not the same as the `~` character anyway. The UTF-8 encoding of your string is `"\xe5\x95\x8a\xe6\x98\xaf\xe7\x9a\x84"`. There is no `\x7e` (`~`) in there. – walnut Apr 19 '20 at 21:32

1 Answer


You can't use std::string with Unicode characters like Chinese, because std::string only supports ASCII characters. Instead, you can use std::wstring.

  • `std::string` stores an array of `char`. And `char` can hold any UTF-8 code unit. So you can interpret any `string` as being UTF-8 and use/manipulate it with respect to UTF-8. – Nicol Bolas Apr 19 '20 at 17:45
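A minimal illustration of that point, assuming the string literal below really carries UTF-8 bytes (hence the explicit hex escapes):

#include <iostream>
#include <string>

int main()
{
  std::string s = "abc\xe5\x95\x8a";   // "abc" plus the three UTF-8 bytes of 啊
  std::size_t codePoints = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80)            // skip UTF-8 continuation bytes (10xxxxxx)
      ++codePoints;
  std::cout << s.size() << " bytes, " << codePoints << " code points\n";   // prints "6 bytes, 4 code points"
}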