
Given a UTF-8 string, how can I tell whether it contains disallowed characters?

The requirement is that the UTF-8 string may contain only English letters and Chinese characters. Anything else, such as symbols, digits, whitespace, or '\n', is disallowed.

Can std::regex do this job?

bool legal(const std::string& s) { // s is utf8 string
   //??
}
jean
  • Could you give me an example of a Chinese character that *can* be included in a string with UTF-8 encoding? – Bathsheba Jul 20 '17 at 08:51
  • Any Chinese character and any English character are allowed – jean Jul 20 '17 at 08:54
  • You might find it a challenge fitting all Chinese characters into a UTF-8 encoded string. – Bathsheba Jul 20 '17 at 08:55
  • Some other languages, such as Python, can do this, but I don't know whether std::regex can. If it can't, it seems the only way is to check the encoding ranges of Chinese characters. – jean Jul 20 '17 at 08:57
  • You should walk the string, decoding UTF-8 sequences to Unicode code point numbers on the fly, then compare them against your allowed ranges. – Matteo Italia Jul 20 '17 at 08:58

1 Answer

You could convert the std::string to a vector of UTF-32 code points (as described here), then iterate over them and check each one against the allowed ranges. However, I cannot provide the UTF-32 value ranges for Chinese characters, and judging from the comments on your question, that could actually be an issue.

EDIT

As stated in the comment below, if you know that the characters you need to validate fall in the 2-byte range (a single UTF-16 code unit), you could stick with UTF-16.
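
The decode-and-check approach described here and in the comments on the question could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: the helper names `decode_utf8` and `is_allowed` are made up for illustration, and "Chinese character" is assumed to mean the CJK Unified Ideographs block plus Extensions A and B. Adjust the ranges to whatever your definition actually requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes one UTF-8 sequence starting at s[i].
// On success it stores the code point in cp, advances i, and returns true.
// It returns false for malformed, truncated, or overlong sequences.
static bool decode_utf8(const std::string& s, std::size_t& i, std::uint32_t& cp) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    if (b0 < 0x80)                { cp = b0;        len = 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
    else return false;                      // stray continuation or invalid lead byte
    if (len > s.size() - i) return false;   // sequence is cut off
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Smallest code point that legitimately needs `len` bytes (rejects overlong forms).
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF) return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogates are not valid
    i += len;
    return true;
}

// Assumed whitelist: ASCII letters plus a few CJK ideograph blocks.
static bool is_allowed(std::uint32_t cp) {
    return (cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z')
        || (cp >= 0x4E00  && cp <= 0x9FFF)    // CJK Unified Ideographs
        || (cp >= 0x3400  && cp <= 0x4DBF)    // Extension A
        || (cp >= 0x20000 && cp <= 0x2A6DF);  // Extension B
}

bool legal(const std::string& s) {  // s is assumed to be UTF-8
    std::size_t i = 0;
    while (i < s.size()) {
        std::uint32_t cp = 0;
        if (!decode_utf8(s, i, cp) || !is_allowed(cp)) return false;
    }
    return true;  // note: an empty string passes this check
}
```

Nothing is converted up front here: each code point is decoded and tested as the string is walked, so the function can stop at the first disallowed character. Malformed UTF-8 is treated as illegal input, which is a design choice of this sketch rather than something the question specifies.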

Rudolfs Bundulis
  • I *thought* you'd tend to use UTF-16 and `std::wstring`. – Bathsheba Jul 20 '17 at 09:05
  • @Bathsheba as I stated in my response, I am not very familiar with the actual code ranges, but a quick look at https://stackoverflow.com/questions/9166130/what-are-the-upper-and-lower-bound-for-chinese-char-in-utf-8 shows that the Chinese characters actually go beyond the 2-byte range; thus, purely for ease, iterating over UTF-32 code points would be more generic. – Rudolfs Bundulis Jul 20 '17 at 09:10
  • It isn't even necessary to actually convert the whole string; iterating over code points in a UTF-8 string is quite easy (actually, it's trivial using a library such as UTF-8 cpp). – Matteo Italia Jul 20 '17 at 15:02