How to search a non-ASCII character in a c++ string?

Question

string s="x1→(y1⊕y2)∧z3";

for(auto i=s.begin(); i!=s.end();i++){
    if(*i=='→'){
       ...
    }
}

The char comparison is definitely wrong. What's the correct way to do it? I am using VS2013.

I am pretty sure you are going to need to use wide characters to do this — NathanOliver, Jul 14 '15 at 01:32
You need to decide what character size and encoding is being used. — Hot Licks, Jul 14 '15 at 01:47
@NathanOliver: I tried wstring, but I have found that sizeof('→') is 4, while wchar is 2 byte, so wstring perhaps doesn't work. — yangwenjin, Jul 14 '15 at 01:49
@yangwenjin and sizeof('a') is 1 (for C++). `'→'` is not a valid character constant, as it is not a character which can be represented in 1 `char`. And also, `((char) '→' == (char) 'ʒ')` might be **true**. — roeland, Jul 14 '15 at 01:54
I also have VS2013 and sizeof('→') gives me 1, not 4. I also get a warning about the constant. When I use L'→' (wide character) the warning is gone, so that's probably the way to go. — user1610015, Jul 14 '15 at 02:12
`wchar_t` will have the same problem for Unicode characters above U+FFFF. — roeland, Jul 14 '15 at 02:17
Yeah but → is not above FFFF. Also the compiler will warn whenever you try to fit a character into a constant that can't hold the character. — user1610015, Jul 14 '15 at 02:30
@chris it's not required that the compiler support utf-8 encoded source — M.M, Jul 14 '15 at 05:12
@MattMcNabb, Of course, but it's not required that you use wide characters. UTF-8 is just one example of an alternative, but I'd expect the use of a proper library with it. — chris, Jul 14 '15 at 05:53
@roeland: That depends a lot on the compiler. If `wchar_t` is Unicode, it won't be a problem. (and sizeof will be 4). However, on VC++ it's not actually Unicode. This is because historically they've set `sizeof(wchar_t)` too low and they don't dare increase it. — MSalters, Jul 14 '15 at 07:32
What you need to do is write a little routine which iterates through your string and prints out the numeric value of each character. Then you will better understand what's going on (and be better prepared to debug it). — Hot Licks, Jul 14 '15 at 12:32
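
A minimal sketch of the byte-dumping routine suggested in the last comment, assuming the string holds UTF-8. The name dump_bytes is hypothetical, not something from the thread:

```cpp
#include <string>

// Returns the bytes of `s` as space-separated two-digit hex values.
// dump_bytes is a hypothetical helper name used only for illustration.
std::string dump_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {   // unsigned char keeps bytes >= 0x80 non-negative
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0f];  // low nibble
    }
    return out;
}
```

Printing dump_bytes(s) for the string in the question shows whether the arrow really arrived as the three bytes e2 86 92 or as something else.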

score 3 · Answer 1 · edited May 23 '17 at 12:06

First you need some basic understanding of how programs handle Unicode. If you don't have that yet, you should read up on it; I quite like this post on Joel on Software.

You actually have 2 problems here:

Problem #1: getting the string into your program

Your first problem is getting that actual string into your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.

Either save your C++ file as UTF-16 (which Windows confusingly calls Unicode) and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. With any other encoding, your L"..." string literals will contain the wrong characters.

Note that other platforms may define wchar_t as 4 bytes instead of 2, so the handling of characters above U+FFFF will be non-portable.

Or, in all other cases, accept that you can't just write those characters in your source file. The most portable way is to encode your string literals as UTF-8, using \x escape codes for all non-ASCII characters, like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". The literal is split before "b" because a \x escape consumes every hex digit that follows it, and b is a hex digit.

And yes, that's as unreadable and cumbersome as it gets. The root problem is that MSVC doesn't really support UTF-8. For an overview, see this question: How to create a UTF-8 string literal in Visual C++ 2008.

But also consider how often those strings will actually show up in your source code.
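
As a rough sketch of the \x approach, here is the expression from the question written with UTF-8 escapes only. The function names are hypothetical, and the byte values assume UTF-8 encodings of U+2192, U+2295 and U+2227:

```cpp
#include <string>

// The expression from the question with every non-ASCII character escaped:
//   U+2192 (arrow)        -> e2 86 92
//   U+2295 (circled plus) -> e2 8a 95
//   U+2227 (logical and)  -> e2 88 a7
// make_expression is a hypothetical name used only for illustration.
std::string make_expression() {
    return "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";
}

// When a hex digit (0-9, a-f, A-F) follows an escape, the literal has to be
// split, because "\x95b" would be read as one oversized escape.
std::string circled_plus_then_b() {
    return "\xe2\x8a\x95" "b";
}
```

The source file stays pure ASCII, so the result does not depend on how the compiler guesses the file's encoding.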

Problem #2: finding the character

(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
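
A minimal sketch of the wide-character search, assuming the arrow is written as the universal character name \u2192 so the result does not depend on the source file's encoding. find_arrow_wide is a hypothetical name:

```cpp
#include <string>

// Returns the index (in wchar_t units) of the first U+2192 arrow in `s`,
// or std::wstring::npos if there is none.
// find_arrow_wide is a hypothetical helper used only for illustration.
std::wstring::size_type find_arrow_wide(const std::wstring& s) {
    return s.find(L'\u2192');
}
```

For the question's expression written as L"x1\u2192(y1\u2295y2)\u2227z3", the arrow is the third wchar_t, so the returned index is 2.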

It's impossible to define a char representing the arrow character. You can, however, define a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).

You can now search for this string in your expression:

s.find("\xe2\x86\x92");

The UTF-8 encoding scheme guarantees that this never matches in the middle of another character, so it always finds the correct one. Keep in mind, though, that the value returned is an offset in bytes, not in characters.
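
If a character position is needed rather than a byte offset, one possible sketch is to count the code points that start before that offset. This assumes the string is valid UTF-8, and code_point_index is a hypothetical helper name:

```cpp
#include <string>

// Counts the code points that begin within the first `byte_offset` bytes of
// the UTF-8 string `s`. Continuation bytes have the bit pattern 10xxxxxx and
// are skipped. Assumes `s` is valid UTF-8; no validation is performed.
std::string::size_type code_point_index(const std::string& s,
                                        std::string::size_type byte_offset) {
    std::string::size_type count = 0;
    for (std::string::size_type i = 0; i < byte_offset && i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For a string holding U+2295 followed by U+2192, s.find("\xe2\x86\x92") returns byte offset 3, and code_point_index(s, 3) gives 1, the zero-based code point index of the arrow.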

std::string doesn't deal well with UTF-8. If you call s.find(...) and there are other multi-byte characters preceding what you searched for, it will give you an incorrect index. Really you are just complicating this too much. The OP's code can work as-is if it's simply converted to use wide characters. — user1610015, Jul 14 '15 at 03:25
@user1610015 Depends on whether you want to compile your application on other platforms, and on whether you can let it break on astral characters (characters above U+FFFF). A `wchar_t` on Windows **cannot** encode every Unicode character. — roeland, Jul 14 '15 at 03:41
@user1610015 And, using wide chars doesn't solve problem #1. You may get surprised as to what would actually end up in your string if you just write `L"a→b"`. — roeland, Jul 14 '15 at 04:03
I don't know what you mean. L"a→b" contains the characters a, →, and b. Yes it may not work on rare platforms, but that is the case with almost anything in C++. And Visual Studio has a warning for when the character cannot be encoded, so there is no "surprise" possible. — user1610015, Jul 14 '15 at 04:07
It contains the characters a, →, and b *only if* the compiler knows the encoding of the source file. Microsoft Visual Studio for that reason always uses the BOM when saving UTF-8. However, on Linux, C++ files are commonly saved as UTF-8 without BOM (I think g++ used to choke on that BOM). If you compile such a file with Visual Studio, it assumes ANSI encoding, and then your string will contain (on systems set to English) `L"aâ\u0086\u0092b"`. — roeland, Jul 14 '15 at 04:40

score 1 · Answer 2 · answered Jul 14 '15 at 04:33

My comment is too large, so I am submitting it as an answer.

The problem is that everybody is concentrating on the different encodings that Unicode may use (UTF-8, UTF-16, UCS2, etc.). But that is only where your problems begin.

There is also the issue of composite characters, which will really mess up any search that you are trying to make.

Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain the combination U+0065 U+0301, which is exactly the same character.

Yes, not just a "character that looks the same": it is exactly the same, so software and even some programming libraries will freely convert from one form to the other without telling you.

So if you want a search that is robust, you will need something that represents not just the different encodings of Unicode but Unicode characters themselves, treating composite and ready-made (precomposed) characters as equal.
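
A small sketch of how a plain byte search misses the composite form, assuming both forms are stored as UTF-8. The names precomposed, decomposed and contains are hypothetical:

```cpp
#include <string>

// Two UTF-8 spellings of the same character.
const std::string precomposed = "\xc3\xa9";   // U+00E9
const std::string decomposed  = "e\xcc\x81";  // U+0065 followed by U+0301

// Plain byte-wise substring test; it knows nothing about Unicode equivalence.
// contains is a hypothetical helper used only for illustration.
bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Searching a text that uses the decomposed form for the precomposed bytes finds nothing, even though both render as the same character. A robust search would normalize both the text and the search term to one form first, for example with a Unicode library such as ICU.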

Yes, good point. Additionally if you find the arrow in [ U+0065 U+0301 *U+2192* ], you'll find it's the third code point, but when displayed it's the second glyph. — roeland, Jul 14 '15 at 05:03
