This question is an extension of "Do C++11 regular expressions work with UTF-8 strings?"

#include <iostream>
#include <regex>

if (std::regex_match("中", std::regex("中")))  // "\u4e2d" also works
  std::cout << "matched\n";

The program is compiled on Mac Mountain Lion with clang++ with the following options:

clang++ -std=c++0x -stdlib=libc++

The code above works. The following is a standard range regex, "[一-龠々〆ヵヶ]", for matching any Japanese Kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even with the equivalent escaped form [\u4E00-\u9fa0]. The code below does not match the string.

if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
  std::cout << "range matched\n";

Changing locale hasn't helped either. Any ideas?

EDIT

So I have found that all ranges work if you add a + to the end, in this case [一-龠々〆ヵヶ]+, but if you add {1} instead, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は, which is \u306f, and ぁ, which is \u3041. They both lie below \u4E00.

nhahtdh also suggested regex_search, which works without adding +, but it runs into the same problem as above by pulling in values outside of the range. I played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes; I think this is possibly what it is doing.

Further supporting the theory that the UTF-8 is getting jumbled somehow: [a-z]{1} and [a-z]+ both match a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.

MCH
  • What is the compiler? – nhahtdh Apr 08 '13 at 15:23
  • clang++ -std=c++0x -stdlib=libc++ on Mac Mountain Lion – MCH Apr 08 '13 at 15:40
  • Some experimentation and I have found a solution, add `+` to the end of the range – MCH Apr 08 '13 at 15:52
  • What's your locale? If you use the default it probably treats the UTF-8 string as a collection of dumb bytes and the multi-byte sequences will be split into pieces. – Mark Ransom Apr 08 '13 at 16:30
  • I suspect it might be something like that. I have tried setting the global locale with `std::locale::global(std::locale("ja_JP.UTF-8"));` and imbuing with `imbue (std::locale("ja_JP.UTF-8"));`, with the same results as I show in the EDIT section. I also tried ja_JP, ja_JP.eucJP, and ja_JP.SJIS. – MCH Apr 08 '13 at 16:43
  • `std::string` is a **byte** string. Why should multi-byte characters work? Use a library such as [ogonek](https://bitbucket.org/martinhofernandes/ogonek) if you want to work with Unicode characters. – Konrad Rudolph Apr 08 '13 at 18:00
  • @KonradRudolph, it appears on Linux `std::string` usually is used to store UTF8, and it's recommended to do so on windows too: http://www.utf8everywhere.org/ – Qtax Apr 08 '13 at 20:55
  • @Qtax Right, but we’re doing more than *storing* here, we’re *manipulating* (or at least *analysing*) the string. And `std::regex` simply analyses the underlying code units, and if those happen to be bytes then it handles bytes. That’s fine as long as something that we want to treat as a unit doesn’t exceed a byte. The “UTF8 everywhere” advice is good but applies only to transparent storage of strings (which is enough most of the time) when you just retrieve the string from one point and pass it on to another, without doing anything else with it. – Konrad Rudolph Apr 08 '13 at 21:02
  • Just as a small update: The library I mentioned earlier – Ogonek – isn’t actually production ready. The state of the art library is [ICU](http://site.icu-project.org/) but the interface was unfortunately designed by a masochist … once Ogonek *is* ready we’ll have a proper library. – Konrad Rudolph Apr 08 '13 at 21:53
  • Thanks Konrad. Since I am just determining whether each glyph is Kanji, Katakana, Hiragana or Latin, I decided to just check whether the Unicode value is in a certain range. While not as pretty as using regex it works well and I don't need any outside dependencies. It's a shame that C++11 cannot handle it properly on its own. If I do need regex in the future I will definitely check out Ogonek and ICU. – MCH Apr 09 '13 at 02:45
  • @KonradRudolph Arguably Ogonek is also being designed by a masochist - our beloved Robot. The point you might want to make is that the design is/will be less _sadist_ instead :) – sehe Apr 09 '13 at 11:09

1 Answer

Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.

The character class you are looking for is the one that includes:

  • any character in the range U+4E00..U+9FA0; or
  • any of the characters 々, 〆, ヵ, ヶ.

The character class you specified is the one that includes:

  • any of the "characters" \xe4 or \xb8; or
  • any "character" in the range \x80..\xe9; or
  • any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.

Messy isn't it? Do you see the problem?

This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.

It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.

If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.

If you instead add {1} it does not match because we are back to matching three "characters" against one.

Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.

Note that the regex with + will actually match some undesired things, because it does not care about order. Any string that can be made from that list of bytes will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".

The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.

And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?

R"(\p{Script=Han})"

Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.

So what should you do?

You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".

I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.

R. Martinho Fernandes
  • Thank you Martinho, that is a very informative post. It gives me an even better understanding of UTF-8 and regular expressions. Anyway, I abandoned regex in this project since I only need to know whether a glyph belongs to a particular range and then tag it with that range, so hardcoding is a quick and easy solution. I thought regexes would be a simple and elegant solution, but I found that that does not hold for C++11. – MCH Apr 09 '13 at 07:52
  • @MCH Yeah, I guess grabbing ICU for one tiny little match might be too much. If you want to use Unicode and regexes, Perl is pretty much the only language that takes it seriously. It is a sad state of affairs, but it's what we have. Personally I think `<regex>` is some more garbage in the stdlib. It's 2013 and pretending Unicode does not exist is facetious and only contributes to this idea that dealing with Unicode is too painful to care (hint: if your hammer does not have a head, you will have a hard time driving nails). – R. Martinho Fernandes Apr 09 '13 at 07:54
  • Use wregex instead and either use http://utfcpp.sourceforge.net/ or prefix your strings with 'L'. –  Apr 12 '13 at 15:34
  • I thought Go was taking it seriously, at least UTF-8, http://golang.org/pkg/regexp. – oblitum Apr 15 '13 at 22:57
  • @chico nice shout. I haven't tried re in Go, so maybe I should give it a try :) – R. Martinho Fernandes Apr 16 '13 at 07:46
  • @R.MartinhoFernandes what are the flaws in Python 3's Unicode regex support? – Bob Kline Jun 11 '20 at 12:41