
How can I match UTF-8 Unicode characters using boost::spirit?

For example, I want to recognize all characters in this string:

$ echo "На берегу пустынных волн" | ./a.out
Н а б е р е г у п у с т ы н н ы х в о л н

When I try this simple boost::spirit program, it does not match the Unicode characters correctly:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <iostream>  // std::cin, std::cout
#include <vector>    // std::vector
namespace qi = boost::spirit::qi;

int main() {
  std::cin.unsetf(std::ios::skipws);
  boost::spirit::istream_iterator begin(std::cin);
  boost::spirit::istream_iterator end;

  std::vector<char> letters;
  bool result = qi::phrase_parse(
      begin, end,  // input     
      +qi::char_,  // match every character
      qi::space,   // skip whitespace 
      letters);    // result    

  BOOST_FOREACH(char letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

It behaves like this:

$ echo "На берегу пустынных волн" | ./a.out | less
<D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> 
<B2> <D0> <BE> <D0> <BB> <D0> <BD> 

UPDATE:

Okay, I worked on this a bit more, and the following code sort of works. It first wraps the UTF-8 input in an iterator that yields 32-bit Unicode code points (as recommended here):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>  // std::cout
#include <string>    // std::string
#include <vector>    // std::vector
namespace qi = boost::spirit::qi;

int main() {
  std::string str = "На берегу пустынных волн";
  boost::u8_to_u32_iterator<std::string::const_iterator>
      begin(str.begin()), end(str.end());
  typedef boost::uint32_t uchar; // a unicode code point
  std::vector<uchar> letters;
  bool result = qi::phrase_parse(
      begin, end,                 // input
      +qi::standard_wide::char_,  // match every character
      qi::standard_wide::space,   // skip whitespace (same encoding as char_)
      letters);                   // result
  BOOST_FOREACH(uchar letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

The code prints the Unicode code points:

$ ./a.out 
1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085 

which seems to be correct, according to the official Unicode table.

Now, can anyone tell me how to print the actual characters instead, given this vector of Unicode code points?
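One way to get from code points back to printable text is to re-encode them as UTF-8 before writing them out. The following is only a minimal sketch, assuming the terminal expects UTF-8; `append_utf8` and `to_utf8` are hypothetical helper names, not part of Boost or the standard library, and they do no validation. (The same `unicode_iterator.hpp` header also provides `boost::u32_to_u8_iterator`, which performs this conversion as an iterator adaptor.)

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
// Assumes `cp` is a valid Unicode scalar value (no surrogate or range checks).
void append_utf8(std::uint32_t cp, std::string& out) {
  if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: encode a whole sequence of code points as UTF-8.
std::string to_utf8(const std::vector<std::uint32_t>& code_points) {
  std::string out;
  for (std::size_t i = 0; i < code_points.size(); ++i) {
    append_utf8(code_points[i], out);
  }
  return out;
}
```

With the `letters` vector from the program above, `std::cout << to_utf8(letters)` should write the letters back as UTF-8 (without the skipped spaces), assuming `boost::uint32_t` and `std::uint32_t` name the same type, as they do on common platforms.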

Frank
  • I found that it may be possible using the boost regex unicode iterators, which convert the utf8 input to utf32 code points (http://comments.gmane.org/gmane.comp.parsers.spirit.general/23490), and I'm trying to figure out how this works... Any help is appreciated. – Frank May 06 '12 at 23:38
  • Also, elements from namespace `boost::spirit::unicode` are used here (http://boost-spirit.com/dl_more/scheme/scheme_v0.2/sexpr.hpp), but I don't know what Spirit version this needs. Mine is from boost 1.49, and it doesn't have `boost::spirit::unicode`. – Frank May 06 '12 at 23:55
  • The boost::spirit::unicode namespace is available when the BOOST_SPIRIT_UNICODE macro is defined before including any Boost.Spirit header file: `#define BOOST_SPIRIT_UNICODE` – Denis Arnaud Sep 15 '12 at 21:51

3 Answers


I haven't got much experience with it, but apparently Spirit (SVN trunk version) supports Unicode.

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, e.g. the sexpr parser sample which is in the scheme demo.

BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach [1], which specifically showcases:

  • wchar support
  • utree attributes (still experimental)
  • s-expressions

There is an online article about S-expressions and variant.


[1] If it is indeed that presentation, here is the video and the slides (pdf), as found here (odp).

Frank
sehe
  • Thanks, I've seen that example (see my 2nd comment above). It is not available in Boost 1.49, but I'll check out the latest SVN version of boost::spirit. – Frank May 07 '12 at 15:24
  • (Modified answer text to show that it's available in the SVN trunk version, as opposed to the official Boost downloads.) – Frank May 08 '12 at 20:45
3

In Boost 1.58 I can match any Unicode symbol with this:

*boost::spirit::qi::unicode::char_

I don't know how to define a specific range of Unicode symbols.

Sergey
2

You can't. The problem is not in boost::spirit but in the fact that Unicode is complicated. A char is not a character; it is a byte. And even if you work at the code point level, a user-perceived character may still be represented by more than one code point (e.g. пусты́нных is 9 characters but 10 code points). This may not be obvious in Russian because it doesn't use diacritics extensively, but other languages do.
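To make the byte versus code point distinction concrete, here is a small sketch that counts both for a UTF-8 string. `count_code_points` is a hypothetical helper, and it assumes the input is valid UTF-8; it does not detect grapheme clusters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string by counting every
// byte that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes `utf8` holds valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    const unsigned char byte = static_cast<unsigned char>(utf8[i]);
    if ((byte & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}
```

For the word above, encoded with a combining acute accent (U+0301) after the first ы, `size()` reports 20 bytes and `count_code_points` reports 10 code points, while a reader sees 9 characters.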

To actually iterate over user-perceived characters (grapheme clusters in Unicode terminology), you'll need a specialized Unicode library, namely ICU.

However, what is the real-world use of iterating over the characters?

Yakov Galka
  • I want to build a parser that builds an AST from a regex that is provided as string input. So what I need to parse may look like this, for example, "ʉ*[a-ɧ]+". I'm fine with using ICU, as long as it somehow works with `boost::spirit`. – Frank May 06 '12 at 23:37
  • @Frank: But it's nonsense! What would a-ɧ mean in Unicode? And א-я? – Yakov Galka May 07 '12 at 06:58
  • It's not nonsense. Each Unicode character has a code point, e.g., 'a' has U+0061 (=97) and ɧ has U+0267 (=615). So the range "[a-ɧ]" means a character with code point >=97 and <=615. – Frank May 07 '12 at 16:25
  • It was about time this answer got an upvote. I'm stymied how I failed to notice this one before. – sehe Apr 16 '13 at 20:08
  • @Frank: it's nonsense because there is no linguistic meaning in "all the characters between a and ɧ". In general, Unicode+Regexes is an incorrect interpretation of reality. What does work though, is *boundaries*. – Yakov Galka Apr 17 '13 at 07:24
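
Frank's reading of "[a-ɧ]" in the comments above is a purely numeric test on code points. A minimal sketch of that interpretation follows; `in_code_point_range` is a hypothetical name, and the function says nothing about whether the range is linguistically meaningful, which is the objection raised in the thread.

```cpp
// Hypothetical helper: true when `cp` lies in the inclusive code point
// range [lo, hi]. This is a plain numeric comparison, nothing more.
constexpr bool in_code_point_range(char32_t cp, char32_t lo, char32_t hi) {
  return lo <= cp && cp <= hi;
}

// The range "[a-ɧ]" read as U+0061 (97) through U+0267 (615).
constexpr bool in_a_to_heng_hook(char32_t cp) {
  return in_code_point_range(cp, U'a', U'\u0267');
}
```

Under this reading, `in_a_to_heng_hook(U'z')` is true, while the Cyrillic letter а (U+0430, 1072) falls outside the range.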