
I'm trying to parse LaTeX escape codes (e.g. \alpha) to the Unicode (Mathematical) characters (i.e. U+1D6FC).

Right now this means I am using this symbols parser (rule):

struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
  // An in-class constructor must not be qualified with the class name;
  // GCC and Clang reject the extra qualification.
  greek_lower_case_letters_()
  {
    add("alpha",   U'\u03B1');
  }
} greek_lower_case_letter;

This works fine, but it means I get a std::u32string as the result. I'd like an elegant way to keep the Unicode code points in the code, both for maintenance reasons and for possible future automation. Is there a way to get this kind of parser to parse into a UTF-8 std::string?

I thought of making the symbols struct parse to a std::string, but that would be highly inefficient (I know, premature optimization bla bla).

I was hoping there was some elegant way instead of going through a bunch of hoops to get this working (symbols appending strings to the result).

I do fear, though, that using the code point values and wanting UTF-8 will incur a runtime cost for the conversion (or is a constexpr UTF-32 -> UTF-8 conversion possible?).

ildjarn
rubenvb

1 Answer


The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in utf8 encoding:

  auto push_utf8 = [](auto& ctx)
  {
     typedef std::back_insert_iterator<std::string> insert_iter;
     insert_iter out_iter(_val(ctx));
     boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
     *utf8_iter++ = _attr(ctx);
  };

  // ...

  auto const escape =
         ('u' > hex4)           [push_utf8]
     |   char_("\"\\/bfnrt")    [push_esc]
     ;

This is used in their

typedef x3::rule<unicode_string_class, std::string> unicode_string_type;

Which, as you can see, builds the UTF-8 sequence into a std::string attribute.

See for full code: https://github.com/cierelabs/json_spirit/blob/x3_devel/ciere/json/parser/x3_grammar_def.hpp
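
To make concrete which bytes end up in the std::string, here is a minimal Boost-free sketch of the encoding step. `append_utf8` is a hypothetical helper written for illustration; it is not part of Spirit or json_spirit, and it only stands in for what `boost::utf8_output_iterator` does in the action above. It assumes the input is a valid code point and does no checking for surrogates or out-of-range values.

```cpp
#include <string>

// Hypothetical helper (illustration only): append one Unicode code point
// to `out` encoded as UTF-8. Assumes `cp` is a valid scalar value.
inline void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

In principle a semantic action could call something like `append_utf8(_val(ctx), _attr(ctx))` when the rule's attribute is a std::string, but I have not tested that against the grammar linked above.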

sehe
  • 374,641
  • 47
  • 450
  • 633
  • I decided using `std::string` as symbol key/value, and I'm trying to get the `char_` rule to work as a sequence using the `repeat` directive. Comparison of the UTF8 and UTF32 version [here](http://coliru.stacked-crooked.com/a/47a50fdbec15cd31). I don't understand why the second version fails after the first `\alpha`. – rubenvb Dec 19 '15 at 15:34
  • @rubenvb I'll look at that later tonight. – sehe Dec 19 '15 at 15:36
  • @rubenvb interestingly, in my tests, the _first_ version failed after the first `'a'`. It has to do with attribute propagation; if the `symbols` yields the same type (std::string) as the enclosing, it gets _assigned_ instead of _appended_ (I feel this is a bug). So, instead, I'd use `std::vector` as the attribute, and it works correctly. Here's some cleaned up code: http://coliru.stacked-crooked.com/a/b9555dfd246b5252 (note the `reinterpret_cast<>` business looked wrong, I changed it). – sehe Dec 19 '15 at 16:43
  • @rubenvb maybe you should post this as a separate question. I'll try to remember to ask on the mailing list about this behaviour. The live stream is here: https://www.livecoding.tv/video/debugging-utf8utf32-propagation-in-spirit-x3/ (first part missing due to technical problems) – sehe Dec 19 '15 at 16:43
  • I ended up choosing a user-defined-string-literal that creates a `std::array`. Avoids this maybe-bug, is (in principle) a compile time codepoint->UTF8 conversion, and can be extended to composed characters without much fuss. The code I ended up with (for now) is [here](https://github.com/rubenvb/spiritoflatex/commit/cc09da58209e48085801b61e10b563076726323e#diff-a972c0fa0e7d97b40660016c42e50d38). I'm going to parse this to some AST representation, from which I'll synthesize some limited form of Qt's supported HTML for starters. Thanks for the insight though. – rubenvb Dec 21 '15 at 08:55
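
For readers wondering what a compile-time code point to UTF-8 literal along those lines might look like, here is a rough sketch assuming C++14 or later. It is not the code from the linked commit: `utf8_char`, `to_utf8` and the `_utf8` suffix are hypothetical names invented for this example, and no validation of the code point is attempted.

```cpp
#include <array>
#include <cstddef>

// Hypothetical type: up to four UTF-8 code units plus how many are used.
struct utf8_char
{
    std::array<char, 4> bytes;
    std::size_t size;
};

// Hypothetical constexpr encoder. Each branch returns a fully initialized
// aggregate, so no mutation of the array is needed inside the function.
constexpr utf8_char to_utf8(char32_t cp)
{
    if (cp < 0x80)
        return {{static_cast<char>(cp), '\0', '\0', '\0'}, 1};
    if (cp < 0x800)
        return {{static_cast<char>(0xC0 | (cp >> 6)),
                 static_cast<char>(0x80 | (cp & 0x3F)),
                 '\0', '\0'}, 2};
    if (cp < 0x10000)
        return {{static_cast<char>(0xE0 | (cp >> 12)),
                 static_cast<char>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<char>(0x80 | (cp & 0x3F)),
                 '\0'}, 3};
    return {{static_cast<char>(0xF0 | (cp >> 18)),
             static_cast<char>(0x80 | ((cp >> 12) & 0x3F)),
             static_cast<char>(0x80 | ((cp >> 6) & 0x3F)),
             static_cast<char>(0x80 | (cp & 0x3F))}, 4};
}

// Hypothetical literal suffix: U'\u03B1'_utf8 yields the encoded bytes.
constexpr utf8_char operator""_utf8(char32_t cp)
{
    return to_utf8(cp);
}

// The conversion happens at compile time, so the code point stays in the source.
constexpr utf8_char alpha = U'\u03B1'_utf8;
static_assert(alpha.size == 2, "U+03B1 encodes to two UTF-8 bytes");
```

A symbols table could then store `std::string(alpha.bytes.data(), alpha.size)` as its value, which keeps the readable code points in the source while paying for the encoding at compile time rather than at parse time.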