
So I've been working with the Boost Spirit Compiler tutorial. Currently, it works great with integers. I am working on a way to extend it to handle strings. Here is the link to the source code.

http://www.boost.org/doc/libs/1_57_0/libs/spirit/example/qi/compiler_tutorial/mini_c/

For those familiar with Boost, the following should look familiar - it is a production rule for a primary expression:

 primary_expr =
            uint_
        |   function_call
        |   identifier
        |   bool_
        |   '(' > expr > ')'
        ;

uint_ is what lets a primary_expr match an unsigned integer. Normally, we could add some simple functionality for a char or a string by either creating a few more production rules, or else a simple text parser using a regex that identifies quotes or something like that. There are plenty of examples if you go up to the root of the link above.

The real problem comes with the fact that to implement the compiler, the code pushes bytecode operations into a vector. It's trivial to push a single char here, since all chars have an accompanying ASCII code that it will be implicitly converted to, but not the case for an array of chars, since they would lose their context in the process as part of a larger string (that forms a sentence, eg).

The best option I can come up with is to change the

vector<int>

to

vector<uintptr_t> 

From my understanding, this type of pointer can point to both integers and chars. Though, it's not simply a matter of changing the 'uint_' to 'uintptr_t' within the above production rule. The compiler tells me that it's an illegal use in this particular instance.

By the way, you will see the implementation of our vector holding the bytecode within the compiler.cpp/.hpp files.

Any help would be appreciated, and if you need any more information, please ask. Thanks.

Dylan_Larkin

1 Answer


Normally, we could add some simple functionality for a char or a string by either creating a few more production rules, or else a simple text parser using a regex that identifies quotes or something like that

Regex is not supported. You can use a subset of regular expression syntax in Boost Spirit Lex patterns (which can be used in token_def) but that would complicate the picture considerably.

The real problem comes with the fact that to implement the compiler, the code pushes bytecode operations into a vector. It's trivial to push a single char here, since all chars have an accompanying ASCII code that it will be implicitly converted to, but not the case for an array of chars, since they would lose their context in the process as part of a larger string (that forms a sentence, eg).

In jargon: the AST doesn't accommodate non-integral values.

The simplest way would be to extend the AST for an operand:

typedef boost::variant<
        nil
      , bool
      , unsigned int
      , identifier
      , std::string                          // ADDED
      , boost::recursive_wrapper<unary>
      , boost::recursive_wrapper<function_call>
      , boost::recursive_wrapper<expression>
    >
operand;

(Note: this is also the type of attribute exposed by primary_expr and unary_expr)
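To illustrate why adding std::string to the variant means every visitor needs a matching overload, here is a minimal standalone sketch. It is a stand-in under stated assumptions, not the tutorial's code: it uses C++17 std::variant instead of boost::variant so it compiles without Boost, it models only the leaf alternatives, and the names describe and describe_operand are hypothetical.

```cpp
#include <string>
#include <variant>

// Simplified stand-in for the tutorial's operand type (assumption: the
// recursive alternatives unary, function_call and expression are left out).
struct nil {};
struct identifier { std::string name; };

using operand = std::variant<nil, bool, unsigned int, identifier, std::string>;

// Hypothetical visitor with one overload per alternative, analogous to the
// operator() overloads of the tutorial's compiler. Without the std::string
// overload, visiting an operand would no longer compile.
struct describe {
    std::string operator()(nil) const { return "nil"; }
    std::string operator()(bool b) const { return b ? "bool:true" : "bool:false"; }
    std::string operator()(unsigned int n) const { return "uint:" + std::to_string(n); }
    std::string operator()(identifier const& id) const { return "identifier:" + id.name; }
    std::string operator()(std::string const& s) const { return "string:" + s; }
};

// Dispatch on whichever alternative the operand currently holds.
inline std::string describe_operand(operand const& op) {
    return std::visit(describe{}, op);
}
```

With boost::variant the same pattern would presumably use boost::static_visitor and boost::apply_visitor in place of std::visit.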

Now let's extend the rules:

    quoted_string = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';

    primary_expr =
            uint_
        |   function_call
        |   identifier
        |   quoted_string
        |   bool_
        |   ('(' > expr > ')')
        ;

Note that we declared quoted_string without a skipper so we don't have to do the lexeme[] incantation (see "Boost spirit skipper issues").
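To make the behaviour of the quoted_string rule concrete, here is a hand-written approximation in plain C++. It is not Spirit code; the function name parse_quoted_string is hypothetical and the sketch only mirrors what I would expect the rule to match and which characters end up in its std::string attribute: the backslash itself is dropped, the escaped character is kept, and whitespace inside the quotes is preserved because no skipper is involved.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hand-written approximation of:
//   quoted_string = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';
// Parses a quoted string at the start of `in`. Returns the contents on
// success and std::nullopt if the input does not match.
inline std::optional<std::string> parse_quoted_string(std::string const& in) {
    std::size_t i = 0;
    if (i >= in.size() || in[i] != '"') return std::nullopt;  // opening '"'
    ++i;

    std::string out;
    while (i < in.size()) {
        if (in[i] == '\\' && i + 1 < in.size()) {  // '\\' >> char_
            out += in[i + 1];                      // keep only the escaped char
            i += 2;
        } else if (in[i] != '"') {                 // ~char_('"')
            out += in[i];
            ++i;
        } else {
            break;                                 // unescaped quote ends the loop
        }
    }

    if (i >= in.size() || in[i] != '"') return std::nullopt;  // closing '"'
    return out;
}
```

For example, the input "a\"b" (including the outer quotes) yields the three characters a"b, while an unterminated string yields no result.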


Compiler support

Next, when compiling, it turns out the compiler visitor doesn't know about strings yet. So, we add

    op_string,      //  push constant string into the stack

and

bool compiler::operator()(std::string const& x)
{
    BOOST_ASSERT(current != 0);
    current->op(op_string, x);
    return true;
}

in the respective places.
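The question worried that the characters of a string would lose their context inside a vector of ints. One possible way around that, sketched below, is a length-prefixed encoding: the opcode, then the length, then one int per character. This is only an assumption about how an op(op_string, x) overload could lay out the data; the enum values and the functions emit_string and read_string are hypothetical and not taken from the tutorial or from the answer's actual implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical opcode values, for illustration only.
enum byte_code { op_int = 1, op_string = 2 };

// Append a string constant to the code: opcode, length, then the characters.
inline void emit_string(std::vector<int>& code, std::string const& s) {
    code.push_back(op_string);
    code.push_back(static_cast<int>(s.size()));
    for (unsigned char c : s) code.push_back(c);  // one int per character
}

// Read back the string constant whose op_string opcode sits at code[pc].
// Advances pc past the whole instruction.
inline std::string read_string(std::vector<int> const& code, std::size_t& pc) {
    std::size_t const len = static_cast<std::size_t>(code[pc + 1]);
    std::string s;
    for (std::size_t i = 0; i < len; ++i)
        s += static_cast<char>(code[pc + 2 + i]);
    pc += 2 + len;
    return s;
}
```

Because the length travels with the characters, a reader of the code vector can treat them as one unit and will not mistake them for neighbouring opcodes or integer operands.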


(Still coding live at https://www.livecoding.tv/sehe/; I pushed the answer so you can read it ahead of time.)

sehe
  • Ahhhhhhhhh, this is exactly what I did when I first tried it with a char. When I tried extending the AST for a string operand and subsequently updated the functor, I had it taking a string, rather than a reference to a string. Ugh. I'll try this. Thanks! – Dylan_Larkin Dec 01 '15 at 16:01
  • It's probably best if you watch the stream :) There's a chat box there too. And it's not over until the fat lady sings (it's not trivial) – sehe Dec 01 '15 at 16:02
  • did you have any luck? – Dylan_Larkin Dec 01 '15 at 17:17
  • Not yet. Life came in between. Later tonight. The thing is, it's not trivial to decide what should _actually happen_ semantically. Making it compile was relatively straightforward. Making it parse, no problem. But making it codegen was only half easy - the second part tripped me up because apparently the stack model is different from what I expected. I'll have to see whether it is at all possible to have dynamic length mutable string variables. – sehe Dec 01 '15 at 17:46
  • I watched your video. Why that approach, rather than changing vector& code to contain various pointers, rather than just int? – Dylan_Larkin Dec 01 '15 at 19:16
  • I guess what I meant was, if it's not possible as you posited, perhaps changing the stack is one solution. But sure, in this case it's prob safer/less hassle not to do that. – Dylan_Larkin Dec 01 '15 at 19:42
  • I think you're confusing stack and opcodes. And, I forgot to mention type safety :) I'll look at things later tonight – sehe Dec 01 '15 at 19:49
  • Yes I am. But we're still pushing operands of different types into the vector of opcodes, so how does your current method circumvent that? – Dylan_Larkin Dec 01 '15 at 19:58
  • I'll address it later - considering you apparently missed it (in `function::op`) – sehe Dec 01 '15 at 20:00
  • Ah. uint8_t. Yes, admittedly I watched your video via fast forward and no sound. The Spark Notes version. – Dylan_Larkin Dec 01 '15 at 20:11
  • my quick and dirty solution to this was to just convert and output each int as its ASCII equivalent with a simple helper function. I still need to add an algorithm to be able to limit the output to just the strings, vs all ints within vector& code. After that, I will look to implement my own string manipulation functions. I am still very much interested in a more robust solution, however, if you have the time. – Dylan_Larkin Dec 07 '15 at 17:05