
How does a C/C++ tokeniser/parser avoid misinterpreting '*', given that it is used both for multiplication and for pointer declarations? For example:

... {
    ...
    obj *var1; // * declares var1 as a pointer to obj
    var1 * var2; // * multiplies var1 by var2
}

Update 1: While tokenising/parsing, we can't yet distinguish an identifier that refers to a variable from an identifier that refers to a type.

Update 2 (context of the question): I'm designing and implementing a programming language in the C/C++ family, where pointers are currently declared like Pointer<int>, and I want to use C-style pointer syntax instead.

Update 3 (on Dec 30, 2016): Some answers to this Stack Overflow question about LR(1) parsers and C++ seem to address my question.

Wael Boutglay
  • "*`obj *var1;`*": Multiplying a type and an undefined token does not make sense, so it could be a variable definition. – alk Dec 26 '16 at 13:18
  • By knowing what is `obj`/`var1`... But indeed parsing C++ is complex. – Jarod42 Dec 26 '16 at 13:18
  • that is why we have keywords and identifiers. – Sourav Ghosh Dec 26 '16 at 13:19
  • but while lexing/parsing, we don't know yet if an identifier is a variable or a type – Wael Boutglay Dec 26 '16 at 13:20
  • The case differs for C and C++. In C you can not have a variable of the same name as a typename. Once a name is declared in typedef, it becomes another token type. In C++, the parser must implement the rule "when it can be a declaration it is a declaration", otherwise it is an expression. – Marian Dec 26 '16 at 13:20
  • @Marian: It is more complex than that, see [Demo](http://coliru.stacked-crooked.com/a/b56770258fbfaff2). `struct S; S*S;S*A;` it is not a declaration of `A`. – Jarod42 Dec 26 '16 at 13:24
  • C++ is not context-free parse-able. – Jarod42 Dec 26 '16 at 13:27
  • @Jarod42 Obviously, it is more difficult than it could be explained in a few lines. It is just the point. In your example `S * S` defines a new variable `S` which overrides and hides the type 'S' for `S * A`. – Marian Dec 26 '16 at 13:32
  • So your sentence *"the parser must implement the rule 'when it can be a declaration it is a declaration'"* is wrong in OP's context. – Jarod42 Dec 26 '16 at 13:38
  • Have you tried `fun = 42 /*ptr`? – Stargateur Dec 26 '16 at 13:42
  • `S * A;` might be a declaration, but it is not. The rule applies in some places on the grammar and/or in specific contexts. – Jarod42 Dec 26 '16 at 13:59
  • @WaelBoutglay: To simplify your grammar, you may add a keyword to declare variables, and so avoid those ambiguities (such as `let = `). – Jarod42 Dec 26 '16 at 14:02
  • @Jarod42, the goal is to make C/C++ code valid in my language, so developers don't have to build wrappers for existing C/C++ libraries – Wael Boutglay Dec 26 '16 at 14:22

1 Answer


The tokeniser doesn't make a distinction between the two. It just treats it as the token *.

The parser knows how to look up names. It knows that obj is a type, so it can parse <type> * <identifier> differently from <non-type> * <non-type>. Your instinct is on to something: it's not possible to parse just the syntax of C without implementing any of the semantics. The only way to get a correct parse of the C syntax is to interpret declarations and keep track of which names name types and which name non-types. Your update:

While tokenising/parsing, we can't yet distinguish an identifier that refers to a variable from an identifier that refers to a type.

is not quite right, since it assumes that tokenising/parsing is done all at once as a separate step. In fact, parsing and semantic analysis are interleaved. When typedef int obj; is parsed, it is interpreted and taken to mean obj now names a type. When parsing continues and obj * var1; is seen, the results of the earlier semantic analysis are available for use.
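The interleaving described above can be illustrated with a minimal sketch. Everything in it is hypothetical and made up for this illustration (`tokenise`, `ToyParser`, `parseStatement`); it is not how any real compiler is structured, and it only understands statements of the form `typedef int X;` and `a * b;`. The point is that the tokeniser emits the same `*` token in both cases, while the parser consults the set of type names it has recorded so far.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Split a statement into identifiers and single-character punctuation tokens.
// '*' is always the same token, whatever it ends up meaning.
std::vector<std::string> tokenise(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalpha(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}

// True if the token starts like an identifier (letter or underscore).
bool isIdentifier(const std::string& tok) {
    unsigned char c = static_cast<unsigned char>(tok[0]);
    return std::isalpha(c) || c == '_';
}

// Hypothetical toy parser: semantic information (the set of type names)
// is updated while parsing and consulted by later parsing decisions.
class ToyParser {
public:
    // Returns "typedef", "declaration", "expression" or "unsupported".
    std::string parseStatement(const std::string& src) {
        std::vector<std::string> t = tokenise(src);
        if (t.size() != 4 || t[3] != ";") return "unsupported";
        if (t[0] == "typedef" && t[1] == "int" && isIdentifier(t[2])) {
            typeNames_.insert(t[2]);  // from now on this name is a type
            return "typedef";
        }
        if (isIdentifier(t[0]) && t[1] == "*" && isIdentifier(t[2])) {
            // Same tokens either way; the recorded type names decide.
            return typeNames_.count(t[0]) ? "declaration" : "expression";
        }
        return "unsupported";
    }

private:
    std::set<std::string> typeNames_;
};
```

With this sketch, `parseStatement("obj * var1;")` is classified as an expression until `typedef int obj;` has been parsed, and as a declaration afterwards, while `var1 * var2;` stays an expression.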

  • Your statement about "_to get a correct parse..._" is quite incorrect! C is a context free grammar. The parser only looks up types _after_ parsing. The type does not influence the parsing (context free!). In this case, `*` used as _unary_ operator is dereference so `a**b` means to dereference `b` and multiply that with `a` and `a*b` can only mean multiplication. – Paul Ogilvie Dec 26 '16 at 15:22
  • See e.g. https://gist.github.com/codebrainz/2933703 for the context free C99 grammar. – Paul Ogilvie Dec 26 '16 at 15:29
  • @PaulOgilvie `a*b;` gets parsed completely differently depending on whether `a` is a type name. You're quite wrong that `a*b;` can only mean multiplication, it can alternatively mean "declare `b` as a pointer to type `a`". The grammar you link to shows an unimplemented `check_type` function that needs to be added to the lexer to return either `TYPE_NAME` or `IDENTIFIER`, whichever is appropriate. Implementing that requires partially implementing the semantics of C, just like I pointed out in my answer. –  Dec 26 '16 at 15:40
  • I see. I stand corrected. If you make a small edit to your answer, I will undo my down-vote. – Paul Ogilvie Dec 26 '16 at 15:49
  • @hvd, you're absolutely right about tokeniser, I just checked Clang source code and found that the lexer/tokeniser returns a token of kind `tok::star` – Wael Boutglay Dec 26 '16 at 18:04
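
The `check_type` hook discussed in the comments above could look roughly like the following. This is only a sketch under stated assumptions: the `ScopedNames` class and its `declareType`, `declareVariable`, `enterScope` and `leaveScope` members are hypothetical names invented here, and only `check_type`, `TYPE_NAME` and `IDENTIFIER` come from the comment. It also shows why scopes matter: an inner declaration of a variable can hide an outer type name, as in the `S * S` example.

```cpp
#include <map>
#include <string>
#include <vector>

enum TokenKind { IDENTIFIER, TYPE_NAME };

// Hypothetical scoped name table that a lexer could consult.
class ScopedNames {
public:
    ScopedNames() : scopes_(1) {}  // start with one (global) scope

    void enterScope() { scopes_.emplace_back(); }
    void leaveScope() {
        if (scopes_.size() > 1) scopes_.pop_back();  // never drop the global scope
    }

    // Called by the parser's semantic actions when a declaration is seen.
    void declareType(const std::string& name) { scopes_.back()[name] = TYPE_NAME; }
    void declareVariable(const std::string& name) { scopes_.back()[name] = IDENTIFIER; }

    // Called by the lexer for every identifier-like word.
    // The innermost declaration wins; unknown names are plain identifiers.
    TokenKind check_type(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return found->second;
        }
        return IDENTIFIER;
    }

private:
    std::vector<std::map<std::string, TokenKind>> scopes_;
};
```

The parser would call `declareType` when it finishes a typedef and `declareVariable` when it finishes an ordinary declaration, so the lexer's answer for a name depends on everything parsed before it.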