Number of tokens in compiler

Question

I want to know number of tokens in the statement given below

a+++b---c

Please tell me number of tokens I told my viva teacher that there are 7 tokens but he said it is wrong.

"a" "++" "+" "b" "--" "-" "c" as said in https://stackoverflow.com/questions/7485088/what-does-the-operation-c-ab-mean#comment9059714_7485105 "lexer of C and C++, try to match the biggest string they can when they see something" and https://en.wikipedia.org/wiki/Maximal_munch — osgx, Jun 03 '17 at 14:23
akansh, you can't tokenize without using context in C and C++. And in some standard there are different tokens produced. — osgx, Jun 03 '17 at 14:30
The expression will be parsed only in some specific C standard or in C++ standard. +, ++, -, -- are same, but << and >> are parsed differently in various C++ standards. — osgx, Jun 03 '17 at 14:37
@osgx: in C, the only contextual tokenisation is the idiosyncratic handling of the `#include` preprocessor directive. Other than that, it's just maximal munch on a set of possible patterns. — rici, Jun 03 '17 at 14:38
@rici, what about `x=y/*z;` example from https://en.wikipedia.org/wiki/Maximal_munch ? Will it really parsed by gcc and clang as comment token? (clang code: http://code.metager.de/source/xref/llvm/clang/lib/Lex/Lexer.cpp#3310) — osgx, Jun 03 '17 at 14:39
@osgx: absolutely. TIAS. And it has been like that since the beginning. (But a comment is not a token. Comments are replaced with whitespace before preprocessing and whitespace is removed after preprocessing. Token are what are fed into syntactic analysis.) — rici, Jun 03 '17 at 14:40
It sounds very much like your instructor is simply wrong. What does your instructor say the tokens are? I'm having a hard time imagining an answer other than 7. — Steve Summit, Jun 03 '17 at 16:35

score 2 · Accepted Answer · answered Jun 03 '17 at 14:45

2

You are correct. There are seven tokens: (in C)

a
++
+
b
--
-
c

answered Jun 03 '17 at 14:45

rici

234,347
28
237
341

2

@akansh: shrug. – rici Jun 03 '17 at 14:49
@akanshsinghal If you believe your instructor, then why are you asking here? (I want to say, I believe the answer is 7 also, but if you're going to tell me your instructor disagrees with me, too, what's the point?) – Steve Summit Jun 03 '17 at 16:26

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

According to the C standard (pre-C11 draft): http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf

6.4 Lexical elements

3 A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. ...

4 If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token. ... 6 EXAMPLE 2 The program fragment x+++++y is parsed as x ++ ++ + y, which violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression

6.4.6 Punctuators .. punctuator: one of ++ -- ... + -

So, https://en.wikipedia.org/wiki/Maximal_munch rule is used as was noted in https://stackoverflow.com/a/7485174 and comment about a+++b fragment: What does the operation c=a+++b mean? Shahbaz Sep 20 2011: "the lexer of C and C++, try to match the biggest string they can when they see something. ... Therefore, when the lexer sees the first plus, it tries the next character, it sees that it can match both characters as a ++, then continues on to see the next +. Hence, the parser sees a ++ + b"

While gcc and clang have complex code and may mix different translation phases of standard in the single code samples (so they are not best guides to the language, as rici said), we may check parsing implementation for ++ and --. When it sees char + it may generate different tokens based on what is the next char, if is + too, emit plusplus token, otherwise emit plus token:

http://code.metager.de/source/xref/llvm/clang/lib/Lex/Lexer.cpp#3264

3264  case '+':
3265    Char = getCharAndSize(CurPtr, SizeTmp);
3266    if (Char == '+') {
3267      CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
3268      Kind = tok::plusplus;
3269    } else if (Char == '=') {
3270      CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
3271      Kind = tok::plusequal;
3272    } else {
3273      Kind = tok::plus;
3274    }
3275    break;

http://code.metager.de/source/xref/gnu/gcc/libcpp/lex.c#2633

2633    case '+':
2634      result->type = CPP_PLUS;
2635      if (*buffer->cur == '+')
2636        buffer->cur++, result->type = CPP_PLUS_PLUS;
2637      else if (*buffer->cur == '=')
2638        buffer->cur++, result->type = CPP_PLUS_EQ;
2639      break;

So, tokens of a+++b---c expression are a ++ + b -- - c. Advisor may say you are wrong, but only wanting you to explain why you think you counted 7. And if the question is the same task as given and it is parsed according to C standard (or C++ which is same lexing for this example), you can explain your answer and show him relevant parts of language standards.

Gcc documents harder case: https://gcc.gnu.org/onlinedocs/cppinternals/Lexer.html "Escaped newlines are tedious because theoretically they can occur anywhere—between the ‘`+`’ and ‘`=`’ of the ‘`+=`’ token ... Moreover, you cannot be sure there is just one—there might be an arbitrarily long sequence of them." .. "Similarly code in the main body of `_cpp_lex_direct` cannot simply check for a ‘`=`’ after a ‘`+`’ character to determine whether it has a ‘`+=`’ token; it needs to be prepared for an escaped newline of some sort. Such cases use the function `get_effective_char`" — osgx, Jun 03 '17 at 15:04
Osgx: nikolai is right about one thing:the implementation of a particular compiler is not a guide to the language. A compiler may be buggy or non-conformant. Or, as in this case, it might attempt to take advantage of the "as-if" rule to use a short-cut. Both clang and gcc combine at least parts of phases one through three into a single lexer algorithm, so quoting their actual code does not provide a good illustration of the standard; indeed, it might be confusing to students. — rici, Jun 03 '17 at 15:27
@rici, Thanks! Is there any better code example with separated phases? What about compcert: https://github.com/AbsInt/CompCert/blob/master/cparser/Lexer.mll#L305 `rule initial = parse .... | "++" { INC(currentLoc lexbuf) }`? — osgx, Jun 03 '17 at 15:30
I don't know of one, but it is certainly possible. CompCert is an interesting compiler, but it does not pretend to validate all phases: (from the manual) "§6.10 Preprocessing directives The CompCert C compiler does not perform preprocessing itself, but delegates this task to an external C preprocessor, such as that of GCC. The external preprocessor is assumed to conform to the C99 standard." — rici, Jun 03 '17 at 20:28

Number of tokens in compiler

2 Answers2