7

I have to use a parser and a writer in C++. I am trying to implement the functions, but I do not understand what a token is. One of my functions/operations checks whether there are more tokens to produce:

bool Parser::hasMoreTokens()

How exactly do I go about this? Please help.

SO!

I am opening a text file with text in it; all words are lowercase. How do I go about checking whether it has more tokens?

This is what I have:

bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Technupe
  • 1
    Please do not expect Stack Overflow to write your code for you. Especially if it's for homework (is it? it sounds like it). Show us what you've tried. If you simply have no idea what to do, and if (as I'm guessing) this is homework, then you should probably ask your teacher / professor / TA and they can (e.g.) point you to the relevant bit of your notes or textbook. – Gareth McCaughan Apr 12 '11 at 17:38

6 Answers

11

Tokens are the output of lexical analysis and the input to parsing. Typically they are things like

  • numbers
  • variable names
  • parentheses
  • arithmetic operators
  • statement terminators

That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
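As a rough illustration of the list above, here is a minimal sketch of a lexer that walks its input one character at a time. The names `TokenKind`, `Token` and `tokenize` are hypothetical, and the choice of operators and of `;` as the statement terminator is an assumption made for the example, not part of any particular language.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories, mirroring the list above.
enum class TokenKind { Number, Identifier, Paren, Operator, Terminator };

struct Token {
    TokenKind kind;
    std::string text;
};

// Returns true if c can appear inside an identifier (letters, digits, underscore).
static bool isNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Splits the input into tokens by examining one character at a time.
// Whitespace and any character that fits no category are skipped.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        char c = input[i];
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // A number is the longest run of digits starting here.
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, input.substr(start, i - start)});
        } else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            // A name is the longest run of name characters starting here.
            std::size_t start = i;
            while (i < input.size() && isNameChar(input[i])) {
                ++i;
            }
            tokens.push_back({TokenKind::Identifier, input.substr(start, i - start)});
        } else if (c == '(' || c == ')') {
            tokens.push_back({TokenKind::Paren, std::string(1, c)});
            ++i;
        } else if (c == '+' || c == '-' || c == '*' || c == '/' || c == '=') {
            tokens.push_back({TokenKind::Operator, std::string(1, c)});
            ++i;
        } else if (c == ';') {
            tokens.push_back({TokenKind::Terminator, ";"});
            ++i;
        } else {
            ++i;  // whitespace or an unrecognized character
        }
    }
    return tokens;
}
```

With this sketch, `tokenize("x = (y + 42);")` yields eight tokens: `x`, `=`, `(`, `y`, `+`, `42`, `)` and `;`.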

One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:

  1. Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.

  2. In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.

So "what exactly is a token?" may not always have a perfectly well defined answer.
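The feedback described in point 2 could look roughly like the sketch below: the parser records each type name it sees declared, and the lexer consults that record when it meets a name. `TypeNameTable`, `NameKind`, `declareType` and `classify` are hypothetical names invented for this illustration; a real C compiler also has to handle scopes, which this ignores.

```cpp
#include <set>
#include <string>

// Hypothetical result of classifying a name.
enum class NameKind { Variable, TypeName };

// Hypothetical table shared by the parser and the lexer.
class TypeNameTable {
public:
    // Called by the parser after it has processed a declaration
    // such as `typedef int foo;`.
    void declareType(const std::string& name) { typeNames_.insert(name); }

    // Called by the lexer to decide which kind of token a name becomes.
    NameKind classify(const std::string& name) const {
        return typeNames_.count(name) != 0 ? NameKind::TypeName
                                           : NameKind::Variable;
    }

private:
    std::set<std::string> typeNames_;
};
```

Before `declareType("foo")` is called, `classify("foo")` reports a variable name; afterwards it reports a type name.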

Gareth McCaughan
5

A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens—or what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)

One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.

(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)

James Kanze
3

A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.

Ernest Friedman-Hill
2

A token is usually akin to a word in spoken language. In C++, int, float, 5.523 and const are all tokens. A token is the minimal unit of text that constitutes a semantic element.

piotr
2

When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.

How do I tokenize a string in C++?
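For the simple case of whitespace-separated words, splitting a string into tokens might look like the sketch below. `splitOnWhitespace` is a hypothetical helper name, and treating whitespace as the only separator is an assumption; other delimiters need a different approach.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text into whitespace-separated sub-strings ("tokens").
std::vector<std::string> splitOnWhitespace(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string word;
    // The extraction fails once no sub-units remain, which ends the loop.
    while (stream >> word) {
        tokens.push_back(word);
    }
    return tokens;
}
```

For example, `splitOnWhitespace("the quick  brown fox")` returns four strings, and an empty input returns an empty vector.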

Community
Jess
0

A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.

Felice Pollano