
I have a simple language which consists of patterns like

size(50*50)
start(10, 20, -x)
forward(15)
stop

It's an example of a turtle-drawing language that I need to properly tokenize. The above is a source code instance; statements and expressions are separated by newlines. I set up my Scanner to use newlines as delimiters. I expect next("start") to eat the string "start", and then next("(") to eat the opening parenthesis. It appears, however, to do something other than what I expect. Has the Scanner already broken the above into tokens based on the delimiter, and/or do I need to approach this differently? To me, "size", "(", "50", "*", "50" and ")" on the first line would constitute separate tokens, but that expectation goes unfulfilled here. How can I tokenize the above with as little code as possible? I don't want to write a tokenizer right now; I am writing an interpreter, so tokenizing is something I'd rather not spend my time on. I'd just like Scanner to work with me here.

My useDelimiter call is as follows:

Scanner s = /* ... */;
s.useDelimiter(Pattern.compile("[\\s]&&[^\\r\\n]"));

Issuing the first next call gives me the entire file contents. Without the above useDelimiter call, it gives me the entire first line.
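To illustrate the token stream I expect, here is a sketch that pulls tokens with java.util.regex directly (the token pattern here is only illustrative, not my actual grammar):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSketch {

    // Illustrative token pattern: identifiers, integers, the punctuation
    // used in the sample source, and line breaks (statement separators).
    static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|\\d+|[()*,-]|\\R");

    /** Returns every token of src, in order, skipping other whitespace. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [size, (, 50, *, 50, )]
        System.out.println(tokenize("size(50*50)"));
    }
}
```

So the first line of the sample should come apart into exactly six tokens.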

Armen Michaeli

2 Answers

3

To write a proper parser, you need to define your language in a formal grammar. Trust me, you want to do it properly or you will have problems downstream.

You can probably represent your tokens as regular expressions at the lowest level, but first you need to be clear about your grammar, which defines how tokens combine into lexical structures. You can represent these rules as recursive functions (methods), known as productions. Each production function can use the Scanner to test whether it is looking at a token it wants. But the Scanner will consume the input, and you can't reverse that.

If you use Scanner, you will find it unsuitable for the following reasons:

  1. It will always parse tokens according to the regular expression,

    1.1 so even if you do get a token you can use, you will have to write more code to decide exactly which token it was,

    1.2 and you may not be able to represent your language's grammar as one big expression.

  2. You can't re-wind. A look-ahead parser (required for lots of grammars like yours) needs to be able to look ahead at the input stream and then decide, if it wants, not to use the input and to let another token parser function use it.

I suggest you write the character lexer yourself, and iterate over a string / array of chars rather than a stream. Then you can re-wind.
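For instance, a minimal hand-written lexer over a char array might look like this (a sketch only; the class name and token categories are illustrative, and rewinding is just resetting the position index):

```java
// Sketch of a hand-written lexer that iterates over a char array.
// Because the position is a plain index, look-ahead is trivial:
// mark() remembers where you are, rewind() goes back.
public class SimpleLexer {
    private final char[] src;
    private int pos = 0;

    public SimpleLexer(String input) {
        this.src = input.toCharArray();
    }

    public int mark() { return pos; }            // remember current position
    public void rewind(int mark) { pos = mark; } // the "re-wind" Scanner lacks

    /** Returns the next token, or null at end of input. */
    public String next() {
        while (pos < src.length && (src[pos] == ' ' || src[pos] == '\t')) pos++;
        if (pos >= src.length) return null;
        char c = src[pos];
        if (Character.isLetter(c)) {             // identifier / keyword
            int start = pos;
            while (pos < src.length && Character.isLetter(src[pos])) pos++;
            return new String(src, start, pos - start);
        }
        if (Character.isDigit(c)) {              // integer literal
            int start = pos;
            while (pos < src.length && Character.isDigit(src[pos])) pos++;
            return new String(src, start, pos - start);
        }
        pos++;
        return String.valueOf(c);                // punctuation, newlines, etc.
    }

    public static void main(String[] args) {
        SimpleLexer lx = new SimpleLexer("forward(15)");
        String t;
        while ((t = lx.next()) != null) {
            System.out.println(t);               // forward, (, 15, )
        }
    }
}
```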

Otherwise, use a ready-built lexer/parser framework like yacc or Coco/R.

Joe
  • I don't need look-ahead, and regular expressions work fine. In fact, the entire parser has already been written, but since my feeble assumptions about the Scanner class have proved to be unfounded, I have a problem on my hands. I know a fair deal about parsers, production rules, and formal grammars, including context-sensitive vs. context-free. What I want to know is, rather, whether the Scanner can be made to work with me on this one; I'd rather not write a lexer and tokenizer myself. The language is incredibly simple (compared to, say, C++), so I hope Scanner will do. – Armen Michaeli Oct 02 '12 at 08:15
  • Actually, I do need look-ahead, and I use it too - the `hasNext*` family of methods of the Scanner class looked like what I needed. Except that, well, it looks unsuitable? – Armen Michaeli Oct 02 '12 at 08:55
  • Well, I think if you can condense all your tokens into one regular expression so that you can take a chunk (and then you'll have to re-read the chunk to see what token it is) then it's a good fit. I have a feeling that you won't be able to represent every possible token, in any context, by a single regex though. It's not the end of the world to write your own parser by hand! – Joe Oct 02 '12 at 09:46
  • I decided to ditch the Scanner class. I have read up on it, and apparently its name is misleading - it's not a scanner in the parsing context, it's more like a `scanf` in C and now I know it. In any case parsing my input using this class will be more cumbersome than needed. I will just write my own simple lexer (i.e. scanner in the "right" sense of the word, also called tokenizer) - the grammar is fairly simple and I can get away with fetching token values using regexes. – Armen Michaeli Oct 02 '12 at 09:55
2

The class java.io.StreamTokenizer may be a better fit. It is used in this example of a recursive descent parser.

Addendum: What is the principal difference between the StreamTokenizer and Scanner classes?

Either can do the lexical analysis required by a parser. StreamTokenizer is lighter weight but limited to four pre-defined meta-tokens. Scanner is considerably more flexible, but somewhat more cumbersome to use. Here's a comparison of the two and a variation on the latter.
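As a rough sketch (the class and method names here are my own; only StreamTokenizer's default settings are used), the question's first sample line comes apart like this:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamTokenizerDemo {

    /** Reads all tokens from src and describes each as "TYPE value". */
    static List<String> describe(String src) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(src));
        List<String> out = new ArrayList<>();
        int tt;
        while ((tt = st.nextToken()) != StreamTokenizer.TT_EOF) {
            switch (tt) {
                case StreamTokenizer.TT_WORD:
                    out.add("WORD " + st.sval);    // identifiers like "size"
                    break;
                case StreamTokenizer.TT_NUMBER:
                    out.add("NUMBER " + st.nval);  // digits parsed as double
                    break;
                default:
                    out.add("CHAR " + (char) tt);  // ordinary chars: ( ) * etc.
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // By default, letters form words, digits form numbers, and
        // '(' ')' '*' are "ordinary" single-character tokens.
        describe("size(50*50)").forEach(System.out::println);
    }
}
```

Note that numbers come back as doubles in nval, so 50 prints as 50.0; that is one of the pre-defined behaviors you would have to work around if it doesn't suit you.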

trashgod
  • What is the principal difference between the StreamTokenizer and Scanner classes? I am starting to think you may be right. What I need is actually a tokenizer - I merely need to extract the tokens; the parser is of my own design. – Armen Michaeli Oct 02 '12 at 08:54