
I am writing a Python interpreter in OCaml using ocamllex, and in order to handle the indentation-based syntax, I want to

  1. tokenize the input using ocamllex
  2. iterate through the list of lexed tokens and insert INDENT and DEDENT tokens as needed for the parser
  3. parse this list into an AST

However, an ocamllex-generated lexer doesn't produce a list of tokens; it is a function that pulls one token at a time from a lexbuf, which makes the intermediate indentation pass awkward. Is there a good way to extract a list of tokens from a lexbuf, i.e.

    let lexbuf = (Lexing.from_channel stdin) in
    let token_list = tokenize lexbuf

where token_list has type Parser.token list? My hack was to define a trivial parser like

    tokenize: /* used by the parser to read the input into the indentation function */
      | token EOL { $1 @ [EOL] }
      | EOL { SEP :: [EOL] }

    token:
      | COLON { [COLON] }
      | TAB { [TAB] }
      | RETURN { [RETURN] }
      ...
      | token token %prec RECURSE { $1 @ $2 }

and to call this like

    let lexbuf = (Lexing.from_channel stdin) in
    let temp = (Parser.tokenize Scanner.token) lexbuf in (* char buffer to token list *)

but this has all sorts of issues with shift-reduce conflicts and unnecessary complexity. Is there a better way to write a lexbuf -> Parser.token list function in OCaml?
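For comparison, a plain recursive loop can drain the lexer into a list without involving the parser at all. A minimal sketch, assuming the scanner is a `Lexing.lexbuf -> token` function that returns a distinguished end-of-input token (the `is_eof` predicate is an assumption here, not part of the question's actual Scanner/Parser modules):

```ocaml
(* Drain any ocamllex-style lexer into a token list.  [token] is the
   lexing function, [is_eof] recognizes the end-of-input token, which
   is kept as the last element of the result. *)
let tokens_of_lexbuf ~is_eof token lexbuf =
  let rec loop acc =
    let tok = token lexbuf in
    if is_eof tok then List.rev (tok :: acc)
    else loop (tok :: acc)
  in
  loop []
```

With the real modules this would presumably be called as `tokens_of_lexbuf ~is_eof:(fun t -> t = Parser.EOF) Scanner.token lexbuf`, assuming the scanner's end-of-input token is named `EOF`.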

JAustin
  • Do you want to hand-write your own parser that takes the token list as its input or do you want to use `ocamlyacc` for your parser? In the latter case, I'd just define a tokenization function that wraps `Scanner.token` and forego the list of tokens altogether. – sepp2k Nov 04 '18 at 18:40
  • I have my own parser written in ocamlyacc. I just need to do some stack-based preprocessing to count indentation and insert INDENT and DEDENT tokens to handle Python-style syntax. Currently, I lex the input, preprocess it, then convert that back into a lexbuf and parse it. The preprocessing is just very hard to write in ocamlyacc, although I might be able to manage it. – JAustin Nov 04 '18 at 18:44
  • When you say "convert that back into a lexbuf", do you mean "back into a function `lexbuf -> token`"? How exactly do you do that? What I'm suggesting is that you do the preprocessing in a wrapper function, so that `wrapper lexbuf` calls `Scanner.token` and then you invoke your parser as `Parser.parse wrapper`. – sepp2k Nov 04 '18 at 18:51
  • I'll try that now. Currently I use a hack I found [here](https://stackoverflow.com/questions/10899544/feed-ocamlyacc-parser-from-explicit-token-list). But your idea could absolutely work. I'll try it now and update. Thanks! – JAustin Nov 04 '18 at 18:54
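The wrapper idea from the comments can be sketched generically: keep a queue of pending tokens, drain it first, and let a preprocessing step expand each scanned token into one or more tokens (e.g. prepending INDENT/DEDENT). This is a hedged sketch, not the actual implementation; the `preprocess` argument and the `indent_filter` name below are hypothetical, and a real version would carry a stack of indentation levels inside `preprocess`:

```ocaml
(* Sketch of the wrapper approach: [make_wrapper preprocess real_token]
   returns a [Lexing.lexbuf -> token] function suitable to pass directly
   to an ocamlyacc parser.  [preprocess] maps each scanned token to the
   (possibly longer) sequence of tokens the parser should actually see;
   it must eventually return a non-empty list or [next] would loop. *)
let make_wrapper preprocess real_token =
  let pending = Queue.create () in
  let rec next lexbuf =
    if Queue.is_empty pending then begin
      (* Queue is empty: scan one real token and enqueue its expansion. *)
      List.iter (fun t -> Queue.push t pending)
        (preprocess (real_token lexbuf));
      next lexbuf
    end
    else Queue.pop pending
  in
  next
```

The parser would then be invoked as `Parser.program (make_wrapper indent_filter Scanner.token) lexbuf`, with no token list materialized at any point.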
