In parsing a large 3 gigabyte file with DCG, efficiency is of importance.
The current version of my lexer is using mostly the or predicate ;/2 but I read that indexing can help.
Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.
Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?
Details
Note: I changed some of the names before coping the source code into this question. If you find a mistake feel free to edit it here or leave me a comment and I will gladly fix it.
Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this
% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If double_quotes flag is set to `code`, the the values with "" will not be matched.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
white_space -->
[Char], { code_type(Char, space) }, !, white_space.
white_space -->
"--", whole_line, !, white_space.
white_space -->
[].
whole_line --> "\n", !.
whole_line --> [_], whole_line.
digit(D) -->
[D],
{ code_type(D, digit) }.
digits([D|T]) -->
digit(D),
!,
digits(T).
digits([]) -->
[].
number(D, N) -->
digits(Ds),
{ number_chars(N, [D|Ds]) }.
letter(L) -->
[L], { code_type(L, alpha) }.
alphanum([A|T]) -->
[A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
[].
alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).
identifier(L, Id) -->
alphanum(As),
{ atom_codes(Id, [L|As]) }.
Here are some helper predicates used for development and testing.
read_file_for_lexing_and_user_review(Path) :-
open(Path,read,Input),
read_input_for_user_review(Input), !,
close(Input).
read_file_for_lexing_and_performance(Path,Limit) :-
open(Path,read,Input),
read_input_for_performance(Input,0,Limit), !,
close(Input).
read_input(Input) :-
at_end_of_stream(Input).
read_input(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
read_input(Input).
read_input_for_user_review(Input) :-
at_end_of_stream(Input).
read_input_for_user_review(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line_for_user_review(Line),
nl,
print('Press spacebar to continue or any other key to exit: '),
get_single_char(Key),
process_user_continue_or_exit_key(Key,Input).
read_input_for_performance(Input,Count,Limit) :-
Count >= Limit.
read_input_for_performance(Input,_,_) :-
at_end_of_stream(Input).
read_input_for_performance(Input,Count0,Limit) :-
% print(Count0),
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
Count is Count0 + 1,
read_input_for_performance(Input,Count,Limit).
process_user_continue_or_exit_key(32,Input) :- % space bar
nl, nl,
read_input_for_user_review(Input).
process_user_continue_or_exit_key(Key) :-
Key \= 32.
lex_line_for_user_review(Line) :-
lex_line(Line,TokList),
print(Line),
nl,
print(TokList),
nl.
lex_line(Line,TokList) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
lex_line(Line) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
read_user_input_for_lexing_and_user_review :-
print('Enter a line to parse or just Enter to exit: '),
nl,
read_string(user, "\n", "\r", _, String),
nl,
lex_line_for_user_review(String),
nl,
continue_user_input_for_lexing_and_user_review(String).
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,N),
N > 0,
read_user_input_for_lexing_and_user_review.
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,0).
read_user_input_for_lexing_and_user_review/0
allows a user to enter a string at the terminal for lexing and review the tokens.
read_file_for_lexing_and_user_review/1
Reads a file for lexing and review the tokens for each line one line at a time.
read_file_for_lexing_and_performance/2
Reads a file for lexing with a limit on the number of lines to lex. This is for use with gathering basic performance statistics to measure efficiency. Meant to be used with time/1.