How can lexing efficiency be improved?

Question

In parsing a large 3 gigabyte file with DCG, efficiency is of importance.

The current version of my lexer is using mostly the or predicate ;/2 but I read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.

Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?

Details

Note: I changed some of the names before coping the source code into this question. If you find a mistake feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this

% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If double_quotes flag is set to `code`, the the values with "" will not be matched.

:- use_module(library(pio)). 
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (
       (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
    "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
      { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
      { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].

alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).

identifier(L, Id) -->
   alphanum(As),
      { atom_codes(Id, [L|As]) }.

Here are some helper predicates used for development and testing.

read_file_for_lexing_and_user_review(Path) :-
    open(Path,read,Input),
    read_input_for_user_review(Input), !,
    close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
    open(Path,read,Input),
    read_input_for_performance(Input,0,Limit), !,
    close(Input).

read_input(Input) :-
    at_end_of_stream(Input).

read_input(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    read_input(Input).

read_input_for_user_review(Input) :-
    at_end_of_stream(Input).

read_input_for_user_review(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line_for_user_review(Line),
    nl,
    print('Press spacebar to continue or any other key to exit: '),
    get_single_char(Key),
    process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(Input,Count,Limit) :-
    Count >= Limit.

read_input_for_performance(Input,_,_) :-
    at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
    % print(Count0),
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    Count is Count0 + 1,
    read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :-  % space bar
    nl, nl,
    read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key) :-
    Key \= 32.

lex_line_for_user_review(Line) :-
    lex_line(Line,TokList),
    print(Line),
    nl,
    print(TokList),
    nl.

lex_line(Line,TokList) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

lex_line(Line) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
    print('Enter a line to parse or just Enter to exit: '),
    nl,
    read_string(user, "\n", "\r", _, String),
    nl,
    lex_line_for_user_review(String),
    nl,
    continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,N),
    N > 0,
    read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,0).

read_user_input_for_lexing_and_user_review/0 allows a user to enter a string at the terminal for lexing and review the tokens.

read_file_for_lexing_and_user_review/1 Reads a file for lexing and review the tokens for each line one line at a time.

read_file_for_lexing_and_performance/2 Reads a file for lexing with a limit on the number of lines to lex. This is for use with gathering basic performance statistics to measure efficiency. Meant to be used with time/1.

Of interest: [Choice points and Redo's in Prolog](https://stackoverflow.com/a/45426421/1243762) - How indexing affects SWI-Prolog tracer. — Guy Coder, Jan 18 '19 at 22:21
Of interest: [How is a integer created as a character code constant?](https://stackoverflow.com/q/41637402/1243762) - Explains use of Prolog character code constants such as `0'\n` — Guy Coder, Jan 18 '19 at 22:23
Of interest: [Stack overflow in Prolog DCG grammar rule: how to handle large lists efficiently or lazily](https://stackoverflow.com/a/12942551/1243762) This it a Q&A about parsing using DCG and the answer has a section about exploiting indexing. — Guy Coder, Jan 18 '19 at 22:40
Of interest: GitHub SWI-Prolog swipl-devel/src/Tests/core/[test_dcg.pl](https://github.com/SWI-Prolog/swipl-devel/blob/4bbfbecd32eda899b03e38e6a60054bdb493e579/src/Tests/core/test_dcg.pl) — Guy Coder, Jan 19 '19 at 00:52
Of interest: GitHub SWI-Prolog swipl-devel/src/Unicode/[derived_core_properties.pl](https://github.com/SWI-Prolog/swipl-devel/blob/4bbfbecd32eda899b03e38e6a60054bdb493e579/src/Unicode/derived_core_properties.pl) - Real world examples of parsing with DCG. — Guy Coder, Jan 19 '19 at 00:53
Of interest: GitHub SWI-Prolog swipl-devel/library/statistics.pl [time/1](https://github.com/SWI-Prolog/swipl-devel/blob/4bbfbecd32eda899b03e38e6a60054bdb493e579/library/statistics.pl#L271) - Source code for time/1. Could extend this to show more statistics. — Guy Coder, Jan 19 '19 at 01:00
Of interest: GitHub SWI-Prolog swipl-devel/library/[console_input.pl](https://github.com/SWI-Prolog/swipl-devel/blob/4bbfbecd32eda899b03e38e6a60054bdb493e579/library/console_input.pl) - Another example using DCG with indexing. — Guy Coder, Jan 19 '19 at 01:03
Of interest: GitHub SWI-Prolog swipl-devel/library/[utf8.pl](https://github.com/SWI-Prolog/swipl-devel/blob/4bbfbecd32eda899b03e38e6a60054bdb493e579/library/utf8.pl) - Why does this DCG also use lots of or predicates `;/2` ? — Guy Coder, Jan 19 '19 at 01:09
Of interest: SWI-Prolog [Character properties](http://www.swi-prolog.org/pldoc/man?section=chartype) - In paticular [char_type/2](http://www.swi-prolog.org/pldoc/doc_for?object=char_type/2) and [code_type/2](http://www.swi-prolog.org/pldoc/doc_for?object=code_type/2) — Guy Coder, Jan 19 '19 at 08:36
Of interest: SWI-Prolog [Environment Control (Prolog flags)](http://www.swi-prolog.org/pldoc/man?section=flags) - In particular [set_prolog_flag/2](http://www.swi-prolog.org/pldoc/doc_for?object=set_prolog_flag/2), [current_prolog_flag/2](http://www.swi-prolog.org/pldoc/doc_for?object=current_prolog_flag/2), flag [double_quotes](http://www.swi-prolog.org/pldoc/man?section=flags#flag:double_quotes), and flag [back_quotes](http://www.swi-prolog.org/pldoc/man?section=flags#flag:back_quotes) — Guy Coder, Jan 19 '19 at 08:54
Of interest: SWI-Prolog [The string type and its double quoted syntax](http://www.swi-prolog.org/pldoc/man?section=strings) - In particular [string_codes/2](http://www.swi-prolog.org/pldoc/man?section=strings#string_codes/2) and [read_string/5](http://www.swi-prolog.org/pldoc/man?section=strings#read_string/5) — Guy Coder, Jan 19 '19 at 09:03
Of interest: SWI-Prolog [Why has the representation of double quoted text changed?](http://www.swi-prolog.org/pldoc/man?section=ext-dquotes-motivation) - With regards to indexing see last section `Checks for a character to be in a set` (Was unable to get a direct link to the section). — Guy Coder, Jan 19 '19 at 09:10
Of interest: SWI-Prolog [portray_text/1](http://www.swi-prolog.org/pldoc/man?predicate=portray_text/1) - Helps with debugging list of character codes. — Guy Coder, Jan 19 '19 at 09:12
Of interest: SWI-Prolog [library(readutil): Reading lines, streams and files](http://www.swi-prolog.org/pldoc/man?section=readutil) - In particular [read_line_to_codes/2](http://www.swi-prolog.org/pldoc/man?section=readutil#read_line_to_codes/2) which might be better than using read_string/5 and string_chars/2 in combination. — Guy Coder, Jan 19 '19 at 09:22
Of interest: SWI-Prolog library(pio): [Pure I/O](http://www.swi-prolog.org/pldoc/man?section=pio) - `This module is part of pio.pl, dealing with pure input: processing input streams from the outside world using pure predicates, notably grammar rules (DCG). Using pure predicates makes non-deterministic processing of input much simpler.` — Guy Coder, Jan 19 '19 at 09:26
Of interest: SWI-Prolog library(dcg/basics): [Various general DCG utilities](http://www.swi-prolog.org/pldoc/man?section=basics) — Guy Coder, Jan 19 '19 at 09:30
Of interest: [What is the difference between ' and " in Prolog?](https://stackoverflow.com/a/8269897/1243762) — Guy Coder, Jan 19 '19 at 09:32
Of interest: [Prolog getting head and tail of string](https://stackoverflow.com/a/36645725/1243762) - This answer talks about using Prolog flags `back_quotes` and `double_quotes`. — Guy Coder, Jan 19 '19 at 09:42
Of interest: [Island grammar](https://en.wikipedia.org/wiki/Island_grammar) - Possibly a new term for those not use to writing parsers. For more info read [Islands in the Stream](http://media.pragprog.com/titles/tpantlr2/islands.pdf) — Guy Coder, Jan 19 '19 at 10:19
Of interest: SWI-Prolog [Indexing databases](http://www.swi-prolog.org/pldoc/man?section=strings) — Guy Coder, Jan 19 '19 at 15:55
Of interest: SWI-Prolog [Character Escape Syntax](http://www.swi-prolog.org/pldoc/man?section=charescapes) - When unit testing need to test with unprintable characters. Knowing these representations help. — Guy Coder, Jan 21 '19 at 14:48
Of interest: SWI-Prolog [Unit Testing](http://www.swi-prolog.org/pldoc/doc_for?object=section(%27packages/plunit.html%27)) - You should always do unit testing. Using this helped me to find some errors, make some enhancements, and write variations of the code knowing that the test results correct and keeping me honest. — Guy Coder, Jan 23 '19 at 13:17
Of interest: [XSB: Extending Prolog with Tabled Logic Programming](https://arxiv.org/pdf/1012.5123v1.pdf) — Guy Coder, Mar 03 '19 at 10:12
Of interest: [Prolog DCG: find last element](https://stackoverflow.com/q/21457625/1243762) — Guy Coder, Mar 04 '19 at 12:14

score 3 · Answer 1 · answered Jan 19 '19 at 23:00

One thing it means is that this is silly code:

token(T) -->
    ( "1", !, { T = one }
    ; "2", !, { T = two }
    ; "3", !, { T = three }
    )

This is less silly code:

token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But still not so good. Maybe better:

token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

Last example also starts to look silly but remember that now you have indexing on first argument. You read once, no choice point, no backtrack.

But if you want to really know how to write efficient you need to measure where the time and space goes. Have you measured?

But if you really want to know how to fix you maybe read "Craft of Prolog", I do not understand all of this book but I remember it had big section on DCG.

But if you really want to parse such formats large files maybe find existing libraries in other languages, it might be much faster than fastest Prolog.

score 2 · Accepted Answer · edited Mar 03 '19 at 12:38

Solution:

You should replace the following:

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

with

lexer(Tokens) -->
   white_space,
   (
      (
         op_token(Token), ! % replace ;/2 long chain searched blindly with call to new predicate op_token//1 which clauses have indexed access by first arg in Prolog standard way
      ;
         digit(D),  !, number(D, N),
         { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
         {  member((Id, Token), [ (div, tokDiv),
                                 (mod, tokMod),
                                 (where, tokWhere)]),
            !
      ;  Token = tokVar(Id)
         }
      ;  [_],
         { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;
      [],
      { Tokens = [] }
   ).

%%%
op_token(tokColon)      --> ";".
op_token(tokLParen)     --> "(".
op_token(tokRParen)     --> ")".
op_token(tokLMusta)     --> "{".
op_token(tokRMusta)     --> "}".
op_token(tokBackSlash)  --> "\\".
op_token(tokImpl)       --> "->".
op_token(tokPlus)       --> "+".
op_token(tokMinus)      --> "-".
op_token(tokTimes)      --> "*".
op_token(tokEqual)      --> "=".
op_token(tokLt)         --> "<".
op_token(tokGt)         --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod)     --> ".".
op_token(tokSlash)      --> "/".
op_token(tokComma)      --> ",".
op_token(tokSemicolon)  --> ";".

Edit by Guy Coder

I ran a test using the example data posted in the question into a list where each item in the list was a line in the data converted to character codes. Then with time/1 called lexer on each item in the list and repeated the test for the list 10000 times. The reason the data was loaded into a list and converted to characters codes before time/1 was so that those processes did not skew the results. Each of these runs was repeated 5 times to get a consistency of data.

In the following runs below, for all of the different versions the lexer was extended to cover all of the 7-bit ASCII characters which significantly increased the number of cases for special characters.

The version of Prolog used for the following was SWI-Prolog 8.0.

For the version in the question.

Version: 1

:- set_prolog_flag(double_quotes,chars).

% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)

For the version as posted above in this answer

Version: 2

:- set_prolog_flag(double_quotes,chars).

% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)

Version 2 gives a dramatic improvement by using indexing from Version 1.

In doing further research on the code, upon looking at op_token which is DCG and has two hidden variables for implicitly passing around a state representation, using listing/1 showed:

op_token(tokUnderscore,['_'|A], A).

Notice that the first parameter is not the character being searched and that in this answer the indexing code is written as

c_digit(0'0,0).

where the first parameter is the character being searched and the second parameter is the result.

So change this

op_token(Token), !

to this

[S], { special_character_indexed(S,Token) }

with indexed clauses as

special_character_indexed( ';' ,tokSemicolon).

Version: 3

:- set_prolog_flag(double_quotes,chars).

% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)

Version 3 gives a slightly better but consistently better result than Version 2.

Lastly just changing double_quotes flag to atom as noted in a comment by AntonDanilov

Version: 4

:- set_prolog_flag(double_quotes,atom).

% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)

Version 4 is almost the same as Version 3.

Just looking at CPU numbers, using indexing is faster, e.g. (Version: 1) 151.875 vs (Version: 3) 74.547

I don't know what `sky-scrapper` means in the comment `% replace OR sky-scrapper with call to new predicate` so I asked it as a separate [question](https://stackoverflow.com/q/54276252/1243762). — Guy Coder, Jan 20 '19 at 12:07
sky scrappers are the multifloored extremely high buildings. there's lots of them in the big America's cities =D — Anton Danilov, Jan 20 '19 at 14:09
look at first code fragment looks like sky-scrapper manhattan new-york etc isn't it? — Anton Danilov, Jan 20 '19 at 14:15
well i'd explain: disjunctive chain just wil be walked thru blind search afaik . I replace them by clauses chain which are indexed by first arg — Anton Danilov, Jan 20 '19 at 14:19
Correct spelling is "skyscraper" (one 'p' is important in the long 'a' pronunciation in this case, and it just happens to be one word). But I've never heard this term used in reference to any kind of code structure before. Do you have a reference? — lurker, Jan 20 '19 at 15:33
skyscraper and ;/2 code structure is just a humorous comparison. I could name so any object resembling me a skyscraper :) due its big visual height. My replacement code also looks like skyscraper for me. — Anton Danilov, Jan 20 '19 at 16:55
@GuyCoder typical error ["..."] ==> "...". "..." is the list of `int`'s! — Anton Danilov, Jan 22 '19 at 02:55
i mean that "..." is list itself so there's no need to wrap it into another list — Anton Danilov, Jan 22 '19 at 12:36
Do what you want to do. Btw you can do the indexing even better: `:-set_prolog_flag(double_quotes, atoms ). ... [S], { op_token(S, Token) }` if i'm right there's no need to remove anything because `S` is atom. — Anton Danilov, Jan 24 '19 at 05:57
@GuyCoder Ok 9:30am is my local time. >> `I also appreciate that you take the time to research the details.` I did it also for my work — Anton Danilov, Jan 24 '19 at 06:32
What Prolog version? Concerning SWI-Prolog: Deep indexing was added in version 7.7.4. So you might see different measurements before and after this release. — , Mar 03 '19 at 04:49
You were using SWI-Prolog 8.0. Oki Doki. That skipped my attention. — , Mar 03 '19 at 17:28
@GuyCoder. I'm not able to reproduce the timing results you gave above. As-is the code is erroneous, so what are you actually measuring?! — repeat, Mar 05 '19 at 21:37
The cases "colon", "slash" and "backslash" are erroneous. (1x typo, 2x different token names)... — repeat, Mar 05 '19 at 21:38
@repeat Starting to remember the details. The sample input data used for the timing test was in the original question, but deleted because people were down voting the question for being to long. Look at this [edit](https://stackoverflow.com/posts/54259696/revisions) version. The data starts with line `ID GRAA_HUMAN Reviewed;` If I remember correctly the entire file was used because it was small enough. — Guy Coder, Mar 05 '19 at 23:23
@repeat You can also get the data used from the manual in this [section](https://www.uniprot.org/docs/userman.htm#entrystruc) You will have to put it into a file and should be 373 lines long, not the 65 million in the full file. — Guy Coder, Mar 05 '19 at 23:26
@repeat I tried to recreate the code and timing as posted in this answer and found that the code I used was not posted to the answer, again because people were complaining, and it is no longer on my machine. The description above on how the test were run is valid and should have enough detail on how to recreate the code to get the timings. — Guy Coder, Mar 06 '19 at 00:29

How can lexing efficiency be improved?

2 Answers2

Linked