
When using chars (lists of characters, i.e., of atoms of length one) to represent text, we have the following options for writing them within terms:

  • "First," the double quoted list notation (6.3.7) is the most efficient one, requiring at least n+2 characters. But it can only be read back if the Prolog flag double_quotes is set to chars.

  • ['N',e,x,t,','] comes the list notation, requiring at least 2n+1 characters. While it is nice and relatively compact, it implies that operators are also used when writing other data, since it is enabled with ignore_ops(false), and this necessitates that the same operator declarations are present when reading the text back, making it quite brittle.

  • '.'('L','.'(a,'.'(s,'.'(t,'.'(',',[]))))) the canonical notation, which uses functional form also for lists and requires at least 7n+2 characters. That is a lot, but for interoperability (and that includes interoperability with the same system) it is best, since it depends neither on the double_quotes flag nor on the various operator declarations.
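
For concreteness, here is a small sketch showing that the three notations denote one and the same term (assuming the Prolog flag double_quotes is set to chars); for n = 3 the three forms take 5, 7, and 23 characters, matching the counts above:

?- set_prolog_flag(double_quotes, chars).
   true.
?- "abc" == [a,b,c], [a,b,c] == '.'(a,'.'(b,'.'(c,[]))).
   true.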

Writing chars in canonical notation can be done in constant space. But for reading, the situation is a bit trickier. After all, a sequence starting with '.'(a, may also refer to a term like '.'(a,Further,b). So a naive reader will have to wait (and use space) until the entire list of chars has been read in. On the other hand, it seems a safe bet that '.'(a, will turn out to be a list constructor '.'(a,Further). In other words,

How to read a term in canonical notation with constant auxiliary space for the reading of chars within?

In case it helps, just consider the terms described by sampleterm/1 below. That is, consider the reading of all such terms written in canonical form. And, if you like, formulate it as a DCG.

sampleterm([]).
sampleterm(a).
sampleterm(b).
sampleterm('.'(E,Es)) :- % the real list constructor 
   sampleterm(E),
   sampleterm(Es).
sampleterm('.'(E,F,G)) :- % no list constructor
   sampleterm(E),
   sampleterm(F),
   sampleterm(G).
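
To see the ambiguity concretely, compare the canonical text of two sampleterms. This is just a sketch; it assumes a system whose write_canonical/1 writes lists in functional notation, as GNU Prolog does (others may print [a] instead):

?- write_canonical('.'(a,[])), nl,      % the real list constructor: the list [a]
   write_canonical('.'(a,[],[])), nl.   % no list constructor: a '.'/3 term
'.'(a,[])
'.'(a,[],[])

Both outputs share the prefix '.'(a,[], and only the character after it (')' versus ',') reveals whether the first cell is a list constructor or a '.'/3 term.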

If such space-efficient reading is possible, then systems that support a compact internal representation of chars, like Scryer and Trealla, could even go a tiny step further.

Ah, lest I forget what I have tried: read/1 indeed, but so far it was not ideal.

false
  • just for clarity: do strings in canonical notation have to be lists or can they be binary trees? I assume the former, but in your example `sampleterm/1` (the real list constructor) would accept trees as well. – DuDa Dec 06 '21 at 13:55
  • @DuDa: Note that I avoid using the word "string" as it is so ambiguous. The focus is on lists of characters and their efficient reading (within general reading, which may also include `'.'/3`). – false Dec 06 '21 at 15:42
  • Could you add an example to illustrate the perceived problem with `[nice, list, syntax]` and operators? – Isabelle Newbie Dec 06 '21 at 19:31
  • @IsabelleNewbie: See above, it's the option `ignore_ops(false)` which enables not only `[nice, list, syntax]` but also the current operators. – false Dec 07 '21 at 11:03
  • @false: Your `sampleterm` program throws an [instantiation_error in SWI](https://swish.swi-prolog.org/p/space-efficient-reading-of-chars-in-canonical-form.pl) – gusbro Dec 09 '21 at 14:15
  • @gusbro: This is about [tag:iso-prolog] systems; SWI [no longer conforms](https://www.complang.tuwien.ac.at/ulrich/iso-prolog/SWI7_and_ISO). – false Dec 09 '21 at 18:38

3 Answers


The following straightforward code is based on Prolog streams.

It focuses on reading "real list" sampleterms from repositionable streams.

For (1) non-repositionable streams and (2) handling '.'/3 we fall back to read/1.

The main predicate is read_sampleterm/1:

read_sampleterm(Term) :-
    current_input(S),
    (  stream_property(S,reposition(true)),
       stream_property(S,position(P)),
       (  aux_read_sampleterm_1(Term0),
          get_char('.')                    % this is sloppy!
       -> true
       ;  set_stream_position(S,P),
          fail
       )
    -> Term = Term0
    ;  read(Term)                          % fallback
    ).

Note that the above code is sloppy: at the end of the read, we need to ensure that EOF or a character that does not combine with '.' follows.
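
One way to tighten this check could be to peek at the character following the end dot; the following is only a sketch (valid_end_follower/1 is a hypothetical helper, and details such as handling a following comment may need more care):

read_end_token :-
    get_char('.'),
    peek_char(Ch),
    valid_end_follower(Ch).

valid_end_follower(end_of_file).   % end of stream (peek_char/1 yields end_of_file)
valid_end_follower(' ').           % layout characters
valid_end_follower('\t').
valid_end_follower('\n').
valid_end_follower('%').           % a following comment also ends the token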

The actual reading is done by the following auxiliary predicates:

aux_read_sampleterm_1(Term) :-
    get_char(Ch),
    aux_read_sampleterm_2(Ch,Term,0).      % use indexing

aux_read_sampleterm_2('\'',[X|Xs],N0) :-
    get_char('.'),
    get_char('\''),
    get_char('('),
    aux_read_sampleterm_1(X),
    get_char(','),
    N1 is N0 + 1,
    get_char(Ch),
    aux_read_sampleterm_2(Ch,Xs,N1).
aux_read_sampleterm_2('[',[],N) :-
    get_char(']'),
    eat_rparens(N).
aux_read_sampleterm_2(a,a,N) :-
    eat_rparens(N).
aux_read_sampleterm_2(b,b,N) :-
    eat_rparens(N).

eat_rparens(N) :-
    (  N > 0
    -> get_char(')'),
       N0 is N - 1,
       eat_rparens(N0)
    ;  true
    ).
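
To see how the counter works, here is a worked sketch (not program output) of reading the characters of '.'(a,'.'(b,[])):

%   '.'(a,    clause for '\'': consumes '.'( , the element a and the ','   (counter 0 -> 1)
%   '.'(b,    the same clause again for the inner cell                     (counter 1 -> 2)
%   []        clause for '[': consumes the ']' and calls eat_rparens(2)
%   ))        eat_rparens/1 consumes one ')' for each still-open list cell

Apart from the term being built, only the counter is threaded along the spine of the list, and the recursive call in the first clause is a last call.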

To show some simple use cases we read from files:

read_sampleterm_from_file(File,Term) :-
    open(File,read,S,[type(text)]),
    current_input(S0),
    set_input(S),
    read_sampleterm(Term0),
    set_input(S0),
    close(S),
    Term = Term0.

Sample queries using GNU Prolog 1.5.0:

First, sample1.txt:

'.'(a,'.'(b,[])).

We get:

| ?- read_sampleterm_from_file('sample1.txt',T).

T = [a,b]

yes

Next, sample2.txt:

'.'(a,'.'(b,a)).

We get:

| ?- read_sampleterm_from_file('sample2.txt',T).

T = [a,b|a]

yes

sample3.txt is next:

'.'('.'(a,'.'(b,'.'(a,[]))),[]).

We get:

| ?- read_sampleterm_from_file('sample3.txt',T).

T = [[a,b,a]]

(1 ms) yes

Note that the above tests were run without the "fallback option".

repeat
  • No `peek_char/1`? – false Jun 16 '22 at 08:22
  • You are only able to read canonical char lists with constant space when the term is exactly one such list of chars. But not even a list of chars can be read in efficiently. So this answers the question for one very specific case only. – false Jun 16 '22 at 08:33
  • For clarification, consider `'.'('.'(a,'.'(b,'.'(c,[]))),[])` which is canonical for `["abc"]` – false Jun 16 '22 at 08:52
  • @false. How embarrassing... it didn't even occur to me that atom input streams could lack the `reposition(true)` property. In fact, in GNU Prolog they do... – repeat Jun 16 '22 at 15:42

Lexer (lexer.pl):

:- module(lexer, []).

:- use_module(library(debug)).
:- use_module(library(dcgs)).
:- use_module(library(lists), [append/2, append/3, member/2]).


/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Tokens.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

term(Ts) --> tokens(Ts).

% read_term(Ts) --> term(Ts0), end, !, { append(Ts0, [end], Ts) }.

read_term_([end]) --> end, !.
read_term_([T|Ts]) --> token(T), read_term_(Ts).

% Greedy.
tokens([T|Ts]) --> token(T), tokens(Ts).
tokens([]) --> [].

token(name(Cs)) --> name(Cs), !.
token(variable(Cs)) --> variable(Cs), !.
token(float_number(Cs)) --> float_number(Cs), !. % 3
token(integer(Cs)) --> integer(Cs), !. % 2
token(double_quoted_list(Cs)) --> double_quoted_list(Cs), !.
token(open) --> open, !.
token(open_ct) --> open_ct, !.
token(close) --> close_, !.
token(open_list) --> open_list, !.
token(close_list) --> close_list, !.
token(open_curly) --> open_curly, !.
token(close_curly) --> close_curly, !.
token(ht_sep) --> ht_sep, !.
token(comma) --> comma, !.

name(Cs) --> (layout_text_sequence '|' []), !, name_token(Cs).
variable(Cs) --> (layout_text_sequence '|' []), !, variable_token(Cs).
integer(Cs) --> (layout_text_sequence '|' []), !, integer_token(Cs).
float_number(Cs) --> (layout_text_sequence '|' []), !, float_number_token(Cs).
double_quoted_list(Cs) -->
    (layout_text_sequence '|' []), !, double_quoted_list_token(Cs).
open --> layout_text_sequence, open_token.
open_ct --> open_token.
close_ --> (layout_text_sequence '|' []), !, close_token.
open_list --> (layout_text_sequence '|' []), !, open_list_token.
close_list --> (layout_text_sequence '|' []), !, close_list_token.
open_curly --> (layout_text_sequence '|' []), !, open_curly_token.
close_curly --> (layout_text_sequence '|' []), !, close_curly_token.
ht_sep --> (layout_text_sequence '|' []), !, head_tail_separator_token.
comma --> (layout_text_sequence '|' []), !, comma_token.

end --> (layout_text_sequence '|' []), !, end_token.

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Layout text.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

layout_text_sequence --> layout_text, layout_texts.

% Greedy.
layout_texts --> layout_text, layout_texts.
layout_texts --> [].

layout_text --> layout_char(_), !.
layout_text --> comment, !.

comment --> single_line_comment, !.
comment --> bracketed_comment, !.

single_line_comment --> end_line_comment_char(_), comment_text, new_line_char(_).
% Greedy. The order is important.
% single_line_comment -->
%     end_line_comment_char(_), comment_text, new_line_char(_), !.
% single_line_comment -->
%     end_line_comment_char(_), comment_text, [_], !, { false }.
% single_line_comment --> end_line_comment_char(_), comment_text.

bracketed_comment --> comment_open, comment_text, comment_close.

comment_open --> comment_1_char, comment_2_char.
comment_close --> comment_2_char, comment_1_char.
comment_text --> chars(_).

comment_1_char --> "/".
comment_2_char --> "*".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Names.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

name_token(Cs) --> letter_digit_token(Cs), !.
name_token(Cs) --> graphic_token(Cs), !.
name_token(Cs) --> quoted_token(Cs), !.
name_token(Cs) --> semicolon_token(Cs), !.
name_token(Cs) --> cut_token(Cs), !.

letter_digit_token([C|Cs]) --> small_letter_char(C), alphanumeric_chars(Cs).

graphic_token(_) --> ".", layout_char(_), !, { false }.
graphic_token(_) --> ".", end_line_comment_char(_), !, { false }.
graphic_token([C|Cs]) --> graphic_token_char(C), graphic_token_chars(Cs).

% Greedy.
graphic_token_chars([C|Cs]) --> graphic_token_char(C), graphic_token_chars(Cs).
graphic_token_chars([]) --> [].

graphic_token_char(C) --> graphic_char(C), !.
graphic_token_char(C) --> backslash_char(C), !.

quoted_token(Cs) -->
    single_quote_char(_),
    single_quoted_items(Cs),
    single_quote_char(_).

% Greedy.
single_quoted_items(Cs) -->
    single_quoted_item(Cs0),
    single_quoted_items(Cs1),
    { append(Cs0, Cs1, Cs) }.
single_quoted_items([]) --> [].

single_quoted_item([C]) --> single_quoted_character(C), !.
single_quoted_item([]) --> continuation_escape_sequence, !.

continuation_escape_sequence --> backslash_char(_), new_line_char(_).

semicolon_token([C]) --> semicolon_char(C).

cut_token([C]) --> cut_char(C).

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Quoted characters.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

single_quoted_character(C) --> non_quote_char(C), !.
single_quoted_character(C) --> single_quote_char(C), single_quote_char(C), !.
single_quoted_character(C) --> double_quote_char(C), !.
single_quoted_character(C) --> back_quote_char(C), !.

double_quoted_character(C) --> non_quote_char(C), !.
double_quoted_character(C) --> single_quote_char(C), !.
double_quoted_character(C) --> double_quote_char(C), double_quote_char(C), !.
double_quoted_character(C) --> back_quote_char(C), !.

back_quoted_character(C) --> non_quote_char(C), !.
back_quoted_character(C) --> single_quote_char(C), !.
back_quoted_character(C) --> double_quote_char(C), !.
back_quoted_character(C) --> back_quote_char(C), back_quote_char(C), !.

non_quote_char(C) --> graphic_char(C), !.
non_quote_char(C) --> alphanumeric_char(C), !.
non_quote_char(C) --> solo_char(C), !.
non_quote_char(C) --> space_char(C), !.
non_quote_char(C) --> meta_escape_sequence(C), !.
non_quote_char(C) --> control_escape_sequence(C), !.
non_quote_char(C) --> octal_escape_sequence(C), !.
non_quote_char(C) --> hexadecimal_escape_sequence(C), !.

meta_escape_sequence(C) --> backslash_char(_), meta_char(C0),
    { member(C0-C, [('\\')-('\\'), ''''-'''', '"'-'"', '`'-'`']), ! }.

control_escape_sequence(C) --> backslash_char(_), symbolic_control_char(C0),
    {   member(
            C0-C,
            [
                'a'-'\a',
                'b'-'\b',
                'r'-'\r',
                'f'-'\f',
                't'-'\t',
                'n'-'\n',
                'v'-'\v'
            ]
        ), !
    }.

symbolic_control_char(C) --> symbolic_alert_char(C), !.
symbolic_control_char(C) --> symbolic_backspace_char(C), !.
symbolic_control_char(C) --> symbolic_carriage_return_char(C), !.
symbolic_control_char(C) --> symbolic_form_feed_char(C), !.
symbolic_control_char(C) --> symbolic_horizontal_tab_char(C), !.
symbolic_control_char(C) --> symbolic_new_line_char(C), !.
symbolic_control_char(C) --> symbolic_vertical_tab_char(C), !.

symbolic_alert_char('a') --> "a".
symbolic_backspace_char('b') --> "b".
symbolic_carriage_return_char('r') --> "r".
symbolic_form_feed_char('f') --> "f".
symbolic_horizontal_tab_char('t') --> "t".
symbolic_new_line_char('n') --> "n".
symbolic_vertical_tab_char('v') --> "v".

octal_escape_sequence(C) -->
    backslash_char(_),
    octal_digit_char(C0),
    octal_digit_chars(Cs),
    backslash_char(_),
    {   number_chars(N, ['0', 'o', C0|Cs]),
        char_code(C, N)
    }.

hexadecimal_escape_sequence(C) -->
    backslash_char(_),
    symbolic_hexadecimal_char(C0),
    hexadecimal_digit_char(C1),
    hexadecimal_digit_chars(Cs),
    backslash_char(_),
    {   number_chars(N, ['0', C0, C1|Cs]),
        char_code(C, N)
    }.

symbolic_hexadecimal_char('x') --> "x".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Variables.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

variable_token(Cs) --> named_variable(Cs), !. % 1
variable_token(Cs) --> anonymous_variable(Cs), !. % 0

anonymous_variable([C]) --> variable_indicator_char(C).

named_variable([C0, C1|Cs]) -->
    variable_indicator_char(C0), !,
    alphanumeric_char(C1),
    alphanumeric_chars(Cs).
named_variable([C|Cs]) -->
    capital_letter_char(C), !,
    alphanumeric_chars(Cs).

variable_indicator_char(C) --> underscore_char(C).

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Integer numbers.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

integer_token(Cs) --> character_code_constant(Cs), !.
integer_token(Cs) --> binary_constant(Cs), !.
integer_token(Cs) --> octal_constant(Cs), !.
integer_token(Cs) --> hexadecimal_constant(Cs), !.
integer_token(Cs) --> integer_constant(Cs), !.

integer_constant([C|Cs]) --> decimal_digit_char(C), decimal_digit_chars(Cs).

character_code_constant(['0', C0, C]) -->
    "0", single_quote_char(C0), single_quoted_character(C).

binary_constant(Cs) -->
    binary_constant_indicator(Cs0),
    binary_digit_char(C),
    binary_digit_chars(Cs1),
    { append(Cs0, [C|Cs1], Cs) }.

binary_constant_indicator("0b") --> "0b".

octal_constant(Cs) -->
    octal_constant_indicator(Cs0),
    octal_digit_char(C),
    octal_digit_chars(Cs1),
    { append(Cs0, [C|Cs1], Cs) }.

octal_constant_indicator("0o") --> "0o".

hexadecimal_constant(Cs) -->
    hexadecimal_constant_indicator(Cs0),
    hexadecimal_digit_char(C),
    hexadecimal_digit_chars(Cs1),
    { append(Cs0, [C|Cs1], Cs) }.

hexadecimal_constant_indicator("0x") --> "0x".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Floating point numbers.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

float_number_token(Cs) -->
    integer_constant(Cs0),
    fraction(Cs1),
    exponent(Cs2),
    { append([Cs0, Cs1, Cs2], Cs) }.

fraction([C0, C1|Cs]) -->
    decimal_point_char(C0),
    decimal_digit_char(C1),
    decimal_digit_chars(Cs).

% Greedy.
exponent([C|Cs]) --> exponent_char(C), sign(Cs0), integer_constant(Cs1), !,
    { append(Cs0, Cs1, Cs) }.
exponent([]) --> [].

% Greedy.
sign([C]) --> negative_sign_char(C), !.
sign([C]) --> positive_sign_char(C), !.
sign([]) --> [].

positive_sign_char('+') --> "+".
negative_sign_char('-') --> "-".
decimal_point_char('.') --> ".".
exponent_char(C) --> [C], { member(C, "eE"), ! }.

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Double quoted strings.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

double_quoted_string(Cs) -->
    double_quote_char(_),
    double_quoted_items(Cs0),
    double_quote_char(_),
    { append(["""", Cs0, """"], Cs) }.

% Greedy.
double_quoted_items(Cs) -->
    double_quoted_item(Cs0), double_quoted_items(Cs1),
    { append(Cs0, Cs1, Cs) }.
double_quoted_items([]) --> [].

double_quoted_item([C]) --> double_quoted_character(C), !.
double_quoted_item([]) --> continuation_escape_sequence, !.

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Double quoted lists.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

double_quoted_list_token(Cs) -->
    double_quote_char(C),
    double_quoted_items(Cs),
    double_quote_char(C).

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Back quoted strings.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

back_quoted_string -->
    back_quote_char(_),
    back_quoted_items,
    back_quote_char(_).

% Greedy.
back_quoted_items --> back_quoted_item, back_quoted_items.
back_quoted_items --> [].

back_quoted_item --> back_quoted_character(_), !.
back_quoted_item --> continuation_escape_sequence, !.

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Other tokens.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

open_token --> open_char(_).
close_token --> close_char(_).
open_list_token --> open_list_char(_).
close_list_token --> close_list_char(_).
open_curly_token --> open_curly_char(_).
close_curly_token --> close_curly_char(_).
head_tail_separator_token --> head_tail_separator_char(_).
comma_token --> comma_char(_).

% The order is important.
% Greedy. TODO: Find better.
end_token, [C] --> end_char, layout_char(C), !.
end_token, "%" --> end_char, end_line_comment_char(_), !.
end_token --> end_char, [_], !, { false }.
end_token --> end_char.

end_char --> ".".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Processor character set.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

% Not greedy.
chars([]) --> [].
chars([C|Cs]) --> char(C), chars(Cs).

char(C) --> graphic_char(C), !.
char(C) --> alphanumeric_char(C), !.
char(C) --> solo_char(C), !.
char(C) --> layout_char(C), !.
char(C) --> meta_char(C), !.
char(C) --> [C], { write('Accepting: \''), write(C), write(''''), nl }.

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Graphic characters.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

graphic_char(C) --> [C], { member(C, "#$&*+-./:<=>?@^~"), ! }.

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Alphanumeric characters.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

% Greedy.
alphanumeric_chars([C|Cs]) -->
    alphanumeric_char(C),
    alphanumeric_chars(Cs).
alphanumeric_chars([]) --> [].

alphanumeric_char(C) --> alpha_char(C), !.
alphanumeric_char(C) --> decimal_digit_char(C), !.

alpha_char(C) --> underscore_char(C), !.
alpha_char(C) --> letter_char(C), !.

letter_char(C) --> capital_letter_char(C), !.
letter_char(C) --> small_letter_char(C), !.

% Greedy.
decimal_digit_chars([C|Cs]) --> decimal_digit_char(C), decimal_digit_chars(Cs).
decimal_digit_chars([]) --> [].
% Greedy.
binary_digit_chars([C|Cs]) --> binary_digit_char(C), binary_digit_chars(Cs).
binary_digit_chars([]) --> [].
% Greedy.
octal_digit_chars([C|Cs]) --> octal_digit_char(C), octal_digit_chars(Cs).
octal_digit_chars([]) --> [].
% Greedy.
hexadecimal_digit_chars([C|Cs]) -->
    hexadecimal_digit_char(C), hexadecimal_digit_chars(Cs).
hexadecimal_digit_chars([]) --> [].

small_letter_char(C) --> [C], { member(C, "abcdefghijklmnopqrstuvwxyz"), ! }.
capital_letter_char(C) --> [C], { member(C, "ABCDEFGHIJKLMNOPQRSTUVWXYZ"), ! }.
decimal_digit_char(C) --> [C], { member(C, "0123456789"), ! }.
binary_digit_char(C) --> [C], { member(C, "01"), ! }.
octal_digit_char(C) --> [C], { member(C, "01234567"), ! }.
hexadecimal_digit_char(C) --> [C], { member(C, "0123456789AaBbCcDdEeFf"), ! }.
underscore_char('_') --> "_".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Solo characters.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

solo_char(C) --> cut_char(C), !.
solo_char(C) --> open_char(C), !.
solo_char(C) --> close_char(C), !.
solo_char(C) --> comma_char(C), !.
solo_char(C) --> semicolon_char(C), !.
solo_char(C) --> open_list_char(C), !.
solo_char(C) --> close_list_char(C), !.
solo_char(C) --> open_curly_char(C), !.
solo_char(C) --> close_curly_char(C), !.
solo_char(C) --> head_tail_separator_char(C), !.
solo_char(C) --> end_line_comment_char(C), !.

cut_char('!') --> "!".
open_char('(') --> "(".
close_char(')') --> ")".
comma_char((',')) --> ",".
semicolon_char(';') --> ";".
open_list_char('[') --> "[".
close_list_char(']') --> "]".
open_curly_char('{') --> "{".
close_curly_char('}') --> "}".
head_tail_separator_char('|') --> "|".
end_line_comment_char('%') --> "%".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Layout characters.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

layout_char(C) --> space_char(C), !.
layout_char(C) --> horizontal_tab_char(C), !.
layout_char(C) --> new_line_char(C), !.

space_char(' ') --> " ".
horizontal_tab_char('\t') --> "\t".
new_line_char('\n') --> "\n".

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Meta characters.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

meta_char(C) --> backslash_char(C), !.
meta_char(C) --> single_quote_char(C), !.
meta_char(C) --> double_quote_char(C), !.
meta_char(C) --> back_quote_char(C), !.

backslash_char('\\') --> "\\".
single_quote_char('''') --> "'".
double_quote_char('"') --> """".
back_quote_char('`') --> "`".
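
Before moving on to the parser, here is what the lexer produces for a small canonical term (a usage sketch; the exact display of the chars inside the tokens may vary between systems):

?- phrase(lexer:read_term_(Ts), "'.'(a,[]).").
   Ts = [name("."),open_ct,name("a"),comma,open_list,close_list,close,end].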

Parser (parser.pl):

:- module(parser, []).

:- use_module(library(debug)).
:- use_module(library(dcgs)).
:- use_module(library(error)).
:- use_module(library(format)).
:- use_module(library(lists)).


state(S), [S] --> [S].
state(S0, S), [S] --> [S0].

print --> state(s(Ts, Ss, CsVs)), { format("~q ~q ~q\n", [Ts, Ss, CsVs]) }.
print(Cs) -->
    state(s(Ts, Ss, CsVs)),
    { format("~s ~q ~q ~q\n", [Cs, Ts, Ss, CsVs]) }.


lookahead(T) --> state(s([T|_], _, _)).

shift --> state(s([name(Cs)|Ts], Ss, CsVs), s(Ts, [term(atom,A)|Ss], CsVs)),
    { atom_chars(A, Cs) }.
shift -->
    state(
        s([open_list,close_list|Ts], Ss, CsVs),
        s(Ts, [term(atom,[])|Ss], CsVs)
    ).
shift -->
    state(
        s([open_curly,close_curly|Ts], Ss, CsVs),
        s(Ts, [term(atom,{})|Ss], CsVs)
    ).
shift -->
    state(s([variable(Cs)|Ts], Ss, CsVs0), s(Ts, [term(variable,V)|Ss], CsVs)),
    { variable(Cs, CsVs0, V, CsVs) }.
shift -->
    state(
        s([integer(Cs)|Ts], [term(atom,-)|Ss], CsVs),
        s(Ts, [term(integer,N)|Ss], CsVs)
    ), !,
    { number_chars(N, [-|Cs]) }.
shift -->
    state(
        s([float_number(Cs)|Ts], [term(atom,-)|Ss], CsVs),
        s(Ts, [term(float,N)|Ss], CsVs)
    ), !,
    { number_chars(N, [-|Cs]) }.
shift -->
    state(
        s([integer(Cs)|Ts], Ss, CsVs),
        s(Ts, [term(integer,N)|Ss], CsVs)
    ),
    { number_chars(N, Cs) }.
shift -->
    state(
        s([float_number(Cs)|Ts], Ss, CsVs),
        s(Ts, [term(float,N)|Ss], CsVs)
    ),
    { number_chars(N, Cs) }.
shift -->
    state(
        s([open_ct|Ts], [term(atom,A)|Ss], CsVs),
        s(Ts, [compound(A,As,As)|Ss], CsVs)
    ), !.
shift --> state(s([open|Ts], Ss, CsVs), s(Ts, [open|Ss], CsVs)).
shift --> state(s([open_ct|Ts], Ss, CsVs), s(Ts, [open_ct|Ss], CsVs)).
% /*
shift -->
    state(
        s([close|Ts], [term(_,T),dot(0,Cs0)|Ss], CsVs),
        s(Ts, [term(chars, Cs)|Ss], CsVs)
    ),
    { append(Cs0, T, Cs) }.
shift -->
    state(
        s([close|Ts], [term(_,T),dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N, N0), append(Cs0, T, Cs) }.
shift -->
    state(
        s([close|Ts], [dot(0,Cs)|Ss], CsVs),
        s(Ts, [term(chars,Cs)|Ss], CsVs)
    ).
shift -->
    state(
        s([close|Ts], [dot(N0,Cs)|Ss], CsVs),
        s(Ts, [dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N, N0) }.
% */
shift -->
    state(
        s([close|Ts], [term(_,T0),compound(A,As,[T0])|Ss], CsVs),
        s(Ts, [term(compound,T)|Ss], CsVs)
    ),
    { T =.. [A|As] }.
shift -->
    state(
        s([close|Ts], [term(_,T),open|Ss], CsVs),
        s(Ts, [term(compound,T)|Ss], CsVs)
    ).
shift -->
    state(
        s([close|Ts], [term(_,T),open_ct|Ss], CsVs),
        s(Ts, [term(compound,T)|Ss], CsVs)
    ).
% /*
shift -->
    state(
        % s([comma|Ts], [term(atom,A),compound('.',As,As))|Ss], CsVs),
        s([comma|Ts], [term(atom,A),compound('.',As,_)|Ss], CsVs),
        s(Ts, [dot(0,[A])|Ss], CsVs)
    ),
    % { acyclic_term(As), atom_length(A, 1) },
    { var(As), atom_length(A, 1) },
    reduce, !.
shift -->
    state(
        s([comma|Ts], [term(_,T),dot(0,[A])|Ss], CsVs),
        s(Ts, [compound('.',[A,T|As],As)|Ss], CsVs)
    ).
shift -->
    state(
        s([comma|Ts], [term(_,T),dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [compound('.',[A,T|As],As),dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N, N0), append(Cs, [A], Cs0) }.
shift -->
    state(
        s([comma|Ts], [dot(0,[A|Cs])|Ss], CsVs),
        s(Ts, [compound('.',[A,Cs|As],As)|Ss], CsVs)
    ).
shift -->
    state(
        s([comma|Ts], [dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [compound('.',[A,Cs1|As],As),dot(N,Cs)|Ss], CsVs)
    ),
    {   succ(N, N0),
        length(Cs, N0),
        append(Cs, [A|Cs1], Cs0)
    }.
% */
shift -->
    state(
        s([comma|Ts], [term(_,T),compound(A,As0,[T|As])|Ss], CsVs),
        s(Ts, [compound(A,As0,As)|Ss], CsVs)
    ).

reduce -->
    state(
        s(Ts, [dot(0,[A]),dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N0, N), append(Cs0, [A], Cs) }, !.
reduce --> [].



read_term_ -->
    % print,
    lookahead(end), !,
    accept.
read_term_ -->
    shift, !,
    read_term_.

accept --> state(s([end], [term(_,_)], _)), !.
accept -->
    print("End"), { false }.
    % { throw(error(reduction(incomplete), read_term//0)) }.

variable("_", CsVs, _, CsVs) :- !.
variable(Cs, CsVs, V, CsVs) :-
    member(Cs-V, CsVs), !.
variable(Cs, CsVs, V, [Cs-V|CsVs]).

succ(X, S) :-
    can_be(not_less_than_zero,X),
    can_be(not_less_than_zero,S),
    ( nonvar(X) -> S is X+1 ; X is S-1, X >= 0 ).

sml(S, M, Xs0, Xs) :-
    '$skip_max_list'(S, M, Xs0, Xs).

The initialization file init.pl:

:- use_module(library(debug)).
:- use_module(library(charsio)).
:- use_module(library(dcgs)).
:- use_module(library(dif)).
:- use_module(library(lists)).
:- use_module(library(iso_ext)).
:- use_module(library(pairs)).
:- use_module(library(pio)).
:- use_module(library(format)).
:- use_module(lexer).
:- use_module(parser).

parse(Cs, Ts, Ss0, S) :-
    phrase(lexer:read_term_(Ts), Cs),
    Ss0 = [s(Ts, [], [])|_],
    phrase(parser:read_term_, Ss0, [S]).

sample(a) --> [a].
sample([]) --> [[]].
sample('.'(E)) --> ['.'], sample(E).
sample('.'(E,Es)) --> ['.'], sample(E), sample(Es).
sample('.'(E,F,G)) --> ['.'], sample(E), sample(F), sample(G).
sample('.'(E,F,G,H)) --> ['.'], sample(E), sample(F), sample(G), sample(H).

generate(T, N) :-
    length(Cs, N),
    phrase(sample(T), Cs).

test :-
    once(generate('.'(a,'.'(a,'.'(a,'.'(a,[]))),a), _)),
    % N = 72849,
    call_nth(user:generate(T, _), N),
    ( N mod 2^10 =:= 0, writeq(N), nl, false ; true ),
    % T = '.'('.'(a,[]),[]),
    % T = '.'('.'(aa,'.'(b,'.'(c,[]))),[]),
    write_term_to_chars(T, [quoted(true),ignore_ops(true)], Cs0),
    append(Cs0, ".", Cs),
    (   parse(Cs, _, _, _) ->
        parse(Cs, Ts, Ss, S),
        S = s(_,[term(_,T0)],_),
        T0 \== T,
        format("N = ~q,\nCs = ~q,\nTs = ~q,\nSs = ~q,\nS = ~q.\n\n", [N,Cs,Ts,Ss,S])
    ;   format("N = ~q,\nCs = ~q,\nT = ~q.\n\n", [N,Cs,T])
    ),
    halt.
test :-
    halt.

dif(V, S0, V) :-
    dif(V, S0).

ns_ --> "(".
ns_ --> ")".
ns_ --> "'.'".
ns_ --> "a".

ns --> [].
ns --> ns_, ns.

nonsense(Cs) :-
    length(Cs, _),
    [C0|Cs0] = Cs,
    foldl(dif, Cs0, C0, _),
    phrase(ns, Cs).

nonsense :-
    call_nth(nonsense(Cs0), N),
    ( N mod 2^10 =:= 0, writeq(N), nl, false ; true ),
    append(Cs0, ".", Cs),
    (   parse(Cs, Ts, Ss, S),
        \+ catch(read_from_chars(Cs, _), error(syntax_error(_),_), false),
        format("N = ~q,\nCs = ~q,\nTs = ~q,\nSs = ~q,\nS = ~q.\n\n", [N,Cs,Ts,Ss,S])
    ;   catch(read_from_chars(Cs, T), error(syntax_error(_),_), false),
        \+ parse(Cs, Ts, Ss, S),
        \+ member(N, [161,749,822,3819]),
        format("N = ~q,\nCs = ~q,\nT = ~q.\n\n", [N,Cs,T])
    ),
    halt.
nonsense :-
    halt.

Run it with Scryer Prolog using scryer-prolog init.pl -g test. It can also be queried directly, for example ?- parse("'.'(a,[]).", Ts, Ss, S). By uncommenting print//0 in read_term_//0 in parser.pl (or by using state(S0, S), [S] --> [S0,S].), the transitions can be visualized.
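
For instance, the query above should end in a final state that carries the chars (a sketch of the expected result; bindings for Ts and Ss are omitted, and the display of partial strings may vary):

?- parse("'.'(a,[]).", Ts, Ss, S).
   ...
   S = s([end],[term(chars,"a")],[]).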

In the step-by-step trace, dot/2 and term(chars,_) use a compact internal representation.

Some information is not visible in this display (the compact representation can't be seen but is present), for example when parsing '.'('.'(a,[]),[]).:

...
[close,comma,open_list,close_list,close,end] [term(atom,[]),dot(0,"a"),compound('.',A,A)] []
[comma,open_list,close_list,close,end] [term(chars,"a"),compound('.',A,A)] []
[open_list,close_list,close,end] [compound('.',["a"|A],A)] [] % "a" still compactly stored.
...

Some grammar rules of shift//0 can be commented out to fall back to the unoptimized parser.

The space complexity is O(ln N) where N is the size of the list of characters (not the number of tokens nor the size of the input).

In the worst case ([], '.'(a,a([])), '.'(a,a('.'(a,a([])))), etc), the space complexity is O(N ln N) where N is the number of tokens.

In the best case ([], '.'(a,[]), '.'(a,'.'(a,[])), etc), the space complexity is O(ln N) where N is the number of tokens.

notoria
  • The enumeration in `sample/2` is inefficient, as the very same terms are generated over and over again. Instead use a difference for the bound which assures that each answer/solution is unique. – false Sep 01 '22 at 06:53
  • minor: use [`succ/2`](https://www.complang.tuwien.ac.at/ulrich/iso-prolog/prologue#succ) – false Sep 01 '22 at 06:57
  • I will try to make `sample/2` efficient. – notoria Sep 01 '22 at 07:11
  • Scryer doesn't have `succ/2`. – notoria Sep 01 '22 at 07:20
  • `succ(X, S) :- can_be(not_less_than_zero,X), can_be(not_less_than_zero,S), ( nonvar(X) -> S is X+1 ; X is S-1, X >= 0 ).` – false Sep 01 '22 at 07:35

The order of these clauses is important because of the cuts and the general rule for regular compound terms.

shift --> % 0
    state(
        s([close|Ts], [term(_,T),dot(0,Cs0)|Ss], CsVs),
        s(Ts, [term(chars, Cs)|Ss], CsVs)
    ),
    { append(Cs0, T, Cs) }.
shift --> % 1
    state(
        s([close|Ts], [term(_,T),dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N, N0), append(Cs0, T, Cs) }.
shift --> % 2
    state(
        s([close|Ts], [dot(0,Cs)|Ss], CsVs),
        s(Ts, [term(chars,Cs)|Ss], CsVs)
    ).
shift --> % 3
    state(
        s([close|Ts], [dot(N0,Cs)|Ss], CsVs),
        s(Ts, [dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N, N0) }.
shift --> % 4
    state(
        % s([comma|Ts], [term(atom,A),compound('.',As,As))|Ss], CsVs),
        s([comma|Ts], [term(atom,A),compound('.',As,_)|Ss], CsVs),
        s(Ts, [dot(0,[A])|Ss], CsVs)
    ),
    % { acyclic_term(As), atom_length(A, 1) },
    { var(As), atom_length(A, 1) },
    reduce, !.
shift --> % 5
    state(
        s([comma|Ts], [term(_,T),dot(0,[A])|Ss], CsVs),
        s(Ts, [compound('.',[A,T|As],As)|Ss], CsVs)
    ).
shift --> % 6
    state(
        s([comma|Ts], [term(_,T),dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [compound('.',[A,T|As],As),dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N, N0), append(Cs, [A], Cs0) }.
shift --> % 7
    state(
        s([comma|Ts], [dot(0,[A|Cs])|Ss], CsVs),
        s(Ts, [compound('.',[A,Cs|As],As)|Ss], CsVs)
    ).
shift --> % 8
    state(
        s([comma|Ts], [dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [compound('.',[A,Cs1|As],As),dot(N,Cs)|Ss], CsVs)
    ),
    {   succ(N, N0),
        length(Cs, N0),
        append(Cs, [A|Cs1], Cs0)
    }.

reduce -->
    state(
        s(Ts, [dot(0,[A]),dot(N0,Cs0)|Ss], CsVs),
        s(Ts, [dot(N,Cs)|Ss], CsVs)
    ),
    { succ(N0, N), append(Cs0, [A], Cs) }, !.
reduce --> [].

succ(X, S) :-
    can_be(not_less_than_zero,X),
    can_be(not_less_than_zero,S),
    ( nonvar(X) -> S is X+1 ; X is S-1, X >= 0 ).

This is the part added to the parser. It has been simplified a bit.

This parser is optimistic and assumes a list of characters is to be parsed when it sees '.'( followed by a character.

For an input like '.'(a,'.'(a,b,b),b), the parser will first build "aa" (a chars) and then undo it step by step to build the term. The first undo builds '.'(a,b,b), but the parser is still optimistic since [a|'.'(a,b,b)] could still be built, so "a" remains a chars. Then an undo is done again to build '.'(a,'.'(a,b,b),b).

There is also the case where the input is '.'(a,'.'(b,[]),c): the parser builds "ab", then needs to undo to build '.'(a,"b",c). The chars "ab" is split, with "b" remaining a chars.

A more complicated input is '.'(a,'.'(b,'.'(c,[]),d)): the parser builds "abc", then undoes it by splitting into "a", '.'(b and "c". Now the parser builds '.'(b,"c",d), with "c" being a chars and "a" also still being a chars. Finally, it finishes parsing to [a|'.'(b,"c",d)], with [a| and "c" being chars.
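
For reference, the final term of this last example can be checked directly (assuming the double_quotes flag is set to chars, so that "c" denotes [c]):

?- '.'(a,'.'(b,'.'(c,[]),d)) == [a|'.'(b,"c",d)].
   true.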

It starts with % 4: if the compound term with atom '.' doesn't have any arguments yet and the atom on top of it is a character, then it assumes a list of characters will be parsed and stores the character efficiently (dot(0,[A])). Then reduce//0 merges the dots; in a procedural language the stack could be inspected directly instead of using reduce as done here.

A finalized chars is built after % 0 or % 2.

The splitting is done with % 6, % 7 and % 8.

The shift % 5 handles '.'(a,T,: it turns the chars "a" back into that open compound.

The shift % 3 confirms how much of a chars is actually a chars. In dot(N,Cs), the part counted by N may not be a chars, but the tail is.

The shift % 1 inserts the tails and confirms by decreasing N0.

notoria
  • Clearly a step ahead! I more and more believe that the cleanest way would be to solve this in the tokenizer. – false Sep 07 '22 at 06:05
  • A [lexerless parser](https://en.wikipedia.org/wiki/Scannerless_parsing)? When the parser sees `'.'(a,'.'(a,`, it isn't certain that the result will be a `chars` since `b,b),b)` can follow. So here, after 8 tokens the parser still can't decide. – notoria Sep 07 '22 at 07:06
  • The tokenizer could count the number of missing closing round brackets. Thus it can detect a canonical string just during parsing, and for more complex cases it could produce one (kind of) list prefix as a token to be left for the actual parser. Somewhat similar to the way [constant space writing](https://stackoverflow.com/a/59835953/772868) is done. – false Sep 07 '22 at 07:15
  • Possible, but it will be like `dot(N,Cs)` since `N+1` is the number of parentheses needed to confirm the first part. – notoria Sep 07 '22 at 07:45
  • Right, so strictly speaking, it's not about constant space but rather logarithmic, except that a significant growth is not observable to us mortals in this tiny finite world. – false Sep 07 '22 at 08:36