Prolog String-Intensive Algorithm Crashes

Question

I am writing an algorithm to process a string, and it crashes (possibly due to a quirk in Prolog which makes string-intensive algorithms crash). How can I modify the algorithm so that it doesn't crash?

The algorithm replaces each of ", “, ”, ‘ and ’ with ', deletes every backslash and every "- " (a hyphen followed by a space), and then splits the string on blank lines (\n\n).

It reads these input files from raw_sources/:

`1.txt`:

a
a

B
b
b

C
c
c

`2.txt`:

“”‘’'"
\\
- 


b

And writes these output files to sources/:

`1.txt`:

["a
a","B
b
b","C
c
c"]

`2.txt`:

["''''''","
b"]

The query: sheet_feeder(_).

The code so far:

sheet_feeder(T) :-
    directory_files("raw_sources/",F),
    delete_invisibles_etc(F,G),
    findall(K1,(member(H,G),        
    atom_concat('raw_sources/',H,String00b),
    phrase_from_file(string(String001), String00b),
    string_codes(String000,String001),
    string_concat(String000,"\n\n",String00_a),
    strip_illegal_chars(String00_a,"",String00),
        split_on_substring(String00,"\n\n",[],J1),
        delete(J1,"",K1),
        term_to_atom(K1,K),
        string_concat("sources/",H,String00bb),
    (open(String00bb,write,Stream1),
    write(Stream1,K),
    close(Stream1))
        ),T).

delete_invisibles_etc(F,G) :-
    findall(J,(member(H,F),
    atom_string(H,J),
    not(J="."),not(J=".."),not(string_concat(".",_,J))),G).

string(String) --> list(String).

list([]) --> [].
list([L|Ls]) --> [L], list(Ls).
    
strip_illegal_chars("",A,A) :- !.
strip_illegal_chars(A,B,E) :-
    string_concat(E1,D,A),
    string_length(E1,1),
    E1="\\",
    string_concat(B,"",F),
    strip_illegal_chars(D,F,E),!.
strip_illegal_chars(A,B,E) :-
    string_concat(E1,D,A),
    string_length(E1,2),
    E1="- ",
    string_concat(B,"",F),
    strip_illegal_chars(D,F,E),!.
strip_illegal_chars(A,B,E) :-
    string_concat(E1,D,A),
    string_length(E1,1),
    ((E1="\"" -> true;
    (E1="“" -> true;
    (E1="”" -> true;
    (E1="‘" -> true;
    (E1="’" -> true;
    (E1="'"))))))),
    string_concat(B,"'",F),
    strip_illegal_chars(D,F,E),!.
strip_illegal_chars(A,B,E) :-
    string_concat(C,D,A),
    string_length(C,1),
    string_concat(B,C,F),
    strip_illegal_chars(D,F,E),!.
    
split_on_substring([],_A,E,[E]) :- !. %% ***?
split_on_substring(A,B,E,C) :-
    append(B,D,A),
    split_on_substring(D,B,[],C1),
    string_codes(E1,E),
    append([E1],C1,C),!.
split_on_substring(A,B,E1,C) :-
    length(E,1),
    append(E,D,A),
    append(E1,E,E2),
    split_on_substring(D,B,E2,C),!.

If you are using SWI-Prolog, it has bindings for PCRE; just use that if it is good enough. If you cannot use regular expressions, you need to actually parse the input. — TA_intern, May 07 '21 at 07:01
It is simpler to use replacement as it does. The algorithm just replaces characters from input. — Lucian Green, May 12 '21 at 03:59

Lucian Green · Answer 1 · 2021-05-12T04:00:36.347

I stored the text as a list of character codes instead of a string, which overcame the performance problem. Apparently, string_concat builds a new copy of the strings involved on every call, whereas append on a code list shares the tail of the list and copies only the part in front of it.
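To make the point about sharing concrete, here is a minimal sketch, assuming SWI-Prolog. The predicates peel/3 and peel_shares_tail/1 are hypothetical names used only for this illustration. The check relies on same_term/2, which succeeds only when both arguments are the same term in memory.

```prolog
% Peel the first code off a code list, using the same
% length/2 + append/3 pattern as the clauses in this answer.
peel(Codes, First, Rest) :-
    length(Front, 1),
    append(Front, Rest, Codes),
    Front = [First].

% Succeeds if Rest is physically the tail of Codes,
% i.e. peeling one code off the front copied nothing.
peel_shares_tail(Codes) :-
    peel(Codes, _, Rest),
    Codes = [_|Tail],
    same_term(Rest, Tail).
```

A query such as `atom_codes(abc, Cs), peel_shares_tail(Cs).` should succeed, which suggests that each step over a code list costs constant time, while the string version has to build a new string for the remainder at every step.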

The solution:

sheet_feeder(T) :-
    directory_files("raw_sources/",F),
    delete_invisibles_etc(F,G),
    findall(K1,
        (   member(H,G),
            atom_concat('raw_sources/',H,String00b),
            % Read the whole file as a list of character codes.
            phrase_from_file(string(String001),String00b),
            % Terminate the last block so that the splitter emits it.
            append(String001,`\n\n`,String00_a),
            strip_illegal_chars(String00_a,[],String00),
            split_on_substring(String00,`\n\n`,[],J1),
            delete(J1,"",K1),
            term_to_atom(K1,K),
            string_concat("sources/",H,String00bb),
            open(String00bb,write,Stream1),
            write(Stream1,K),
            close(Stream1)
        ),
        T).

delete_invisibles_etc(F,G) :-
    findall(J,(member(H,F),
    atom_string(H,J),
    not(J="."),not(J=".."),not(string_concat(".",_,J))),G).

string(String) --> list(String).

list([]) --> [].
list([L|Ls]) --> [L], list(Ls).
    
strip_illegal_chars([],A,A) :- !.
strip_illegal_chars(A,B,E) :-
    length(E1,1),
    append(E1,D,A),
    E1=[92],
    append(B,``,F),
    strip_illegal_chars(D,F,E),!.
strip_illegal_chars(A,B,E) :-
    length(E1,2),
    append(E1,D,A),
    E1=`- `,
    append(B,``,F),
    strip_illegal_chars(D,F,E),!.
strip_illegal_chars(A,B,E) :-
    length(E1,1),
    append(E1,D,A),
    ((E1=`"` -> true;
    (E1=[8220] -> true;
    (E1=[8221] -> true;
    (E1=[8216] -> true;
    (E1=[8217] -> true;
    (E1=`'`))))))),
    append(B,`'`,F),
    strip_illegal_chars(D,F,E),!.
strip_illegal_chars(A,B,E) :-
    length(C,1),
    append(C,D,A),
    append(B,C,F),
    strip_illegal_chars(D,F,E),!.
    
split_on_substring([],_A,E,E) :- !. % Input exhausted; E is [] here because the input always ends in "\n\n".
split_on_substring(A,B,E,C) :-
    append(B,D,A),
    split_on_substring(D,B,[],C1),
    string_codes(E1,E),
    append([E1],C1,C),!.
split_on_substring(A,B,E1,C) :-
    length(E,1),
    append(E,D,A),
    append(E1,E,E2),
    split_on_substring(D,B,E2,C),!.
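The code above still calls append(B,C,F) with the accumulator B as first argument, which copies B at every step. As a possible alternative (a sketch only, not the code this answer used, and assuming SWI-Prolog), both steps can be written by matching on the head of the code list and building the output as they go. The names strip_codes/2, quote_code/1, take_block/3, split_blocks/2 and clean_blocks/2 are hypothetical. The numeric codes are 92 for the backslash, 45 and 32 for hyphen and space, 10 for newline, 34 and 39 for " and ', and 8216, 8217, 8220, 8221 for the curly quotes.

```prolog
% strip_codes(+Codes, -Clean): one left-to-right pass over a code list.
strip_codes([], []).
strip_codes([92|T], Out) :- !,            % drop a backslash
    strip_codes(T, Out).
strip_codes([45,32|T], Out) :- !,         % drop "- " (hyphen + space)
    strip_codes(T, Out).
strip_codes([C|T], [39|Out]) :-           % turn any quote into '
    quote_code(C), !,
    strip_codes(T, Out).
strip_codes([C|T], [C|Out]) :-            % keep everything else
    strip_codes(T, Out).

quote_code(34).     % "
quote_code(39).     % '
quote_code(8216).   % left single curly quote
quote_code(8217).   % right single curly quote
quote_code(8220).   % left double curly quote
quote_code(8221).   % right double curly quote

% take_block(+Codes, -Block, -Rest): Block is everything before the
% first "\n\n", or up to the end of the input if there is none.
take_block([], [], []).
take_block([10,10|Rest], [], Rest) :- !.
take_block([C|Cs], [C|Block], Rest) :-
    take_block(Cs, Block, Rest).

% split_blocks(+Codes, -Blocks): split on "\n\n" into strings,
% skipping empty blocks.
split_blocks([], []).
split_blocks([C|Cs], Blocks) :-
    take_block([C|Cs], BlockCodes, Rest),
    (   BlockCodes == []
    ->  Blocks = Blocks1
    ;   string_codes(S, BlockCodes),
        Blocks = [S|Blocks1]
    ),
    split_blocks(Rest, Blocks1).

% clean_blocks(+Codes, -Blocks): strip, then split.
clean_blocks(Codes, Blocks) :-
    strip_codes(Codes, Clean),
    split_blocks(Clean, Blocks).
```

For example, calling clean_blocks/2 on the codes of "a\"b- c\n\nd" yields the blocks "a'bc" and "d". Because the split keeps whatever is left at the end of the input as a final block, this version does not need the extra "\n\n" appended in sheet_feeder.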

Given that lists of codes (or chars) use so much more memory than strings, I wonder what problem you fixed... and of course the SWI-Prolog maintainers would be interested as well, I think... — CapelliC, May 07 '21 at 06:44
I got the idea by reading about the new command with_output_to on `https://github.com/kamahen/swipl-server-js-client/blob/master/simple_server.pl` (retweeted by SWI_Prolog) which addresses the performance issue. — Lucian Green, May 11 '21 at 12:45
I found a better way to replace strings at https://stackoverflow.com/questions/26973951/replacing-white-spaces-in-prolog#26974021 — Lucian Green, Mar 13 '22 at 01:33
