3

I would like to find the most elegant and algorithmically efficient method to count occurrences of certain patterns in SWI-Prolog.

For now, my solution uses DCG and looks like this:

count_occurrences(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    length(PatternList, PatternLength),
    length(List, ListLength),
    MaxIndex is ListLength - PatternLength,
    findall(1, (
        between(0, MaxIndex, Index),
        sublist(List, Index, PatternLength, PatternList)
    ), Counts),
    sum_list(Counts, Count).

sublist(List, Index, Length, Sublist) :-
    length(Sublist, Length),
    append(Prefix, Rest, List),
    length(Prefix, Index),
    append(Sublist, _, Rest).

rep(S, L) --> [], {L is 0} | [S], {L is 1} | [S], { L_1 is L - 1 }, rep(S, L_1), !.
open_rep(S, L) --> [v], rep(S, L), [v], !.

The result is this:

1 ?- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,v,v], Count).
Count = 3.

2 ?- count_occurrences(open_rep(n, 3), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
Count = 1.

3 ?- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
Count = 2.

I am satisfied with this result, but I don't think it is the most efficient method, and I would like some help making it faster to compute. I think that using append and findall is computationally expensive, but I can't figure out how to do the same without them...

The DCG shown here is just an example, but I have to count the occurrences of patterns in a list and give them a score. This is in the context of implementing an AI for Gomoku using Alpha-Beta Pruning. Since the board is evaluated frequently, algorithmic complexity matters in order to reduce the time it takes for the AI to take an action.

I have tried multiple variations of the code shown above, but they all use the findall predicate, and the best way I have found to reduce the computation time is to fail early.

  • Too vague. "The DCG shown here is just an example" - give us a sensible example. – brebs Mar 03 '23 at 22:13
  • Covering this example only should be sufficient, since my DCG expressions are used for recognizing simple Gomoku line patterns such as: `──◯─●─●─●─●───` (closed four in a row), `──●─●───●─●───` (semi-open alignment of four), `──●─●─●─●─────` (open four in a row)... – René Chenard Mar 04 '23 at 15:00
  • if it were _sliding window_, `open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v]` would return 5, not 2. – Will Ness Mar 04 '23 at 16:07
  • @WillNess Using `open_rep(n, 1)` on `[v,n,v,n,v,v,v,v,v,v,n,n,n,v,v]` is the equivalent of doing a convolution of `[v,n,v]` over `[v,n,v,n,v,v,v,v,v,v,n,n,n,v,v]`... Which would result in `[1,0,1,0,0,0,0,0,0,0,0,0,0,0]`. Then summing the content of this array, should result in `2`, not `5`... – René Chenard Mar 04 '23 at 20:43
  • It is still not clear why you are writing all this code to achieve something so small. There is something missing from your question presumably. – TA_intern Mar 05 '23 at 14:15
  • For those who are interested in my latest solution, here is [the link](https://github.com/RECHE23/SWI-Prolog-Gomoku) to my full implementation of Gomoku in SWI-Prolog. I intend to refine my code and implement faster algorithms and better heuristics... I will probably even try to implement some machine learning or reinforcement learning methods eventually. – René Chenard Mar 13 '23 at 23:50

5 Answers

3

Something is missing from your question, but I can't tell what. I am also surprised by the answers.

Not sure if you must use lists or if you can use atoms or strings. Atoms and strings have sub_atom/5 and sub_string/5, respectively, which can be used for searching. Like this:

?- sub_atom(vnvnvvvvvvnvv, Before, Len, After, vnv).
Before = 0, Len = 3, After = 10 ;
Before = 2, Len = 3, After = 8 ;
Before = 9, Len = 3, After = 1 ;
false.

The same but with strings works with sub_string/5.

And if you are already using SWI-Prolog, you can use library(aggregate) to count solutions:

?- aggregate_all(count, sub_atom(vnvnvvvvvvnvv, _,_,_, vnv), N).
N = 3.

?- aggregate_all(count, sub_atom(vnvnvvvvvvnnnvv, _,_,_, vnv), N).
N = 2.

So far you don't need to program anything. If you must use lists, the two appends are enough:

?- aggregate_all(
      count,
      ( append(_, Back, [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v]),
        append([v,n,n,n,v], _, Back)
      ),
      N).
N = 1.

And remember that you also have the regular expression library library(pcre). If you use the example code from the docs:

:- use_module(library(pcre)).

re_match_count(Regex, String, N) :-
    re_foldl(inc, Regex, String, 0, N, []).

inc(_, V0, V) :- V is V0 + 1.

you can also count, using either atoms or strings:

?- re_match_count('vn{1}(?=v)', vnvnvvvvvvnvv, N).
N = 3.

?- re_match_count("vn{3}(?=v)", "vnvnvvvvvvnnnvv", N).
N = 1.

(Note: these regexes use "zero-width positive lookahead". They were initially borked by me, and fixed after Will Ness noticed the mistake).

In this example all information about the match itself is discarded and only the number of matches is counted. You can do whatever you like with the match information via the first argument of the goal passed to re_foldl/6.
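For instance, you could collect the matches themselves instead of counting them. Here is a small sketch (re_matches/3 is my own name, not a library predicate; the folded goal receives a dict whose key 0 holds the full match):

```prolog
:- use_module(library(pcre)).

% Collect every (non-overlapping) match as a string.
% Key 0 of the match dict is the full match.
re_matches(Regex, String, Matches) :-
    re_foldl([Dict, Acc, [M|Acc]]>>get_dict(0, Dict, M),
             Regex, String, [], Rev, []),
    reverse(Rev, Matches).
```

For example, `re_matches("vn+v", "vnvnvvvvvvnnnvv", Ms)` should give `Ms = ["vnv", "vnnnv"]`. Note that, unlike the sub_atom/5 approach, regex scanning does not report overlapping matches.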

Note: you need to measure what works fastest for you. If there is a way to read and keep the strings as Prolog strings, and use library(pcre), that might be faster than other methods.

Finally, if you want to use a DCG for the pattern recognition, you can use the following code to "match anywhere":

:- use_module(library(dcg/basics)). % provides string//1 and remainder//1

match_dcg(G) --> string(_), call(G), remainder(_).

This now completely ignores what comes before and after, or the position of the match within the list. You can use it like this, with your own open_rep//2:

?- aggregate_all(
       count,
       phrase(match_dcg(open_rep(n, 1)), [v,n,v,n,v,v,v,v,v,v,n,v,v]),
       N).
N = 3.

The difference between match_dcg//1 as defined here and the match//1 in the example in the docs to phrase_from_file/2 is that match_dcg//1 takes another DCG as an argument, not a pattern (substring).

TA_intern
  • is it `re_match_count('vn{1}v', vnvnvvvvvvnvv, N), N=2`? the OP has `N=3`. as does your `aggregate_all(count, sub_atom(vnvnvvvvvvnvv, _,_,_, vnv), N)`. – Will Ness Mar 06 '23 at 09:38
  • @WillNess Now I think I got it. Thank you for pointing out the mistake! – TA_intern Mar 06 '23 at 10:18
  • I ended up using something else, but I found your answer very instructive! Thank you! I have tried using regular expressions, but my SWI-Prolog setup doesn't seem to have any regex library... Do you know how I can get access to it? `:- use_module(library(pcre)).` doesn't do the trick... – René Chenard Mar 10 '23 at 00:19
  • @RenéChenard It should just work. You can install SWI-Prolog in many ways, you should say how you did it. I use the precompiled binaries on MacOS and I compile from source on Linux, both methods work for me. – TA_intern Mar 10 '23 at 05:11
  • `SWI-Prolog version 9.0.4 for arm64-darwin` I have installed it using homebrew. – René Chenard Mar 10 '23 at 13:26
  • @RenéChenard Homebrew might be the problem. Not sure if you get everything with the same package, maybe you need an extra package for PCRE. I have been going here: https://www.swi-prolog.org/download/devel and downloading the image with the relocatable application bundle that has the fat binaries for both intel and amd64. It is "manual work" (downloading from browser, dragging, dropping...) but at least I know I get everything, and it works as it should. On Linux on the other hand it is easier to just compile it yourself. – TA_intern Mar 10 '23 at 13:36
3

Applying empirical orders-of-growth analysis, you can see that your code exhibits quadratic behavior, while the various answers' code variants are linear.

Why is that?

It is because you insist on forcing the infix to start at a known index, increasing the index one by one. Then the infix string is found by the pair of appends, each time starting afresh from the input list's very beginning.

Thus the work is done in the classic triangular fashion (as is also illustrated in the linked entry), causing the problematic quadratic behavior.

But the two appends will do this on their own, and moreover they restart at the last processed index rather than at index 0. Instead of starting from 0 and walking along the input list to a given index i, they simply continue from i directly, which gives linear execution time.

To achieve this with your code, you just need to delete every bit dealing with index manipulation, and it will magically turn into the efficient variant of what you see in the other answers.

This gets you

count_occurrences(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    /*length(PatternList, PatternLength),
    length(List, ListLength),
    MaxIndex is ListLength - PatternLength,*/
    findall(1, (
        /*between(0, MaxIndex, Index),*/
        sublist(List, /*Index, PatternLength,*/ PatternList)
    ), Counts),
    sum_list(Counts, Count).

sublist(List, /*Index, Length,*/ Sublist) :-
    /*length(Sublist, Length),*/
    append(_Prefix, Rest, List),
    /*length(Prefix, Index),*/
    append(Sublist, _, Rest).

which checks out linear, as expected:

167 ?- random_list(1000,L),time(count_occurrences(open_rep(n,3),L,C)).
% 3,094 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
L = [v, n, n, v, v, v, n, n, n|...],
C = 24.

168 ?- random_list(10000,L),time(count_occurrences(open_rep(n,3),L,C)).
% 30,514 inferences, 0.000 CPU in 0.003 seconds (0% CPU, Infinite Lips)
L = [n, n, v, v, v, v, v, n, v|...],
C = 280.

Tested using random_list/2 from another answer here.

Now you can fuse your calls to findall and sum_list into one counting predicate, as seen in slago's answer. Be sure to test both approaches, though, with your actual data: using library predicates might still be faster. So, do also try the aggregate_all approach from TA_intern's answer.
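As a sketch of that fusion (count_occurrences2/3 is just my name for it, and it is not benchmarked here), the de-indexed two-append search combines directly with aggregate_all/3:

```prolog
% Fused variant: the two appends enumerate every occurrence on
% backtracking, and aggregate_all(count, ...) tallies them without
% building an intermediate list of 1s.
count_occurrences2(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    aggregate_all(count,
                  ( append(_Prefix, Rest, List),
                    append(PatternList, _Suffix, Rest) ),
                  Count).
```

With the question's open_rep//2, `count_occurrences2(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,v,v], Count)` gives `Count = 3`, matching the original.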

Demand and control less, achieve more! Or,

Premature ~~optimization~~ implementation is the root of all evil.

Will Ness
  • Great answer! Small addition: depending on the pattern that has to be matched, there might be better methods than what sub_atom or the two appends do. For example, the [KMP algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) is more efficient than a linear search that restarts one position after a failed match, and has better worst-case performance. It doesn't work for DCG matching, though, at least not without some clever adjustments. – TA_intern Mar 09 '23 at 13:49
  • @TA_intern thanks, :) Of course, but getting linear is greatest ROI here as it is so easy. The "empirical orders of growth" approach is something I value greatly. (Have you looked at the linked entry? It might be a fun read :) ) – Will Ness Mar 09 '23 at 17:11
  • I admit I just assumed what it means. Now I did read it :-) – TA_intern Mar 09 '23 at 20:27
  • I ended up using a sliding window approach on a clumped array, which I believe should be of linear complexity... The result is great! It's much faster than the first solution I found! I'll publish my finished implementation of Gomoku very soon. I'll let you know if you are interested. – René Chenard Mar 10 '23 at 00:26
  • @RenéChenard no, don't use RLE "clumped". it is a wrong approach which doesn't always work, as the example [in my comment](https://stackoverflow.com/questions/75630809/sliding-window-method-for-pattern-recognition-in-swi-prolog/75675068?noredirect=1#comment133460381_75634558) shows. for n-1 it finds and counts "n" instead of "vnv" as you wanted. – Will Ness Mar 10 '23 at 10:12
  • @RenéChenard ... so it will report wrong results whenever the input starts or ends with `n`. – Will Ness Mar 10 '23 at 10:32
  • Considering a pattern of length *m* and a list of length *n*, in the worst case, the time complexity of the algorithm is *O* ( *m* × *n*). So the complexity will be linear only for *m* constant. For *m* variable and much smaller than *n* (i.e., m ≪ n), the complexity will be almost linear. Otherwise, for *m* variable and close to *n* (i.e., *m* ≈ *n*), the complexity will still be quadratic. – slago Mar 10 '23 at 13:44
  • By taking advantage of `clumped` first, I was able to limit the sliding window size to 5 using a predicate with the signature `sum_score(Player, PS, [S1-N1, S2-N2, S3-N3|Rest], PreviousScore, Score)`, where `PS` is the previous symbol and `NS` is the next symbol for `Rest = [NS|_]` which is accessed when possible. I believe it is a proper sliding window on a RLE list instead of the list itself. Since both, `clumped` and `sum_score` are linear in complexity, I believe the whole thing has linear complexity. – René Chenard Mar 10 '23 at 15:22
  • @RenéChenard Your question is much more specific than what the title or the content of the question suggest (only in the comments it becomes clear what you are doing). You can still avoid using clumped/2 and then "pattern match"; those are two passes of a list, and you really need only one. I still suspect a regex might be by far the fastest in practice. If you insist on working on the list, you could use the same approach as clumped/2 (see the implementation!) but do your work in one pass. – TA_intern Mar 10 '23 at 17:56
  • @RenéChenard The KMP algorithm uses two passes; but it deals with any pattern. Your pattern is simple enough to not need the first pass to know how far ahead it can skip. – TA_intern Mar 10 '23 at 17:59
  • @slago your thesis (m=n => quadratic) did *not* pan out when I tried it, empirically, at the SWI REPL, with the test queries as shown in the answer. It was still linear. one reason I can think of is that usually pretty much every pattern match will fail pretty early, so it's not `m` in the `m X n` but a small constant. unless of course you have a monolithic string of just one character repeated `n` times and the *same* character repeated `m` times, as the pattern (and that, indeed, tested as quadratic) -- which is definitely not the case here, as the repeated char is adorned with `v`s. so, – Will Ness Mar 11 '23 at 16:16
  • @slago ... so, the only way I could indeed see quadratic run time was with `length(L,1000),maplist(=(v),L),time(count_occurrences(open_rep(v,1000),L,C)).` and then with both `1000` changed to `2000` etc. – Will Ness Mar 11 '23 at 16:17
  • @RenéChenard just test the various code variants, and compare the run times. be sure to test `vnvnvn` and `nvn` with the `open_rep(n,1)` pattern, and make sure the results are 2 and 0, respectively. – Will Ness Mar 11 '23 at 16:35
  • @WillNess That is not *my* thesis, that is the complexity of the algorithm for *arbitrary* patterns and lists. To confirm this fact, try the queries `?- length(L,1000), maplist(=(v),L), length(P,900), maplist(=(v),P), time(count(L, P,0,C)).` and `length(L,2000), maplist(=(v),L), length(P,1800), maplist(=(v),P), time(count(L ,P,0,C)).`. You will see that by doubling the input size, the execution time is approximately four times longer. Anyway, for the specific type of pattern considered in the question, perhaps this worst case does not occur.. – slago Mar 11 '23 at 22:45
  • @slago yes, I've already shown these examples in my comment, and indeed this is inapplicable to the current question, but only to this very special corner case. in the general random case, each pattern matching will fail early and `m` will be very very small on average, whatever the length of the pattern. – Will Ness Mar 11 '23 at 22:51
2

I think that, even using the predicate append/3, you can still have an efficient solution (at least, that's what can be observed with SWI-Prolog).

count(Pattern, List, Count) :-
    phrase(Pattern, Sublist),
    count(List, Sublist, 0, Count).

count([], _, Count, Count).
count([X|Xs], Sublist, Count0, Count) :-
    (   append(Sublist, _, [X|Xs])
    ->   Count1 is Count0 + 1
    ;    Count1 is Count0 ),
    count(Xs, Sublist, Count1, Count).

% To compare the two versions with large lists

random_list(0, []) :- !.
random_list(N, [X|Xs]) :- 
    random_member(X, [n,v]), 
    M is N-1, 
    random_list(M, Xs).

Comparison of count/3 and count_occurrences/3, using SWI-Prolog:

?- random_list(1000,L), time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 4,648 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 3,003,226 inferences, 1.578 CPU in 1.683 seconds (94% CPU, 1903034 Lips)
L = [v, n, v, v, n, n, n, n, v|...],
C1 = C2, C2 = 116.

?- random_list(2000,L), time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 9,288 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 12,006,431 inferences, 11.812 CPU in 12.421 seconds (95% CPU, 1016417 Lips)
L = [n, n, v, v, n, n, n, v, n|...],
C1 = C2, C2 = 229.

?- random_list(1000000,L), time(count(open_rep(n,3),L,C1)).
% 4,905,979 inferences, 0.109 CPU in 0.146 seconds (75% CPU, 44854665 Lips)
L = [n, v, v, v, n, v, n, n, n|...],
C1 = 31246.

?- L = [v,n,v,n,v,v,v,v,v,v,n,v,v], time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 77 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 571 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
L = [v, n, v, n, v, v, v, v, v|...],
C1 = C2, C2 = 3.

?- L = [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], time(count(open_rep(n,3),L,C1)), time(count_occurrences(open_rep(n,3),L,C2)).
% 99 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 641 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
L = [v, n, v, n, v, v, v, v, v|...],
C1 = C2, C2 = 1.

?- L = [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 86 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 739 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
L = [v, n, v, n, v, v, v, v, v|...],
C1 = C2, C2 = 2.
slago
  • I ended up using clumped, as proposed by CapelliC, combined with a modification of your solution. This is now a proper sliding window method and it is working perfectly! Thank you! – René Chenard Mar 10 '23 at 00:12
1

IMO your approach is too specific, and it will be suboptimal from a (re)usability viewpoint. SWI-Prolog offers a library predicate that performs RLE (run-length encoding), as I discovered from this interesting topic, and it is worth trying. Here I'll post a module where I copied your code, along with an alternative that uses clumped/2:

/*  File:    x_so_sliding_window.pl
    Author:  Carlo
    Created: Mar  4 2023
    Purpose: https://stackoverflow.com/q/75630809/874024
*/

:- module(x_so_sliding_window,
          [count_occurrences/3
          ,open_rep//2
          ,count_by_clumped/3
          ]).


count_occurrences(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    length(PatternList, PatternLength),
    length(List, ListLength),
    MaxIndex is ListLength - PatternLength,
    findall(1, (
        between(0, MaxIndex, Index),
        sublist(List, Index, PatternLength, PatternList)
    ), Counts),
    sum_list(Counts, Count).

sublist(List, Index, Length, Sublist) :-
    length(Sublist, Length),
    append(Prefix, Rest, List),
    length(Prefix, Index),
    append(Sublist, _, Rest).

rep(S, L) --> [], {L is 0} | [S], {L is 1} | [S], { L_1 is L - 1 }, rep(S, L_1), !.
open_rep(S, L) --> [v], rep(S, L), [v], !.


count_by_clumped(Pattern,List,Count) :-
    clumped(List, R),
    aggregate_all(count, member(Pattern,R), Count).

then I had this code in x_so_sliding_window.plt

:- [x_so_sliding_window].

t(Count) :- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,v,v], Count).
t(Count) :- count_occurrences(open_rep(n, 3), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
t(Count) :- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).

c(Count) :- count_by_clumped(n-1, [v,n,v,n,v,v,v,v,v,v,n,v,v], Count).
c(Count) :- count_by_clumped(n-3, [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
c(Count) :- count_by_clumped(n-1, [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).

that shows at first glance the equivalence between t/1 and c/1 calls:

?- t(Count).
Count = 3 ;
Count = 1 ;
Count = 2.

?- c(Count).
Count = 3 ;
Count = 1 ;
Count = 2.

and finally, the 'benchmark', coded directly in the REPL:

?- time((between(1,1000,_),(t(_),fail))).
% 1,953,001 inferences, 0.234 CPU in 0.234 seconds (100% CPU, 8332804 Lips)
false.
?- time((between(1,1000,_),(c(_),fail))).
% 123,001 inferences, 0.000 CPU in 0.014 seconds (0% CPU, Infinite Lips)
false.
CapelliC
  • isn't it `count_by_clumped(n-1, [v,n,v,n,v,v,v,v,v,v,n], 3)` and `count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n], 2)` ? – Will Ness Mar 06 '23 at 09:41
  • @WillNess: see my edit in my answer, hope it helps... – CapelliC Mar 06 '23 at 11:01
  • I ended up using a combination of what you proposed and what slago proposed: a mix of clumped combined with a linear search. This effectively gave me a sliding window approach and the speed of computation was greatly improved! Thank you! – René Chenard Mar 10 '23 at 00:07
1

Here's a better method for a sublist:

sublist1([H|T], Index, Length, Sublist) :-
    sublist1_start_(T, H, 1, Index, Length, Sublist).

% P = previous
sublist1_start_(L, P, Ind, Index, Length, Sublist) :-
    sublist1_loop_(L, P, Ind, Index, 1, Length, Sublist).
sublist1_start_([H|T], _, Ind, Index, Length, Sublist) :-
    Ind1 is Ind + 1,
    % Go to next element
    sublist1_start_(T, H, Ind1, Index, Length, Sublist).

sublist1_loop_(_, H, Ind, Ind, Len, Len, [H]).
sublist1_loop_([H|T], P, Ind, Index, Len, Length, [P|Sublist]) :-
    Len1 is Len + 1,
    sublist1_loop_(T, H, Ind, Index, Len1, Length, Sublist).

Results in SWI-Prolog:

?- sublist1([a,b,c], Ind, Len, Sub).
Ind = Len, Len = 1,
Sub = [a] ;
Ind = 1,
Len = 2,
Sub = [a, b] ;
Ind = 1,
Len = 3,
Sub = [a, b, c] ;
Ind = 2,
Len = 1,
Sub = [b] ;
Ind = Len, Len = 2,
Sub = [b, c] ;
Ind = 3,
Len = 1,
Sub = [c].

If you want the indexing to start from 0 rather than 1, then change the 1 to 0 in the 2nd line.

I've excluded empty lists.

brebs
  • Thank you for your answer! I ended up using something totally different which uses clumped, and I don't know if it can get any better. I think my new solution has linear complexity... Anyway, it's really fast now! I'll publish my implementation of Gomoku very soon if you are interested. – René Chenard Mar 10 '23 at 00:32