Complexity of counting matching elements in two sequences using `library(aggregate)`

Question

We want to count the correspondences between two (possibly long) strings which happen to represent DNA sequences. The sequences are lists-of-chars where the char is taken from a,c,t,g,'_', with the '_' a "don't know" placeholder which never corresponds to anything, even itself. In this case, we employ library(aggregate) (thanks to CapelliC for the idea):

match(Seq1,Seq2,Count) :-
   aggregate_all(count,
      (
         nth1(Pos,Seq1,X),
         nth1(Pos,Seq2,X),
         memberchk(X,[a,c,g,t])
      ),
      N).

This approach can be compared to a "straightforward" approach where one would set up a (tail-recursive) recursion that just walks down both sequences in tandem and compares elements pairwise, counting as it goes.

As the sequences can be very large, algorithmic complexity becomes of some interest.

One would expect, with n = length(sequence) and both sequences the same length:

Straightforward approach: complexity is O(n)
aggregation approach: complexity is O(n²)

What is the (time and maybe space) complexity of the above algorithm and why?

Test code

To complement the above, an SWI-Prolog based plunit test code block:

:- begin_tests(atcg).

wrap_match(String1,String2,Count) :-
   atom_chars(String1,Seq1),
   atom_chars(String2,Seq2),
   fit(Seq1,Seq1,0,Count).

test("string 1 empty",nondet) :-
   wrap_match("atcg","",Count), 
   assertion(Count == 0).
      
test("string 2 empty") :-
   wrap_match("","atcg",Count), 
   assertion(Count == 0).

test("both strings empty") :-
   wrap_match("","",Count), 
   assertion(Count == 0).

test("both strings match, 1 char only") :-
   wrap_match("a","a",Count),   
   assertion(Count == 1).

test("both strings match") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),   
   assertion(MatchCount == 12).

test("both strings match with underscores") :-
   wrap_match("_TC_ATCG_TCG","_TC_ATCG_TCG",Count),   
   assertion(MatchCount == 9).

test("various mismatches 1") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),   
   assertion(MatchCount == 8).

test("various mismatches with underscores") :-
   wrap_match("at_ga_cg__cg","atcgatcgatcg",Count),   
   assertion(Count == 8).
   
:- end_tests(atcg).

And so:

?- run_tests.
% PL-Unit: atcg ........ done
% All 8 tests passed
true.

It's important to note that in the question that gave rise to this thread, the sequences could be of different length. Hence my double nth1/2 usage... — CapelliC, Apr 29 '21 at 09:45

David Tonhofer · Answer 1 · 2021-04-28T14:24:47.470

3

Empirical info

After some manual data collection (something that cries out for automatization) using the code below, which outputs time elapsed and number of inferences made to the console:

gimme_random_sequence(Length,Seq) :-
   length(Seq,Length),
   maplist(
      [E]>>(random_between(0,3,Ix),nth0(Ix,[a,t,c,g],E)),
      Seq).
           
how_fast(Length) :-
   gimme_random_sequence(Length,Seq1),
   gimme_random_sequence(Length,Seq2),
   time(match(Seq1,Seq2,_)).

... and a bit of graph fumbling in LibreOffice Calc (my ggplot skills are rusty), we have empirical data that this algorithm's cost is

O((length(sequence))²).

Count,Inferences,Seconds,milliseconds,megainferences
1000,171179,0.039,39,0.171179
2000,675661,0.097,97,0.675661
3000,1513436,0.186,186,1.513436
4000,2684639,0.327,327,2.684639
5000,4189172,0.502,502,4.189172
6000,6027056,0.722,722,6.027056
7000,8198103,1.002,1002,8.198103
8000,10702603,1.304,1304,10.702603
9000,13540531,1.677,1677,13.540531
10000,16711607,2.062,2062,16.711607
11000,20216119,2.449,2449,20.216119
20000,66756619,8.091,8091,66.756619
30000,150134731,17.907,17907,150.134731
40000,266846773,32.012,32012,266.846773
50000,416891749,52.942,52942,416.891749
60000,600269907,74.103,74103,600.269907

edited Apr 28 '21 at 14:24

answered Apr 28 '21 at 14:17

David Tonhofer

14,559
5
55
51

Did you loaded `library(apply_macros)` to eliminate the meta-call and lambda expression overheads? – Paulo Moura Apr 28 '21 at 15:01
2

please always do such plots at [log-log scale](http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth) so it becomes precise. :) – Will Ness Apr 28 '21 at 15:19
@PauloMoura Ah... no. But that would be a constant multiplier. I will have to retry after getting time/1 to output machine-readable data. – David Tonhofer Apr 29 '21 at 09:23
@WillNess Good advice. I wanted to fit an polynomial (there isa LibreOffice add-on for that), but I wasn't ready to go through that tutorial just yet. – David Tonhofer Apr 29 '21 at 09:25
@DavidTonhofer yes, we've got something even better -- vision, :) the massively parallel super-performant pattern recognition sub-system of human cognition. you just plot your data there, hand-draw a straight line that feels like it's the closest, and we immediately see what's going on. (one example is https://stackoverflow.com/a/30116605/849891 and thereabout) – Will Ness Apr 29 '21 at 13:25
1

so the conclusion is, it _is_ quadratic; `aggregate` does _not_ do its own interpretation automagically fusing (joining together) the two *sequential* `nth1`s for us, and we have to do it ourselves, ending up with `member/4`, which is two `member/2`s working in unison over the two input lists, processing them *"in parallel"*. – Will Ness Apr 30 '21 at 09:37

score 3 · Answer 2 · 2021-04-30T14:04:13.197

Never ever use functional programming idioms in Prolog that avoid backtracking, like maplist/4. This here, pair_member/4 and match3/3, should be a tick faster.

match2(Seq1, Seq2, Count) :-
   (   maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
   ->  aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
   ;   Count = 0 ).

pair_member(X, Y, [X|_], [Y|_]).
pair_member(X, Y, [_|L], [_|R]) :-  
    pair_member(X, Y, L, R).

match3(Seq1, Seq2, Count) :-
   aggregate_all(count,
         (pair_member(X, X, Seq1, Seq2), X \= '_'), Count).

gimme_random_sequence(Length, Seq) :-
   length(Seq, Length),
   maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).

test(N) :-
   gimme_random_sequence(N, S1),
   gimme_random_sequence(N, S2),
   time(match2(S1, S2, Count)),
   time(match3(S1, S2, Count)).

Woa! Its 10x times faster! Thanks to genius of SWI-Prolog how it
compiles the tail recursion in pair_member/4:

/* SWI-Prolog 8.3.21, MacBook Air 2019 */
?- set_prolog_flag(double_quotes, chars).
true.

?- X = "abc".
X = [a, b, c].

?- match2("_TC_ATCG_TCG","_TC_ATCG_TCG",Count).
Count = 9.

?- match3("_TC_ATCG_TCG","_TC_ATCG_TCG",Count).
Count = 9.

?- test(100000).
% 1,575,520 inferences, 0.186 CPU in 0.190 seconds (98% CPU, 8465031 Lips)
% 175,519 inferences, 0.018 CPU in 0.019 seconds (98% CPU, 9577595 Lips)
true.

Edit 29.04.2021:
Oh the irony, bifurcation backtracking is nevertheless challenging.
After fixing a misuse of library(apply_macros), I get:

?- test(100000).
% 374,146 inferences, 0.019 CPU in 0.019 seconds (99% CPU, 19379778 Lips)
% 174,145 inferences, 0.014 CPU in 0.014 seconds (99% CPU, 12400840 Lips)
true.

Does native member/2 contribute to the good maplist solution performance?
But I should do a better measure, with larger times durations.

Open Source:

Sequence Match Problem
https://gist.github.com/jburse/9fd22e8c3e8de6148fbd341817538ef6#file-sequence-pl

"Never ever use functional programming idioms in Prolog that avoid backtracking". Wouldn't the poodle's kernel rather be: "Avoid building intermediary structures, use stream processing instead where backtracking helps you to keep the small relevant "window of data" on the stack while giving you a result in the first branch and going down further along the stream in the other branch." — David Tonhofer, Apr 29 '21 at 09:37
Misuse of `library(apply_macros)` means: 1) Load `library(apply_macros)`: `use_module(library(apply_macros))`, and then 2) enter the code again for compilation. — David Tonhofer, Apr 29 '21 at 09:44
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231791/discussion-between-mostowski-collapse-and-will-ness). — , Apr 30 '21 at 14:07

slago · Accepted Answer · 2021-05-01T00:32:52.337

I think that it is interresting to observe that complexity O(n²) is not due to the aggregation approach itself, but to the fact that subgoal nth1(Pos,Seq1,X), nth1(Pos,Seq2,X) behaves as a "nested loop" (in the size n of the sequences).

Thus, it should be possible to create another algorithm that, even using aggregation, can have complexity O(n), as long as the "nested loop" is eliminated.

Algorithms to compare

% Original algorithm: Complexity O(n²)

match1(Seq1, Seq2, Count) :-
   aggregate_all(count,
      (  nth1(Pos, Seq1, X),
         nth1(Pos, Seq2, X),
         memberchk(X, [a,c,g,t]) ),
      Count).

% Proposed algorithm: Complexity O(n)

match2(Seq1, Seq2, Count) :-
   (   maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
   ->  aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
   ;   Count = 0 ).

gimme_random_sequence(Length, Seq) :-
   length(Seq, Length),
   maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).

test(N) :-
   gimme_random_sequence(N, S1),
   gimme_random_sequence(N, S2),
   time(match1(S1, S2, Count)),
   time(match2(S1, S2, Count)).

Simple empirical results

?- test(10000).
% 16,714,057 inferences, 1.156 CPU in 1.156 seconds (100% CPU, 14455401 Lips)
% 39,858 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.

?- test(20000).
% 66,761,535 inferences, 4.594 CPU in 4.593 seconds (100% CPU, 14533123 Lips)
% 79,826 inferences, 0.016 CPU in 0.016 seconds (100% CPU, 5108864 Lips)
true.

?- test(40000).
% 266,856,213 inferences, 19.734 CPU in 19.841 seconds (99% CPU, 13522405 Lips)
% 159,398 inferences, 0.016 CPU in 0.015 seconds (104% CPU, 10201472 Lips)
true.

?- test(80000).
% 1,067,046,835 inferences, 77.203 CPU in 77.493 seconds (100% CPU, 13821291 Lips)
% 320,226 inferences, 0.047 CPU in 0.047 seconds (100% CPU, 6831488 Lips)
true.

Edit 30/04/2021: Does nth1(I,S,X), nth1(I,S,X) really work as nested loop?

To see that the answer to this question is yes, consider the following simple implementation of nth/3, that counts the number of rounds needed to find each solution, using a global flag:

nth(Index, List, Item) :-
   (   var(Index)
   ->  nth_nondet(1, Index, List, Item)
   ;   integer(Index)
   ->  nth_det(Index, List, Item)
   ).

nth_det(1, [Item|_], Item) :- !.
nth_det(Index, [_|Rest], Item) :-
   flag(rounds, Rounds, Rounds+1),
   Index1 is Index - 1,
   nth_det(Index1, Rest, Item).

nth_nondet(Index, Index, [Item|_], Item).
nth_nondet(Acc, Index, [_|Rest], Item) :-
   flag(rounds, Rounds, Rounds+1),
   Acc1 is Acc + 1,
   nth_nondet(Acc1, Index, Rest, Item).

To get the number of rounds, you can ask:

?- flag(rounds,_,0), nth(5,[a,b,c,d,e],X), flag(rounds,Rounds,Rounds).
X = e,
Rounds = 4.

Now, using this predicate, we can create a predicate to count the number of rounds of the goal nth(I,L,X), nth(I,L,X), for lists of different lengths:

count_rounds :-
    forall(between(1, 10, N),
           ( Length is 10*N,
             count_rounds(Length, Rounds),
             writeln(rounds(Length) = Rounds)
           )).

count_rounds(Length, _) :-
    numlist(1, Length, List),
    flag(rounds, _, 0),
    nth(Index, List, Item),
    nth(Index, List, Item),
    fail.
count_rounds(_, Rounds) :-
    flag(rounds, Rounds, Rounds).

Empirical results:

?- count_rounds.
rounds(10) = 55
rounds(20) = 210
rounds(30) = 465
rounds(40) = 820
rounds(50) = 1275
rounds(60) = 1830
rounds(70) = 2485
rounds(80) = 3240
rounds(90) = 4095
rounds(100) = 5050

As we can see, the goal nth(I,L,X), nth(I,L,X) computes half of a square matrix of order n (including its diagonal). Thus, the number of rounds for a list of length n is rounds(n) = (n² + n)/2. Hence, the time complexity of this goal is O(n²).

Remark The implementation of the library predicate nth1/3 is a little more efficient than that of predicate nth/3considered for this experiment. Nevertheless, the time complexity of goal nth1(I,S,X), nth1(I,S,X)still is O(n²).

score 2 · Answer 4 · answered Apr 29 '21 at 09:27

This is a followup of @MostowskiCollapse answer, where I have applied the same optimization that Gertjan van Noord provided for member/2 to pair_member/4, but I have renamed it to member/4.

member(X, Y, [XH|XT], [YH|YT]) :-
  member_(XT, YT, X, Y, XH, YH).

member_(_, _, X,Y, X,Y).
member_([XH|XT],[YH|YT], X,Y, _,_) :-
  member_(XT,YT, X,Y, XH,YH).

match4(Seq1, Seq2, Count) :-
  aggregate_all(count,
      (member(X, X, Seq1, Seq2), X \= '_'), Count).

test(N) :-
  gimme_random_sequence(N, S1),
  gimme_random_sequence(N, S2),
  %time(match2(S1, S2, Count)),
  time(match3(S1, S2, Count)),
  time(match4(S1, S2, Count)).

...

with lists of length 1.000.000 I get

% 1,751,758 inferences, 0.835 CPU in 0.835 seconds (100% CPU, 2098841 Lips)
% 1,751,757 inferences, 0.637 CPU in 0.637 seconds (100% CPU, 2751198 Lips)

that is, a gain of about 25%...

I wonder how this speed-up relates to the Prolog VM of SWI-Prolog. Is it the clause indexing or something else? After all member_/6 has more arguments, but SWI-Prologs tail recursion seems to deal with it. — , May 01 '21 at 10:48
IIRC the gain is due to avoiding the double unpack of list head. So I'd say it's related to Prolog VM. — CapelliC, May 01 '21 at 17:27

Complexity of counting matching elements in two sequences using `library(aggregate)`

Test code

4 Answers4

Empirical info

Linked