We want to count the correspondences between two (possibly long) strings which happen to represent DNA sequences. The sequences are lists-of-chars where the char is taken from a,c,t,g,'_'
, with the '_' a "don't know" placeholder which never corresponds to anything, even itself. In this case, we employ library(aggregate) (thanks to CapelliC for the idea):
match(Seq1,Seq2,Count) :-
aggregate_all(count,
(
nth1(Pos,Seq1,X),
nth1(Pos,Seq2,X),
memberchk(X,[a,c,g,t])
),
N).
This approach can be compared to a "straightforward" approach where one would set up a (tail-recursive) recursion that just walks down both sequences in tandem and compares elements pairwise, counting as it goes.
As the sequences can be very large, algorithmic complexity becomes of some interest.
One would expect, with n = length(sequence) and both sequences the same length:
- Straightforward approach: complexity is O(n)
- aggregation approach: complexity is O(n²)
What is the (time and maybe space) complexity of the above algorithm and why?
Test code
To complement the above, an SWI-Prolog based plunit test code block:
:- begin_tests(atcg).
wrap_match(String1,String2,Count) :-
atom_chars(String1,Seq1),
atom_chars(String2,Seq2),
fit(Seq1,Seq1,0,Count).
test("string 1 empty",nondet) :-
wrap_match("atcg","",Count),
assertion(Count == 0).
test("string 2 empty") :-
wrap_match("","atcg",Count),
assertion(Count == 0).
test("both strings empty") :-
wrap_match("","",Count),
assertion(Count == 0).
test("both strings match, 1 char only") :-
wrap_match("a","a",Count),
assertion(Count == 1).
test("both strings match") :-
wrap_match("atcgatcgatcg","atcgatcgatcg",Count),
assertion(MatchCount == 12).
test("both strings match with underscores") :-
wrap_match("_TC_ATCG_TCG","_TC_ATCG_TCG",Count),
assertion(MatchCount == 9).
test("various mismatches 1") :-
wrap_match("atcgatcgatcg","atcgatcgatcg",Count),
assertion(MatchCount == 8).
test("various mismatches with underscores") :-
wrap_match("at_ga_cg__cg","atcgatcgatcg",Count),
assertion(Count == 8).
:- end_tests(atcg).
And so:
?- run_tests.
% PL-Unit: atcg ........ done
% All 8 tests passed
true.