How to count coincidences on each character of two large strings without triggering the Out of Local Stack exception?

Question

I need a clause that counts char coincidences between two large strings but omitting '_' coincidences. I have this code:

fit(GEN1, GEN2, N, N) :-
   length(GEN1, L1),
   length(GEN2, L2),
   0 is L1*L2.

fit([P1|R1], [P2|R2], N, TOTAL) :-
   member(P1, ['_',a,c,t,g]),
   member(P2, ['_',a,c,t,g]),
   append([P1],[P2],T),
   (  member(T,[[a,a],[c,c],[t,t],[g,g]])
   -> X is N+1
   ;  X is N
   ),
   fit(R1,R2,X,TOTAL).

Where GEN1 and GEN2 are lists containing all characters large strings.

I've tried increasing the stack limit to avoid Out of Local Stack exception with little success.

The issue is that, is called often and in deep recursive clauses. Is there any better way to do this?

EDIT

The clause needs to stop when one or both lists are empty.

EDIT 2

Is worth saying that testings on all answers below were done using 64bit prolog, with the --stack-limit=32g option as my code isn't well optimized and the fit clause is a small part of a larger process, but was the main problem with my code.

EDIT 3

CapelliC code worked using the less resources.

false code using the library(reif) v2 worked the faster.

See Complexity of counting matching elements in two sequences using library(aggregate) for more proposed solutions.

Add an example query to produce the overflow – false Apr 26 '21 at 18:17 — false, Apr 26 '21 at 18:17

false · Answer 1 · 2021-04-27T19:43:34.497

5

It seems that there is no point to insist that you have letters out of "_actg" all the time. A generalized definition seems to be sufficient. Using library(reif):

fit([], _, N,N).
fit([_|_], [], N,N).
fit([P1|R1], [P2|R2], N,TOTAL) :-
   if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
   fit(R1, R2, X,TOTAL).

Update: please make sure to use v2 of library(reif). The original version did not compile dif/3.

And here a version for systems that can only index on one argument simultaneously:

fit([], _, N,N).
fit([P1|R1], L2, N,TOTAL) :-
    ifit(L2, [P1|R1], N,TOTAL).

ifit([], _, N,N).
ifit([P2|R2], [P1|R1], N,TOTAL) :-
   if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
   fit(R1, R2, X,TOTAL).

edited Apr 27 '21 at 19:43

answered Apr 26 '21 at 18:35

false

10,264
13
101
209

That should be pretty fast indeed. But can't we just `fit([P|R1],[P|R2], N, Total) :- P \== '_', !, X is N+1, fit(R1,R2, X, TOTAL). fit([_|_],[_,_], N, N).`? – David Tonhofer Apr 26 '21 at 19:02
@David: You may want to test for groundness in the beginning, but then you could never user any form of declarative diagnosis. – false Apr 26 '21 at 19:05
@RezzDeWitt Do you have `library(reif)` from Markus Triska's [Metalevel Predicates](https://www.metalevel.at/prolog/metapredicates) page? Under "if_/3 and related predicates". Sadly, the server which hosts it seems down: http://www.complang.tuwien.ac.at/ulrich/Prolog-inedit/swi/reif.pl ... But here is the [snapshot from April 11](http://web.archive.org/web/20210411155326/complang.tuwien.ac.at/ulrich/Prolog-inedit/swi/reif.pl) – David Tonhofer Apr 27 '21 at 05:50
@DavidTonhofer, i do, do you need it? The 'false' code worked fine, it allowed me to use the algorithm with more iterations and more population (long strings and how are they similar in this case) without the **Out of local stack** exception. **EDIT** Correction, i have the `library(reif)` from Ulrich Neumerkel. – RezzDeWitt Apr 27 '21 at 05:57
It is working, but i am testing other solutions regarding this before selecting a answer as correct. – RezzDeWitt Apr 27 '21 at 06:01
1

I have allowed myself to add a few "plunit" test cases to "false"'s answer for illustration. – David Tonhofer Apr 27 '21 at 06:04
@RezzDeWitt: There is a [new version (v2)](http://www.complang.tuwien.ac.at/ulrich/Prolog-inedit/swi/reif.pl) of `library(reif)` which now compiles dif/2 as well. Check it with `listing(fit)`: there must not be any error handling in it. – false Apr 27 '21 at 08:44
@false Stackkyflow doesn't allow "somewhere else", more's the pity. One of the points of your code was to do away with necessitating "atcg" wasn't it? And what's wrong with my spacing? – David Tonhofer Apr 27 '21 at 15:08
1

@DavidTonhofer: Note the "Post Your Answer" button. And for your spacing: You use an empty line between all clauses, so one cannot tell, if the next clause belongs to the same predicate. – false Apr 27 '21 at 16:24
@false But it isn't really an answer, so that wouldn't be appropriate. SO should have a "note" feature... Note that, within the "plunit" module block, all the predicates (except for helpers) are either `test/1` or `test/2`. They are just individual test cases, not really clauses of the same predicate. Anyway, feel free to roll back. – David Tonhofer Apr 27 '21 at 16:41
1

@DavidTonhofer: There is a reason why I do not like the current plunit, and this "looks like a rule but it isn't one" thing nails it. – false Apr 27 '21 at 16:47
@false Ah. I have worked for years with junit, so it's not really shocking. Kinda like a DSL in Racket. – David Tonhofer Apr 27 '21 at 17:17
@false code using the new version of ``library(reif)`` worked quite fast, still using a moderate amount of resources. – RezzDeWitt Apr 27 '21 at 18:59
@RezzDeWitt: Seems SWI cannot index on multiple arguments. Please try the latest version above! – false Apr 27 '21 at 19:44

score 2 · Accepted Answer · edited Apr 28 '21 at 14:19

2

if your Prolog has library(aggregate) you can do

fit(GEN1, GEN2, N) :-
  aggregate_all(count, (nth1(P,GEN1,S),nth1(P,GEN2,S),memberchk(S,[a,c,g,t])), N).

edit

Depending on the statistic of data, a noticeable improvement can be obtained just swapping the last two calls, i.e. ...(nth1(P,GEN1,S),memberchk(S,[a,c,g,t]),nth1(P,GEN2,S))...

edit

Of course a tight loop it's better that a double indexed scan. For performance, I would write it like

fit_cc(GEN1, GEN2, N) :-
  fit_cc(GEN1, GEN2, 0, N).

fit_cc([X|GEN1], [Y|GEN2], C, N) :-
  (  X\='_' /*memberchk(X, [a,c,g,t])*/, X=Y
  -> D is C+1 ; D=C
  ),
  fit_cc(GEN1, GEN2, D, N).
fit_cc(_, _, N, N).

but the generality and correctness allowed by library(reif) v2, as seen in @false' answer and comments, seems to be well worth the (pretty small) overhead.

edited Apr 28 '21 at 14:19

David Tonhofer

14,559
5
55
51

answered Apr 27 '21 at 05:33

CapelliC

59,646
5
47
90

That sounds like an O(length(list)²) algorithm though (lists will be lists). Will that be acceptable? – David Tonhofer Apr 27 '21 at 05:53
1

Several runs using CapelliC's code worked amazingly well, using **a lot less resources** while taking about the same time than false code. – RezzDeWitt Apr 27 '21 at 06:08
@RezzDeWitt I guess zipping down lists in C is pretty fast. Also it's just O(length(first list)) really. My bad. – David Tonhofer Apr 27 '21 at 07:14
2

Note that `fit([a],[B], 1), B = t.` succeeds, yet `B = t, fit([a],[B], 1).` fails. – false Apr 27 '21 at 08:12
@false I wasn't using the correct call, i just deleted the comment. – RezzDeWitt Apr 27 '21 at 19:02
@DavidTonhofer why "bad"? why _wouldn't_ it be quadratic? on the face of it, it should. isn't it? I've just clicked the link in this answer, I didn't see any guarantees for deep introspection and reinterpretation of goals there. do we really know the second `nth1` isn't searching from the top anew on each invocation? ( @ CapelliC as well). – Will Ness Apr 28 '21 at 08:19
@WillNess I added a sidenote of my idea. – David Tonhofer Apr 28 '21 at 10:19
your sum is quadratic, but that aside, the problem is the second nth1, not the first. the first can be coded as a linear generator in case P was var, that's easy to imagine. the 2nd nth1 -- if used as an opaque predicate -- can't possibly know that the next P it'll receive will be exactly (1+) of the previous one, and the 2nd arg will be the same. so either aggregate does its own interpretation for such cases, or nth1 has lots of internal state and somehow detects it's safe to advance the previously used pointer -- *not* the argument it receives -- one notch and use _that_. @DavidTonhofer – Will Ness Apr 28 '21 at 12:15
very dubious. i.e. for the first nth1 call in the aggregate's goal to take O(1) on each redo to produce the next result is plausible; but for the 2nd nth1 to do the same, _that_, I don't see how it can be, except aggregate doing its own interpretation for itself and in effect merging the two calls into one. @DavidTonhofer (meta: I appreciate your feedback but I feel uneasy with us commandeering CapelliC's answer like this; this is not how SO is usually supposed to be operating...) – Will Ness Apr 28 '21 at 12:19
@WillNess It still don't see the problem. That sum is linear?? We should open a new question. Let me do that. – David Tonhofer Apr 28 '21 at 12:58
@DavidTonhofer yes please do. the sum `1 + 2+ 3 + ... + n == n(n+1)/2` is quadratic, isn't it? – Will Ness Apr 28 '21 at 13:00
@WillNess You are right. I must be getting old. A question has been opened: https://stackoverflow.com/questions/67301959/complexity-of-counting-matching-elements-in-two-sequences-using-libraryaggrega – David Tonhofer Apr 28 '21 at 14:20

score 0 · Answer 3 · answered Apr 28 '21 at 12:35

In case you always call your predicate with two first arguments already fully instantiated, so you use it as a function, not as a relation -- which it seems like you do indeed -- I suspect that just adding !, at the start of your very last line of code should be enough to remove the stack overflow.

To do a little bit better, we'd use memberchk instead of member and notice that append([A],[B],C) is exactly the same thing as C = [A,B]; so after a little bit of reshufflling we end up with something like

fit( [], [], N, N).

fit( [P1|R1], [P2|R2], N, TOTAL) :-
   memberchk( P1, [a,c,t,g]),
   (  P2 == P1
   -> X is N+1
   ;  X is N
   ),
   %%  !,    %% might need the cut
   fit( R1, R2, X, TOTAL).

and we might not even need that cut since memberchk is already deterministic.

(not tested, though)

How to count coincidences on each character of two large strings without triggering the Out of Local Stack exception?

3 Answers3