In Prolog DCGs, how to remove over general solutions?

Question

I have a text file containing a sequence. For example:

GGGGGGGGAACCCCCCCCCCTTGGGGGGGGGGGGGGGGAACCCCCCCCCCTTGGGGGGGG

I have wrote the following DCG to find the sequence between AA and TT.

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).

process(Xs) :- phrase_from_file(find(Xs), 'string.txt').

anyseq([]) -->[].
anyseq([E|Es]) --> [E], anyseq(Es).

begin --> "AA".
end -->"TT".

find(Seq) -->
     anyseq(_),begin,anyseq(Seq),end, anyseq(_).

I query and I get:

?- process(Xs).
 Xs = "CCCCCCCCCC" ;
 Xs = "CCCCCCCCCCTTGGGGGGGGGGGGG...CCCCC" ;
 Xs = "CCCCCCCCCC" ;
false.

But I dont want it to find the second solution or ones like it. Only the solutions between one pair of AA and TTs not all combinations. I have a feeling I could use string_without and string in library dcg basiscs but I dont understand how to use them.

Also any advice on naming these, I have find.. but it is not very declarative. — user27815, Jul 06 '15 at 13:42
What do you expect for the string `"AAAACCCTT"`? Is it `"AACCC"` or `"CCC"`? — false, Jul 06 '15 at 15:20
hmm .. good point.. not sure at the moment. The answer below finds three solutions.. that will do for the moment. Thanks for pointing this out. — user27815, Jul 06 '15 at 15:34

score 3 · Accepted Answer · edited Jul 06 '15 at 21:33

3

your anyseq//1 is identical to string//1 from library(dcg/basics), and shares the same 'problem'.

To keep in control, I would introduce a 'between separators' state:

elem(E) --> begin, string(E), end, !.

begin --> "AA".
end -->"TT".

find(Seq) -->
     anyseq(_),elem(Seq).

anyseq([]) -->[].
anyseq([E|Es]) --> [E], anyseq(Es).

process(Xs) :-
   phrase(find(Xs), `GGGGGGGGAACCCCCCCCCCTTGGGGGGGGGGGGGGGGAACCCCC+++CCCCCTTGGGGGGGG`,_).

now I get

?- process(X).
X = "CCCCCCCCCC" ;
X = "CCCCC+++CCCCC" ;
false.

note the anonymous var as last argument of phrase/3: it's needed to suit the change in 'control flow' induced by the more strict pattern used: elem//1 is not followed by anyseq//1, because any two sequences 'sharing' anyseq//1 would be problematic.

In the end, you should change your grammar to collect elem//1 with a right recursive grammar....

edited Jul 06 '15 at 21:33

false

10,264
13
101
209

answered Jul 06 '15 at 14:49

CapelliC

59,646
5
47
90

Thanks.. ill have a look at right recursive grammars.. I changed this to `process(Xs):- phrase_from_file(string(Whole_File),'string.txt'), phrase(find(Xs),Whole_File,_).` To get the file input.. is that a good way to do that? – user27815 Jul 06 '15 at 15:20
1

No, it defeats completely the purpose of phrase_from_file. While debugging, you can use [read_file_to_codes](http://www.swi-prolog.org/pldoc/doc_for?object=read_file_to_codes/3) and phrase/3, to see the 'remaining' that the grammar cannot digest. Then switch to phrase_from_file/2 when done – CapelliC Jul 06 '15 at 15:34

score 2 · Answer 2 · edited May 23 '17 at 11:52

First, let me suggest that you most probably misrepresent the problem, at least if this is about mRNA-sequences. There, bases occur in triplets, or codons and the start is methionine or formlymethionine, but the end are three different triplets. So most probably you want to use such a representation.

The sequence in between might be defined using all_seq//2, if_/3, (=)/3:

mRNAseq(Cs) -->
   [methionine],
   all_seq(\C^maplist(dif(C),[amber,ochre,opal]), Cs),
   ( [amber] | [ochre] | [opal]).

or:

mRNAseq(Cs) -->
   [methionine],
   all_seq(list_without([amber,ochre,opal]), Cs),
   ( [amber] | [ochre] | [opal]).

list_without(Xs, E) :-
   maplist(dif(E), Xs).

But back to your literal statement, and your question about declarative names. anyseq and seq mean essentially the same.

% :- set_prolog_flag(double_quotes, codes).   % pick this
:- set_prolog_flag(double_quotes, chars).     % or pick that

... --> [] | [_], ... .

seq([]) -->
   [].
seq([E|Es]) -->
   [E],
   seq(Es).

mRNAcontent(Cs) -->
   ...,
   "AA",
   seq(Cs),
   "TT",
   {no_TT(Cs)},  % restriction
   ... .

no_TT([]).
no_TT([E|Es0]) :-
   if([E] = "T",
      ( Es0 = [F|Es], dif([F],"T") ),
      Es0 = Es),
   no_TT(Es).

The meaning of no_TT/1 is: There is no sequence "TT" in the list, nor a "T" at then end. So no_TT("T") fails as well, for it might collide with the subsequent "TT"!

So why is it a good idea to use pure, monotonic definitions? You will most probably be tempted to add restrictions. In a pure monotonic form, restrictions are harmless. But in the procedural version suggested in another answer, you will get simply different results that are no restrictions at all.

Thanks for this answer. I am just experimenting with DCGs and genomic data. Trying to answer some simple questions. I am confused to what no_TT is doing? I am using SWI-prolog, so to use the if/3 I have put :- expects_dialect(sicstus). But I am confused to what this predicate is doing or its expected behavior? — user27815, Jul 07 '15 at 09:33
Examples I have tried: ?- no_TT([]). true. ?- no_TT("T"). false. ?- no_TT("A"). false. ?- no_TT(\`T\`). true. ?- no_TT([1,2,3]). true. ?- no_TT([1,2,3]). true. ?- char_code('T',C). C = 84. ?- no_TT([84]). true. ?- no_TT([84,84]). true. — user27815, Jul 07 '15 at 09:39
You must say `set_prolog_flag(double_quotes, codes).` **before** compiling! — false, Jul 07 '15 at 09:45
Ok what do I do if elsewhere in my code I have swi-prolog strings? — user27815, Jul 07 '15 at 10:02
@user27815: Then you have a problem, indeed. It's your decision what you want. — false, Jul 07 '15 at 10:28
@user27815: To use strings in a conforming manner, you can say `set_prolog_flag(back_quotes,string).` That is, use the backquote notation for strings. — false, Jul 07 '15 at 13:48

In Prolog DCGs, how to remove over general solutions?

2 Answers2