4

I have a text file containing a sequence. For example:

GGGGGGGGAACCCCCCCCCCTTGGGGGGGGGGGGGGGGAACCCCCCCCCCTTGGGGGGGG

I have wrote the following DCG to find the sequence between AA and TT.

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).

process(Xs) :- phrase_from_file(find(Xs), 'string.txt').

anyseq([]) -->[].
anyseq([E|Es]) --> [E], anyseq(Es).

begin --> "AA".
end -->"TT".

find(Seq) -->
     anyseq(_),begin,anyseq(Seq),end, anyseq(_).

I query and I get:

?- process(Xs).
 Xs = "CCCCCCCCCC" ;
 Xs = "CCCCCCCCCCTTGGGGGGGGGGGGG...CCCCC" ;
 Xs = "CCCCCCCCCC" ;
false.

But I dont want it to find the second solution or ones like it. Only the solutions between one pair of AA and TTs not all combinations. I have a feeling I could use string_without and string in library dcg basiscs but I dont understand how to use them.

user27815
  • 4,767
  • 14
  • 28
  • Also any advice on naming these, I have find.. but it is not very declarative. – user27815 Jul 06 '15 at 13:42
  • What do you expect for the string `"AAAACCCTT"`? Is it `"AACCC"` or `"CCC"`? – false Jul 06 '15 at 15:20
  • hmm .. good point.. not sure at the moment. The answer below finds three solutions.. that will do for the moment. Thanks for pointing this out. – user27815 Jul 06 '15 at 15:34

2 Answers2

3

your anyseq//1 is identical to string//1 from library(dcg/basics), and shares the same 'problem'.

To keep in control, I would introduce a 'between separators' state:

elem(E) --> begin, string(E), end, !.

begin --> "AA".
end -->"TT".

find(Seq) -->
     anyseq(_),elem(Seq).

anyseq([]) -->[].
anyseq([E|Es]) --> [E], anyseq(Es).

process(Xs) :-
   phrase(find(Xs), `GGGGGGGGAACCCCCCCCCCTTGGGGGGGGGGGGGGGGAACCCCC+++CCCCCTTGGGGGGGG`,_).

now I get

?- process(X).
X = "CCCCCCCCCC" ;
X = "CCCCC+++CCCCC" ;
false.

note the anonymous var as last argument of phrase/3: it's needed to suit the change in 'control flow' induced by the more strict pattern used: elem//1 is not followed by anyseq//1, because any two sequences 'sharing' anyseq//1 would be problematic.

In the end, you should change your grammar to collect elem//1 with a right recursive grammar....

false
  • 10,264
  • 13
  • 101
  • 209
CapelliC
  • 59,646
  • 5
  • 47
  • 90
  • Thanks.. ill have a look at right recursive grammars.. I changed this to `process(Xs):- phrase_from_file(string(Whole_File),'string.txt'), phrase(find(Xs),Whole_File,_).` To get the file input.. is that a good way to do that? – user27815 Jul 06 '15 at 15:20
  • 1
    No, it defeats completely the purpose of phrase_from_file. While debugging, you can use [read_file_to_codes](http://www.swi-prolog.org/pldoc/doc_for?object=read_file_to_codes/3) and phrase/3, to see the 'remaining' that the grammar cannot digest. Then switch to phrase_from_file/2 when done – CapelliC Jul 06 '15 at 15:34
2

First, let me suggest that you most probably misrepresent the problem, at least if this is about mRNA-sequences. There, bases occur in triplets, or codons and the start is methionine or formlymethionine, but the end are three different triplets. So most probably you want to use such a representation.

The sequence in between might be defined using all_seq//2, if_/3, (=)/3:

mRNAseq(Cs) -->
   [methionine],
   all_seq(\C^maplist(dif(C),[amber,ochre,opal]), Cs),
   ( [amber] | [ochre] | [opal]).

or:

mRNAseq(Cs) -->
   [methionine],
   all_seq(list_without([amber,ochre,opal]), Cs),
   ( [amber] | [ochre] | [opal]).

list_without(Xs, E) :-
   maplist(dif(E), Xs).

But back to your literal statement, and your question about declarative names. anyseq and seq mean essentially the same.

% :- set_prolog_flag(double_quotes, codes).   % pick this
:- set_prolog_flag(double_quotes, chars).     % or pick that

... --> [] | [_], ... .

seq([]) -->
   [].
seq([E|Es]) -->
   [E],
   seq(Es).

mRNAcontent(Cs) -->
   ...,
   "AA",
   seq(Cs),
   "TT",
   {no_TT(Cs)},  % restriction
   ... .

no_TT([]).
no_TT([E|Es0]) :-
   if([E] = "T",
      ( Es0 = [F|Es], dif([F],"T") ),
      Es0 = Es),
   no_TT(Es).

The meaning of no_TT/1 is: There is no sequence "TT" in the list, nor a "T" at then end. So no_TT("T") fails as well, for it might collide with the subsequent "TT"!

So why is it a good idea to use pure, monotonic definitions? You will most probably be tempted to add restrictions. In a pure monotonic form, restrictions are harmless. But in the procedural version suggested in another answer, you will get simply different results that are no restrictions at all.

Community
  • 1
  • 1
false
  • 10,264
  • 13
  • 101
  • 209
  • 1
    Thanks for this answer. I am just experimenting with DCGs and genomic data. Trying to answer some simple questions. I am confused to what no_TT is doing? I am using SWI-prolog, so to use the if/3 I have put :- expects_dialect(sicstus). But I am confused to what this predicate is doing or its expected behavior? – user27815 Jul 07 '15 at 09:33
  • Examples I have tried: ?- no_TT([]). true. ?- no_TT("T"). false. ?- no_TT("A"). false. ?- no_TT(\`T\`). true. ?- no_TT([1,2,3]). true. ?- no_TT([1,2,3]). true. ?- char_code('T',C). C = 84. ?- no_TT([84]). true. ?- no_TT([84,84]). true. – user27815 Jul 07 '15 at 09:39
  • 1
    You must say `set_prolog_flag(double_quotes, codes).` **before** compiling! – false Jul 07 '15 at 09:45
  • Ok what do I do if elsewhere in my code I have swi-prolog strings? – user27815 Jul 07 '15 at 10:02
  • @user27815: Then you have a problem, indeed. It's your decision what you want. – false Jul 07 '15 at 10:28
  • 1
    @user27815: To use strings in a conforming manner, you can say `set_prolog_flag(back_quotes,string).` That is, use the backquote notation for strings. – false Jul 07 '15 at 13:48