What is the general pattern for creating a dcg for file input?

Question

I always seem to struggle to write DCG's to parse input files. But it seems it should be simple? Are there any tips or tricks to think about this problem?

For a concrete example, lets say I want to parse a fasta file. (https://en.wikipedia.org/wiki/FASTA_format). I want to read each description and each sequence on back tracking.

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).
:- set_prolog_flag(double_quotes, codes).
:- set_prolog_flag(back_quotes,string).

fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
   fasta_section(Section),
   fasta_file(Sections).


fasta_section(Section) -->
    fasta_description(Description),
    fasta_seq(Sequence),
    {Section =.. [section,Description,Sequence]}.

fasta_description(Description) -->
    ">",
    string(Description),
    {no_gt(Description),
     no_nl(Description)}.


fasta_seq([]) --> [].
fasta_seq(Seq) -->
    nt([S]),
    fasta_seq(Ss),
    {S="X"->Seq =Ss;Seq=[S|Ss]}.

 nt("A") --> "A".
 nt("C") --> "C".
 nt("G") --> "G".
 nt("T") --> "T".
 nt("X") --> "\n".

 no_gt([]).
 no_gt([E|Es]):-
     dif([E],">"),
     no_gt(Es).

 no_nl([]).
 no_nl([E|Es]):-
     dif([E],"\n"),
     no_nl(Es).

Now this is clearly wrong. The behaviour I would like is

 ?-phrase(fasta_section(S),">frog\nACGGGGTACG\n>duck\nACGTTAG").
 S = section("frog","ACGGGGTACG");
 S = section("duck","ACGTTAG");
 false.

But if I did phrase(fasta_file(Sections),">frog\nACGGGGTACG\n>duck\nACGTTAG). Sections is unified with a list of sections/2s, which is what I want, but my current code seems quite hacky- how I have handled the newline character for example.

[This answer](http://stackoverflow.com/a/12942551/772868) will be of interest to you. — false, Jul 12 '15 at 16:53
Replace `{Section =.. [section,Description,Sequence]}` by `{Section = section(Description,Sequence)}` — false, Jul 12 '15 at 18:26

CapelliC · Accepted Answer · 2015-07-11T22:45:56.483

for sure, there are 'small' typing problems:

nt("A") -->"A",
nt("C") -->"C",
nt("G") -->"G",
nt("T") -->"T".

should be

nt("A") -->"A".
nt("C") -->"C".
nt("G") -->"G".
nt("T") -->"T".

anyway, I also had my problems debugging DCG, I wrote a parser to load in Prolog a MySQL dump (plain SQL, really), and was a pain when something unexpected, like escaped strings, or UTF8 (?) weird encodings were found.

I would suggest to use phrase/3, to see if there is an unparsable tail. Also, could help to place some debug output after known, well behaved sequences.

Of course, I assume you already tried to use the SWI-Prolog debugger.

Also, beware of

...
dif([E],">"),
...

did you set the appropriate flag about double quotes ? In DCG bodies, the rewrite machinery takes care of matching, but a sequence of codes in SWI-Prolog by default doesn't match double quoted strings...

edit

I think this will not solve your doubt about a general strategy... anyway, it's how I would handle the problem...

fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
    fasta_section(Section),
    fasta_file(Sections).

fasta_section(section(Description,Sequence)) -->
    fasta_description(Description),
    fasta_seq(SequenceCs), {atom_codes(Sequence, SequenceCs)}, !.

fasta_description(Description) -->
    ">", string(DescriptionCs), "\n", {atom_codes(Description, DescriptionCs)}.

fasta_seq([S|Seq]) --> nt(S), fasta_seq(Seq).
fasta_seq([]) --> "\n" ; []. % optional \n at EOF

nt(0'A) --> "A".
nt(0'C) --> "C".
nt(0'G) --> "G".
nt(0'T) --> "T".

now

?- phrase(fasta_file(S), `>frog\nACGGGGTACG\n>duck\nACGTTAG`).
S = [section(frog, 'ACGGGGTACG'), section(duck, 'ACGTTAG')] ;
false.

note: the order of clauses fasta_seq//1 is important, since it implements an 'eager' parsing - mainly for efficiency. As I said, I had to parse SQL, several MBs was common.

edit

?- phrase((string(_),fasta_section(S)), `>frog\nACGGGGTACG\n>duck\nACGTTAG`,_).
S = section(frog, 'ACGGGGTACG') ;
S = section(duck, 'ACGTTAG') ;
false.

fasta_section//1 is mean to match a definite sequence. To get all on backtracking we must provide a backtrack point. In this case, string//1 from library(dcg/basics) does the job

Sorry the commas were a typo in the question. I am now using double quotes for these things and back quotes for strings as it seems to be the most compatible way of doing things.. — user27815, Jul 11 '15 at 15:00
What about if I wanted to use fasta_section//1 and back track to find each solution? — user27815, Jul 11 '15 at 17:39

What is the general pattern for creating a dcg for file input?

1 Answers1