Capture matching substrings from a DCG

Question

Using regular expressions makes it quite easy to capture sub strings, e.g. the string "Jaco was an American bassist" matches this regular expression (PCRE2 syntax):

(?sm)^([Jj]aco).+(was|is).+?(American|famous).+(dancer|bassist|singer|programmer|dueller)

and captures these strings

Jaco
was
American
bassist.

Here is a DCG that matches the string as well as generating all the possible strings. But it doesn't capture the specific sub strings.

jaco_bassist --> ("J" ; "j"), "aco", space, ("was" ; "is"), space, ("a" ; "an"), space,
                   ("American" ; "famous"), space,
                   ("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller").
space --> " ".

What would be the best - or at last a good - way of getting the same captures using Prolog's DCGs. Preferably an approach that also generates the possible strings.

For simple problems like this one can use member/2 to enumerate all the alternatives:

jaco_bassist2([Name,WasIs,Adj,Noun]) -->  who(Name), space, was_is(WasIs), space,
                                          ("a" ; "an"), space, adj(Adj), space,
                                          noun(Noun).

who(Who) --> [Who], {member(Who,["Jaco","jaco"])}.
was_is(WasIs) --> [WasIs], {member(WasIs,["was","is"])}.
adj(Adj) --> [Adj], {member(Adj,["American","famous"])}.
noun(Noun) --> [Noun], {member(Noun,["dancer","bassist","singer","programmer","dueller"])}.

To get the captures:

 % ...
 phrase(jaco_bassist2,[Who,WasIs,Adj,Noun], String)

A major drawback of this approach is that for more complex structures the enumeration can be a little tricky, for example if the name in the subject string instead of "[Jj]aco" would be one of the 48 spellings of my last name (kjellerstrand):

kjellerstrand --> "k", ("je" ; "ä"), "ll", ("" ; "er" ; "ar"), 
                  ("st" ; "b"), ("" ; "r"), "a", (""; "n"), "d".

Please note that I'm looking for "basic" DCG, for example those supported by e.g. B-Prolog (i.e. not requiring SWI-Prolog's fancy DCG stuff).

false · Accepted Answer · 2022-10-28T06:23:57.330

Let me re-phrase that: Given a goal phrase(NT__0, Cs0,Cs), capture the sequence described by NT__0.

First of all we need to restrict ourselves to DCGs without semicontext. For a (non-empty) semicontext may be represented with two variables (which in that context do not form a difference) but cannot be captured with a single list.

append(Capture, Cs, Cs0) should be it. At least declaratively when considering only ground terms.

as --> "" | "a", as.

?- Cs = [], phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
   Cs = [], Cs0 = [], Capture = []
;  Cs = [], Cs0 = "a", Capture = "a"
;  Cs = [], Cs0 = "aa", Capture = "aa"
;  Cs = [], Cs0 = "aaa", Capture = "aaa"
;  Cs = [], Cs0 = "aaaa", Capture = "aaaa"
;  ... .
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  Cs0 = [_A|Cs0], Cs = [_A|Cs0], Capture = [_A], unexpected
;  Cs0 = [_A,_B|Cs0], Cs = [_A,_B|Cs0], Capture = [_A,_B], unexpected
;  ... .
?- set_prolog_flag(occurs_check,true).
   true.
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  loops, unexpected.

So far, the procedural reality of Prolog is a bit different. append/3 only works for lists but not for partial lists. There infinite, rational trees show up. And the occurs-check does not help that much, it just prevents the display of such answers, but keeps non-termination.

Time for a new version of append/3, append2u/3

?- set_prolog_flag(occurs_check,false).
   true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  Cs0 = [a|Cs0], Cs = [a|Cs0], Capture = [], unexpected
;  Cs0 = [a|Cs], Capture = "a", dif:dif(Cs,[a|Cs])
;  Cs0 = [a,a|Cs0], Cs = [a,a|Cs0], Capture = [], unexpected
;  Cs0 = [a,a|Cs], Capture = "aa", dif:dif(Cs,[a,a|Cs])
;  Cs0 = [a,a,a|Cs0], Cs = [a,a,a|Cs0], Capture = [], unexpected
;  ... .
?- set_prolog_flag(occurs_check,true).
   true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
   Cs0 = Cs, Capture = []
;  Cs0 = [a|Cs], Capture = "a"
;  Cs0 = [a,a|Cs], Capture = "aa"
;  Cs0 = [a,a,a|Cs], Capture = "aaa"
;  ... .

So with the help of the occurs-check it is possible to get this right, also for the more general case. A new non-terminal phrase_capture//2 now uses the following internal definition:

phrase_capture(NT__0, Capture, S0,S) :-
   phrase(NT__0, S0,S1),
   append2u(Capture, S1, S0),
   S1 = S.

For systems without a built-in occurs-check like B, rewrite append2u/3 using unify_with_occurs_check/2 explicitly. That is, also for (\=)/2.

Some further optimizations may be done to avoid costs that depend on the size of Cs0+Cs instead of the length of Capture. Like special casing for var(Cs), Cs == [], and partial strings. If Cs is a list constructor, an internal implementation may also just skip through Cs0 to find that very address of Cs first, and only resort to more costly means otherwise. Care must be given to ensure that this is always terminating, thus using mechanisms similar to '$skip_max_list'/4.

Also, what to do if Cs0 and Cs do not fit, that is, if they are not the result of a valid grammar. Such a case may happen with generalizations to explain unexpected failure.

Usage:

jaco_bassist([Name,WasIs,Adj,Noun]) -->
   phrase_capture( (("J" ; "j"), "aco"), Name),
   space,
   phrase_capture( ("was" ; "is"), WasIs),
   space,
   ("a" ; "an"),
   space,
   phrase_capture( ("American" ; "famous"), Adj),
   space,
   phrase_capture( ("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller"), Noun).

?- phrase(jaco_bassist(D), Ys).
   D = ["Jaco","was","American","dancer"], Ys = "Jaco was a American ..."
;  D = ["Jaco","was","American","bassist"], Ys = "Jaco was a American ..."
;  D = ["Jaco","was","American","singer"], Ys = "Jaco was a American ..."
;  ...
;  D = ["jaco","is","famous","dueller"], Ys = "jaco is an famous d ...".

So this version terminates also when generating strings. And it has the potential to incur costs that are in many cases only depending on the length of the captured string. The original version using append/3 will always visit the entire string.

Lest I forget, there will always be some oddities should you be into infinite lists. Think of:

?- phrase("abc",L0,L0).
   L0 = [a,b,c|L0].
?- phrase("abc",L0,L0), phrase(phrase_capture("abc",Capture),L0,L).
   L0 = [a,b,c|L0], Capture = [], L = [a,b,c|L0], unexpected.
   L0 = [a,b,c|L0], Capture = "abc", L = [a,b,c|L0]. % expected

These are all typical paradoxa that infinite lists ensue. First luring people into them only to smother them.

The following version of phrase_capture//2 does not rely on internal details. It uses the ^s of library(lambda) which are responsible for parameter passing only. (The other lambda-related construct \ is for renaming.)

phrase_capture(NT__0, Capture) -->
   call(S0^S0^true),
   NT__0,
   call(S1^S1^true),
   {append2u(Capture, S1, S0)}.

Thanks for this. However, I might be dense but I fail to understand how I can use this to actually capture some selected parts. — hakank, Oct 05 '22 at 08:32
Thanks for the update on Usage. In B-Prolog the "reif" library is not supported, so I have to use `append/3`. Then it certainly captures the sub strings for string "Jaco was an American bassist". That's great. However, regarding generation of solutions, in B-Prolog (as in Picat) it just generates the first solution "Jaco was a American dancer" and then loops forever. But I can still use this for capturing which is what I asked for. — hakank, Oct 05 '22 at 09:50
And now I've got `append2u/3` to work in B-Prolog (and Picat) so it now generates all solutions as well as terminates properly. Great! — hakank, Oct 05 '22 at 11:20
@hakank: It's such a pity that B is no longer developed. It has many advantages over the other abstract machines. Just think of the way built-ins and couroutining is implemented. — false, Oct 05 '22 at 15:40
I agree. Personally I'm quite happy that Neng-Fa created Picat which is kind of a continuation of B-Prolog, It support much of B-Prolog's feature (except for certain meta-predicates). For example, your code above works in Picat with just a few modifications. — hakank, Oct 05 '22 at 16:04
@hakank: Did you have a look at [Indexing dif/2](https://arxiv.org/abs/1607.01590) — false, Oct 06 '22 at 07:33
Thanks for the pointer. I'll (re-)read it and test with the `if_/3` from the paper. — hakank, Oct 06 '22 at 08:33
@hakank: [Another](https://stackoverflow.com/q/33328762/772868) effectively still open question. — false, Oct 06 '22 at 08:43

brebs · Answer 2 · 2022-10-07T12:53:17.347

Isn't this as simple as:

% Show lists of codes as text (if 3 chars or longer)
:- portray_text(true).

sentence([P, T]) --> person(P), space, tense(T).

person(N, DL, T) :- 
    member(N, [`Jaco`, `jaco`]),
    list_to_dl(N, DL, T).

tense(N, DL, T) :-
    member(N, [`was`, `is`]),
    list_to_dl(N, DL, T).

space --> " ".

list_to_dl([], T, T).
list_to_dl([H|T], [H|T2], Tail) :-
    list_to_dl(T, T2, Tail).

Results in swi-prolog (so you'll have to tweak the quoting to suit your Prolog system):

?- time(phrase(sentence(S), `Jaco is`)).
% 25 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 797728 Lips)
S = [`Jaco`,[105,115]] ;
% 7 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 311360 Lips)
false.

... and it can generate:

?- time(phrase(sentence(S), L)).
% 24 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 767043 Lips)
S = [`Jaco`,`was`],
L = `Jaco was` ;
% 7 inferences, 0.000 CPU in 0.000 seconds (75% CPU, 392971 Lips)
S = [`Jaco`,[105,115]],
L = `Jaco is` ;
% 17 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 667504 Lips)
S = [`jaco`,`was`],
L = `jaco was` ;
% 8 inferences, 0.000 CPU in 0.000 seconds (62% CPU, 460750 Lips)
S = [`jaco`,[105,115]],
L = `jaco is`.

To handle the surname - can use term expansion to automate the string duplication:

sentence([P, T, SN]) -->
    dcg(person, P), space, dcg(tense, T), space, surname(SN).

space --> " ".

surname(SN) -->
    dcg(sn1, SN1), dcg(sn2, SN2), dcg(sn3, SN3),
    dcg(sn4, SN4), dcg(sn5, SN5), dcg(sn6, SN6),
    dcg(sn7, SN7), dcg(sn8, SN8), dcg(sn9, SN9),
    { append([SN1, SN2, SN3, SN4, SN5, SN6, SN7, SN8, SN9], SN) }.

term_expansion(expand(Name, Codes), [dcg(Name, Codes) --> Codes]).

expand(person, `Jaco`).
expand(person, `jaco`).

expand(tense, `was`).
expand(tense, `is`).

expand(sn1, `k`).
expand(sn2, `je`).
expand(sn2, `ä`).
expand(sn3, `ll`).
expand(sn4, ``).
expand(sn4, `er`).
expand(sn4, `ar`).
expand(sn5, `st`).
expand(sn5, `b`).
expand(sn6, ``).
expand(sn6, `r`).
expand(sn7, `a`).
expand(sn8, ``).
expand(sn8, `n`).
expand(sn9, `d`).

... which can both parse and generate - results in swi-prolog:

?- time(phrase(sentence(S), `jaco is kjellerstrand`)).
% 61 inferences, 0.000 CPU in 0.000 seconds (89% CPU, 1618037 Lips)
S = [`jaco`,[105,115],`kjellerstrand`] ;
% 5 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 295299 Lips)
false.

?- time(phrase(sentence(S), L)).
% 54 inferences, 0.000 CPU in 0.000 seconds (90% CPU, 1390570 Lips)
S = [`Jaco`,`was`,`kjellstad`],
L = `Jaco was kjellstad` ;
% 37 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 1141236 Lips)
S = [`Jaco`,`was`,`kjellstand`],
L = `Jaco was kjellstand` ;
% 39 inferences, 0.000 CPU in 0.000 seconds (87% CPU, 1291519 Lips)
S = [`Jaco`,`was`,`kjellstrad`],
L = `Jaco was kjellstrad` ;
% 38 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 1573173 Lips)
S = [`Jaco`,`was`,`kjellstrand`],
L = `Jaco was kjellstrand` ;
% 38 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 1382774 Lips)
S = [`Jaco`,`was`,`kjellbad`],
L = `Jaco was kjellbad`
etc.

Thanks. This is a little better than my own `member/2` approach in the question. However, using `member/2` to enumerate all variants is quite unmanageable for more complex patterns, such as all the variants of my last name. — hakank, Oct 07 '22 at 11:16
I tested your term-expansion approach in B-Prolog and it throws an error on `term_expansion`. SICStus Prolog, gprolog and Scryer Prolog) croaked with different errors on this as well. Also, it's the approach rather more complex than false's approach since one have to handle each part both with `expand/2` and also combine the parts using `dcg/2`. — hakank, Oct 07 '22 at 15:46

Capture matching substrings from a DCG

2 Answers2