grep -P to find lines containing exactly n A's followed by exactly n B's

Question

Is it possible to write a grep -P (PCRE) command that prints the lines containing only A and B such that there are exactly n A's followed by exactly n B's and no other characters. Such that these are valid matches:

AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB

while these are not:

AAABB
ABBBBBB
BBBA
ABABA
BBBBBBBB

Please rephrase as a question. Also, you have perl tagged but this is a grep question. — user3243135, Feb 24 '21 at 05:58
What have you tried? What problems are you having? Please show us your code. — Dave Cross, Feb 24 '21 at 08:19
Indeed, an exact duplicate as @JamesBrown linked. With a valid answer there. Voted to close — zdim, Feb 27 '21 at 20:19

score 3 · Answer 1 · answered Feb 24 '21 at 08:51

With normal regular expressions, you can't do this - they can only match regular context-free languages (Type 3 in the Chomsky hierarchy of languages), while what you want to match is a classic example of a type 2 language.

Luckily, perl regular expressions aren't very regular in the formal language theory sense. You can match this using a recursive regular expression:

$ perl -ne 'print if /^((?>A(?1)B|))$/' input.txt
AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB
$ grep -P '^((?>A(?1)B|))$' input.txt  
AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB

(Where input.txt contains all your test cases).

This matches either an empty string (0 A's followed by 0 B's), or a string starting with A, a successful recursive match of the pattern against the rest of the string minus the first and last characters, and ending with a B. If a B appears before an A, an A after a B, or the total number of A's and B's don't match, it thus fails. (?>regex) is an optimization that prevents backtracking after a match failure.

If you want to enforce n >= 1, a slight variation to lift one pair of A and B outside of the recursive section: ^A((?>A(?1)B|))B$.

grep -P to find lines containing exactly n A's followed by exactly n B's

1 Answers1

Linked