-2

Is it possible to write a grep -P (PCRE) command that prints the lines containing only A and B such that there are exactly n A's followed by exactly n B's and no other characters. Such that these are valid matches:

AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB

while these are not:

AAABB
ABBBBBB
BBBA
ABABA
BBBBBBBB
zdim
  • 64,580
  • 5
  • 52
  • 81

1 Answers1

3

With normal regular expressions, you can't do this - they can only match regular context-free languages (Type 3 in the Chomsky hierarchy of languages), while what you want to match is a classic example of a type 2 language.

Luckily, perl regular expressions aren't very regular in the formal language theory sense. You can match this using a recursive regular expression:

$ perl -ne 'print if /^((?>A(?1)B|))$/' input.txt
AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB
$ grep -P '^((?>A(?1)B|))$' input.txt  
AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB

(Where input.txt contains all your test cases).

This matches either an empty string (0 A's followed by 0 B's), or a string starting with A, a successful recursive match of the pattern against the rest of the string minus the first and last characters, and ending with a B. If a B appears before an A, an A after a B, or the total number of A's and B's don't match, it thus fails. (?>regex) is an optimization that prevents backtracking after a match failure.

If you want to enforce n >= 1, a slight variation to lift one pair of A and B outside of the recursive section: ^A((?>A(?1)B|))B$.

Shawn
  • 47,241
  • 3
  • 26
  • 60