Matching As followed by the same number of Bs in egrep

Question

Suppose I want to match a pattern with the exact same number of characters A and B such that there are exactly n A's followed by exactly n B's. For example, the following strings can be matched.

AB
AABB
AAABBB

On the other hand, these strings cannot be matched

BA
AAABB
AABBB
ABAB

To approach the problem, I am thinking about the repetition counts, so my attempt looks like this

egrep 'A{n}B{n}'

of course, however, the repetition count n inside the curly bracket cannot be defined implicitly.

While I know how to write programs to match it, I am testing this on Mac terminal, hence I am trying to exploit any possible features of egrep to write the one sentence pattern.

So could anyone please help me solve this problem and any help will be appreciated.

This is pretty much a duplicate of [How can we match a^n b^n with Java regex?](https://stackoverflow.com/q/3644266/7586). The broader answer is that regular expressions usually have trouble counting things. — Kobi, Aug 03 '17 at 15:07
Indeed, but the tricky part of this problem is that I have to run it on Unix, using egrep. — Peter, Aug 03 '17 at 15:20
`egrep` alone won't be able to solve it. You need at least `gnu grep`. (see my answer below) — anubhava, Aug 03 '17 at 15:35

anubhava · Answer 1 · 2017-08-03T15:34:12.050

If you have gnu grep then you can use this recursive PCRE regex:

grep -P '^(A(?1)?B)$' file

AB
AABB
AAABBB

Or else, you can use this non-regex approach using awk:

awk '(n=index($0, "B")) && length(substr($0, 1, n-1)) == length(substr($0, n))' file

AB
AABB
AAABBB

This awk find presence of first B using index function and extracts 2 substrings i.e all the As and all the Bs and prints each record if length of As substring is same length of Bs substring.

score 0 · Answer 2 · answered Aug 03 '17 at 15:57

Two alternative GNU awk approaches:

-- using match function:

awk '{ match($0,/^(A+)(B+)$/,a) }length(a) && length(a[1])==length(a[2])' file

-- using FPAT variable to define field value

awk -v FPAT="A|B" '{ for(i=1;i<=NF;i++) { ($i=="A" && $(i-1)!="B")? a++:b++ } }a==b' file

The output (for both approaches):

AB
AABB
AAABBB

score 0 · Answer 3 · answered Jun 12 '22 at 05:34

We could do this in awk, with your shown samples please try following awk. Written and tested in GNU awk, should work with any version of awk.

awk '
match($0,/^A+/){
  len=RLENGTH
  match(substr($0,RSTART+RLENGTH),/^B+/)
  if(len==RLENGTH){ print }
}
'  Input_file

score 0 · Answer 4 · answered Jun 12 '22 at 07:07

if the lines only have A's and B's and nothing else, then here's an awk-based solution without needing to use match() function :

BBBBBBAAAAAAA
AAAABBBBBB
AAAAAAAAAAAABBBBBBBBBBBBBBBBBB
AABBBB
BBAAA
AAAABBBB
AAAAAAABBBBBBAAAAAABBBBBB
AAAABBB
BBAA
AAABBAABB
AAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBB
AAABBB
AAAAABBBB
AAABB
AABB
AAAAAAAAAAAABBBBBBBBBBBB
BAA
AABBAABB
AAAAAABBBBBB
AAAAAABBBBBBAAAAAABBBBBB
AAABB
AAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBB
ABAB
AAAAAAABBBB
AAAAAAABBBBBB
AAAAAABBBB
AAABB
AABB
AAAAAAAAAAAAABBBBBBBBBBBB
AAAAAAABBBBBB
AB
AAAAABBBBBB
BBBBBBAAAAAA
AAB
AABBB
AAABBB
AAAAAAAAAAAAAAAAAABBBBBBBBBBBB
AAAAAAAAAAAAABBBBBBBBBBBBBBBBBB
AAAAAAAAAAAAAAAAAAABBBBBBBBBBBB
AAAAAABBBBBB
AABAB
BA

|

 {m,g}awk '((_=length)%__)</^A+B+$/ && _==index($!__,"AB")*__' FS='^$' __=2

 1  AAAABBBB
 2  AAABBB
 3  AABB
 4  AAAAAAAAAAAABBBBBBBBBBBB
 5  AAAAAABBBBBB
 6  AAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBB
 7  AABB
 8  AB
 9  AAABBB
10  AAAAAABBBBBB

which can be quickly checked visually :

              AAAABBBB
               AAABBB
                AABB
      AAAAAAAAAAAABBBBBBBBBBBB
            AAAAAABBBBBB
AAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBB
                AABB
                 AB
               AAABBB
            AAAAAABBBBBB

Matching As followed by the same number of Bs in egrep

4 Answers4

Linked

Related