15

The problem

I'm trying to create a regex in which we can check if all letters present in some reference set are present in some other string, but only in odd numbers (1, 3, 5, ...).

Here is a (very) crude image of the DFA representing the problem:

Odd As and Bs DFA

My (broken) solution

I started using a finite set, {a, b}, so I would essentially check "are there both an odd number of as and an odd number of bs in the string?"

Unfortunately I did not get far on my own. I first read this thread, which is remarkably similar to this concept, but was not able to glean an answer from (aa|bb|(ab|ba)(aa|bb)*(ba|ab))*(b|(ab|ba)(bb|aa)*a). (I understand how it works, but not how to convert it to check odd numbers for both items present.)

Here is what I've come up with so far: ^((ab|ba)(bb|aa)?|(bb|aa)?(ab|ba))+$. This basically checks if there is ab or ba followed by bb or aa or nothing, which would result in ab, ba, abaa, abbb, baaa, or babb. (It also does the reverse of this, checking the double-letter first.) This can then repeat, indefinitely. The problem I have is I cannot seem to adjust it to match the string bbaaba without also matching bbaa.

Additionally, the method above can not be dynamically adjusted to account for {a, b, c}, for example, though I'm willing to forgo this to solve the initial problem.

Testing

Here are my test strings and the desired output, with the reasons in parentheses:

"ba"      # True (1a, 1b)
"abbb"    # True (1a, 3b)
"bbba"    # True (1a, 3b)
"bbab"    # True (1a, 3b)
"ababab"  # True (3a, 3b)
"bbaaba"  # True (3a, 3b)
"abb"     # False (2b)
"aabb"    # False (2a, 2b)
"aabba"   # False (2b)
""        # False (0a, 0b is "even")
"a"       # False (0b is "even")
"b"       # False (0a is "even")

Question

So, is this possible through regex? Or are regular expressions more limited than a DFA? I am aware that it can be done through a basic loop, but this isn't what I'm going for.

Community
  • 1
  • 1
Cat
  • 66,919
  • 24
  • 133
  • 141
  • 3
    All problems that can be solved with a DFA can be solved with a regular expression, so at least that part is answerable – David Robinson Sep 14 '12 at 20:23
  • @DavidRobinson Good to know! I do have a DFA for this problem, so that should mean there is a solution to this particular problem, then. – Cat Sep 14 '12 at 20:26
  • @Eric: Its is possible in [perl](http://stackoverflow.com/q/3965497/776084). I tried this `my $count = () = 'bbaaba' =~ m!(?=a)!g;` and value of `$count is 3`. – RanRag Sep 14 '12 at 20:30
  • @RanRag I'm not familiar with Perl's regex; however, wouldn't that require 2 regex tests? One for `a`, and one for `b`? – Cat Sep 14 '12 at 20:31
  • @Eric:I am also not much familiar with perl's regex but according to my limited knowledge it will require 2 tests. – RanRag Sep 14 '12 at 20:33
  • It is perhaps not possible to do it with a single regex. In Perl, even if this is done by some dynamic code in the substitution part, that is essentially compressing 5 lines of code to fit into one. – SexyBeast Sep 14 '12 at 20:36
  • @Cupidvogel According to David's comment, it should then require 2 DFAs to represent... which it doesn't. – Cat Sep 14 '12 at 20:37
  • 2
    But sum of two even numbers is also even. – SexyBeast Sep 14 '12 at 20:40
  • @Cupidvogel: yeah right silly me. – RanRag Sep 14 '12 at 20:40
  • I'm having trouble visualizing the DFA in my head. Care to post a gist (or other paste) detailing the formulation? – Platinum Azure Sep 14 '12 at 20:40
  • 2
    @PlatinumAzure I opened up MS Paint (good ol' Paint!) and whipped up a very crude diagram of the DFA. I've added it to my original post. – Cat Sep 14 '12 at 20:47

3 Answers3

11

Regexes are not more limited than a DFA; in fact, they are equivalent. (Perl-style "regexes" with backreferences are strictly more powerful, so they are not "regular" at all.)

We can easily write the regex if the string contains only as:

a(aa)*

And if other letters may also occur in between, we can still do it by simply ignoring those characters:

[^a]*a([^a]*a[^a]*a)*[^a]*

Because regexes are equivalent to DFA's, we have a DFA for each of the individual letters. It's pretty simple, actually:

 [^a] _      [^a] _
     / \         / \
     | v   a     | v
---> (0) -----> ((1))
         <-----
            a

State (0) is the start state ("even number of a's seen") and state ((1)) is the only accepting state ("odd number of as seen"). If we see an a, we go to the other state; for any other character, we remain in the same state.

Now the nice thing about DFAs is that they are composable. In particular, they are closed under intersection. This means that, if we have a DFA that recognizes the language "string containing an odd number of as", and another one that recognizes the language "string containing an odd number of bs", we can combine them into a DFA that recognizes the intersection of these two languages, that is, "string containing an odd number of a's and an odd number of b's".

I won't go into detail about the algorithm but this question has some pretty good answers. The resulting DFA will have four states: "even number of as seen, even number of bs seen", "even number of as seen, odd number of bs seen", etcetera.

And because DFAs are equivalent to regexes, there also exists a regex that matches precisely those strings. Again, I won't go into details about the algorithm, but here is an article that explains it pretty well. Conveniently, it also comes with some Python 3 code to do the dirty work:

>>> from fsm import fsm
>>> a = fsm(
      alphabet = {'a', 'b'},
      states = {0, 1, 2, 3},
      initial = 0,
      finals = {3},
      map = {
        0: {'a': 1, 'b': 2},
        1: {'a': 0, 'b': 3},
        2: {'a': 3, 'b': 0},
        3: {'a': 2, 'b': 1}
      }
    )
>>> str(a.lego())
'a*(ab|b(ba*b)*(a|ba+b))((a|ba+b)(ba*b)*(a|ba+b)|ba*b)*'

There might be a bug in the library, or I'm using it wrong, because the a* at the start cannot possibly be right. But you get the idea: although it's theoretically possible, you really don't want to use regexes for this!

Community
  • 1
  • 1
Thomas
  • 174,939
  • 50
  • 355
  • 478
  • Wow, very informative, and very good links--especially that FSM link! Oddly, that regex with the `a*` does work in all my test cases (and a few others I came up with to test it out). Not sure I quite understand aspects of that regex, but this solution is very detailed. Thank you! – Cat Sep 14 '12 at 21:06
  • It can't be right, because it'll let you tack on another `a` at the beginning of your string and it will still match. – Thomas Sep 14 '12 at 21:09
  • Ah, yes. `aabbb` matches, and shouldn't. Good catch. – Cat Sep 14 '12 at 21:11
8

Here's one way to do it, using lookaheads to assert each condition in turn.

^(?=[^a]*a(?:[^a]*a[^a]*a)*[^a]*$)(?=[^b]*b(?:[^b]*b[^b]*b)*[^b]*$)(.*)$

Here's a demo with your examples. (The \ns in the demo are for presentation purposes. Also, you can drop the (.*)$ if you only need to test the match, not capture.)

I will be adding an explanation shortly.


Explanation

We only need to look at one half:

(?=  [^a]*a  (?:[^a]*a[^a]*a)  *  [^a]*$  )
|    |       |                 |  |
|    |       |                 |  Only accept non-'a's to the end.
|    |       |                 |
|    |       |                 Zero or more of these pairs of 'a's.
|    |       |
|    |       Strictly a pair of 'a's.
|    |
|    Find the first 'a'.
|
Use a lookahead to assert multiple conditions.
Andrew Cheong
  • 29,362
  • 15
  • 90
  • 145
  • 2
    That's one heck of a cryptic regex! Waiting for the explanation. – SexyBeast Sep 14 '12 at 20:41
  • I'll admit I've never been good with lookaheads, but what's the point of doing a lookahead with the match and then capturing `.*`? Couldn't you just put the lookahead expression in a basic capture and have that be your match? – Platinum Azure Sep 14 '12 at 20:43
  • I understand that you strictly find a pair of `a`s in the middle bit, but how does that prevent two `b`s? Is that what's checked explicitly in the second half of the regex? – Cat Sep 14 '12 at 20:49
  • @PlatinumAzure - Yup, you're right. I guess personally I still prefer my way as I find it a bit more readable than tracking parentheses, but you're definitely right. – Andrew Cheong Sep 14 '12 at 20:50
  • @Eric - Precisely. I find lookaheads useful for doing independent "AND" statements like these. – Andrew Cheong Sep 14 '12 at 20:51
  • Hm, I used ``[^a]`` and ``[^b]`` because I thought your "real" input would have characters other than 'a' and 'b', but if that's not the case, than this expression can probably be simplified. – Andrew Cheong Sep 14 '12 at 20:52
  • If you post the simplified version, please leave this version up as well. They're both handy for different purposes. And you've effectively answered the question, thank you! It seems I need to do more research into lookaheads... – Cat Sep 14 '12 at 20:56
  • Oh hang on, now I see you're using two of them. Yes, now we have a use for the lookaheads. When I made my comment there was only one! :-) – Platinum Azure Sep 14 '12 at 20:59
4

Yes:

^(?=b*(?:ab*ab*)*ab*$)(?=a*(?:ba*ba*)*ba*$)

Explanation:

^             # Start of string
(?=           # Assert that it's possible to match
 b*           # any number of 'b's
 (?:ab*ab*)*  # followed by an even number of 'a's with optional 'b's in-between
 ab*          # followed by one 'a' and optional 'b's
 $            # until the end of the string.
)             # End of lookahead
(?=a*(?:ba*ba*)*ba*$)  # Same thing, vice versa

The regex itself doesn't match any characters so you will always get an empty string as the match result (which is different from getting None as a match result):

>>> import re
>>> re.match("^(?=b*(?:ab*ab*)*ab*$)(?=a*(?:ba*ba*)*ba*$)", "ab")
<_sre.SRE_Match object at 0x00000000022AA7E8>
>>> re.match("^(?=b*(?:ab*ab*)*ab*$)(?=a*(?:ba*ba*)*ba*$)", "aab")
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • 2
    Hmm.. actually, it seems I get `False` for every test when using this regex. – Cat Sep 14 '12 at 20:40
  • 1
    The regex does match, but the matching string is empty. Could that be the problem? Just add `.*` at the end and try again. – Tim Pietzcker Sep 14 '12 at 20:42
  • I wonder if he's testing against strings not mentioned in his example above, which might have characters other than just 'a' and 'b'? – Andrew Cheong Sep 14 '12 at 20:53
  • @acheong87 I had considered that at first, but had removed that as a requirement to simplify the problem. – Cat Sep 14 '12 at 20:58