-5

I am reading the dragon book, and got stuck on some regular expressions exercises, it is intriguing, can some one help? Please write a regular expression for the following languages:

Problem 1:
All strings of a's and b's with an even number of a's and an odd number of b's.

Problem 2:
All strings that do not contain the subsequence abb. (A Subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. e.g. baan is a subsequence of banana.)

Problem 3:
All strings of digits with no repeated digits (Repeated digits are not necessarily consecutive).

Problem 4:
All strings of digits with at most one repeated digits (Repeated digits are not necessarily consecutive).

And please use POSIX grammar so that we can both understand it well. Also I wonder if we can only use three basic language operation, i.e. union, concatenation and closure to achieve them?

Thanks.

stema
  • 90,351
  • 20
  • 107
  • 135
wangshuaijie
  • 1,821
  • 3
  • 21
  • 37
  • 2
    If I answer all four problems do I get 4 times the upvotes? I think that you should ask more specific questions, or at least show what you have tried. – Explosion Pills Feb 05 '13 at 03:43
  • 3
    Tell us what you have tried and we can help you. If we just give you the answer you're not getting anything from this at all. There are plenty of example regular expressions online if you just want to see how they work. – Michael Feb 05 '13 at 03:45
  • @ExplosionPills Those are 5 questions: *Also I wonder if we can only use three basic language operation, i.e. union, concatenation and closure to achieve them?* – Oscar Mederos Feb 05 '13 at 04:32

1 Answers1

1

If you have a strict regular expression (concatenation, union, closure), you will be able to convert it to POSIX regex. If you have a POSIX regex, you will be able to convert it to any of the more sophisticated regex engine.

Assumption (READ THIS FIRST)

This answer assumes the level of support of JavaScript regex (has look-ahead but no look-behind), but we also assume that . matches any character without exception. The regex here will assume that the whole string is matched against the pattern.

Problem 1

^(?=(?:b*ab*a)*b*$)(?=(?:a*ba*b)*a*ba*$)[ab]*$

It is possible to construct a DFA for this problem, hence, a regular expression with only concatenation, union and closure exists. (The numbers in the states in the DFA denotes the parity of number of a's and b's).

DFA Problem 1

However, the strict regular expression is quite a mess to derive; look-around simplifies the way we think.

  • (?=(?:b*ab*a)*b*$) is a look-ahead, (due to the effect of ^ and $) checking that the whole string matches (b*ab*a)*b* - which means even number of a's, with b optionally and finitely many in between and surrounding.
  • (?=(?:a*ba*b)*a*ba*$) is a look-ahead, checking that the whole string matches (a*ba*b)*a*ba* - which means odd number of b's, with a optionally and finitely many in between and surrounding.

We will need 3 passes through the string to check the string satisfies the given properties with the regex above.

With POSIX regex, you can test a string 2 times against these 2 regex to assert the 2 properties:

^(b*ab*a)*b*$
^(a*ba*b)*a*ba*$

But since a strict regular expression exists and feasible, there exists a single POSIX regex to match a string that satisfies the properties. Please read this post for the method to convert a DFA to strict regular expression.

Problem 2

^(?!.*a.*b.*b).*$

Just a look-ahead, checking that there is no way to match the sub-sequence abb.

Since the alphabet is not specified, which may be interpreted as any and all characters there are, writing this in strict regular expression is possible but impractical. However, it can be written with ease if negated character class is allowed:

([^a]*a[^b]*b[^b]*|[^a]*a[^b]*|[^a]*)

(The above is strict regular expression, but it can be re-written as POSIX regular expression easily)

DFA Problem 2

The 3 items in alternation (from right to left) correspond to 3 states in the DFA: not yet found an a; has an a, but no b ahead; has an a and b followed somewhere after, but not yet found a second b. There is another trap state (no possible transition to the goal state) in the DFA: the sub-sequence abb has been found.

Problem 3

^(?!\d*(\d)\d*\1)\d+$

Using negative look-ahead, we will just check that there isn't a match where we can find a character being repeated. Since this is a string of digit, I disallow empty string to match by specifying at least 1 digit \d+. The key-point in this regex is the use of captured group and back-reference.

Again, there exists a strict regular expression for this, but it is impractical, since writing such regex is almost equivalent to listing out all the possible matches one by one. There are a total of Sum [i = 1..10] ( 10! / i! ) possible matches (leading 0 allowed). There is no way around it, since whether the next character is allowed or not is heavily dependent on all the characters that have been processed so far.

This is also impractical in POSIX regex, since there is no mechanism to negate a capturing group.

Problem 4

Can't come up with a better regex for this one...

^(?!(?=.*(.).*\1).*((?!\1).).*\2)(?!.*(.)(?:.*\3){2})\d+$

There are 2 conditions for failing: either the digits is repeated more than 2 times, or there are 2 different digits being repeated.

  • (?!(?=.*(.).*\1).*((?!\1).).*\2) checks that there are no 2 different digits being repeated. This look-ahead (?=.*(.).*\1) picks out one character being repeated (at least twice), and .*((?!\1).).*\2 picks a different digit than the one checked before, due to ((?!\1).), and try to find repeated digit.
  • (?!.*(.)(?:.*\3){2}) checks that a digit is repeated at most twice. The pattern inside try to match 3 instances of a digit.

This has the same problem as Problem 3, so writing a strict regular expression or POSIX regex is not practical. If you are curious, there are Sum [i = 1..10] ( 10! / i! ) + Sum [i = 1..10] ( (i + 1)C2 * 10! / i! ) possible matches (leading 0 allowed).


For problem 3 and 4, especially problem 4, it is more practical to check that the string only contains digits, then sort the digits and loop over it to check for repeated digits.

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162