3

Besides strings of any number of a's and b's like aa.. or bb.. ,Would regular expression (a*+b*) contain a string like

ab

or any string ending with b ?

Is (a*+b*) same as (a* b*) ?

I am a little bit confuse about the strings generated by regular expression (a*+b*) and would really appreciate if someone can help.

Deen John
  • 3,522
  • 4
  • 29
  • 32
  • 1
    Sorry I don't think a*+ is a valid regular expression nor a language expression in computation theory (sorry I've only take one degree class in the University about this topic). If you want one that express one or more 'a' followed by one or more 'b', it should be a+b+. If you allows only 'a' without 'b' and only 'b' without 'a', or even an empty string, it will be a*b*. So... – Ken Cheung Dec 12 '15 at 14:19
  • `*+` is a possessive greedy quantifier. Have fun: http://www.regular-expressions.info/possessive.html – hek2mgl Dec 12 '15 at 14:22
  • Thanks for the reading, and pretty new for me. I use regex fully base on FA's concept. – Ken Cheung Dec 12 '15 at 14:55
  • 4
    @hek2mgl:it is possible that your interpretation is correct, but i suspect that OP is using a formal languages textbook and `+` is being used as the disjunction operator ("or") which is common in maths. – rici Dec 12 '15 at 15:01
  • If its `a*` _or_ `b*` the equivalent expression is `a*|b*`. If this is the only expression, this would match a substring of all `a`'s or all `b`'s but not both. If its `a*b*` this would have the same affect as the other one with the addition there could be a mix of `a`'s then `b`'s. This expression `a*+b*` as a regex uses a slightly complex operator, where the `+` is a modifier of a quantifier. In this case it tells the backtracking part of the engine to not give any `a`'s back after it matches. This is an advanced topic and probably not what you intend. –  Dec 13 '15 at 02:30
  • I was in the middle of elaborating an answer, when I got stuck [here](http://stackoverflow.com/questions/34495675/zero-length-regexes-and-infinitely-matching) :) – Veverke Dec 28 '15 at 14:57
  • Thanks to you I learned new stuff ;) – Veverke Dec 28 '15 at 15:06
  • `a*+` is perfectly valid. `a*` a regex, to which the postfix `+` applies, because if `R` is any regex whatsoever then `R+` means one or more of that. Of course, a given regex language can explicitly stipulate that `*+` is treated as a token, which either has no assigned meaning (reserved for future use) or given some special meaning. – Kaz Jan 27 '16 at 23:12
  • And note, people, that even in regex dialects in which `*+` is a special thing, you can nevertheless write `(a*)+`!!! – Kaz Jan 27 '16 at 23:15

2 Answers2

4

Unless you're working with a regex language which explicitly classifies the *+ as a special token which either has a special meaning, or is reserved for future extension (and produces defined behavior now, or a syntax error), the natural parse of a*+ is that it means (a*)+: the postfix + is applied to the expression a*.

If that interpretation applies, next we can observe that (a*)+ is equivalent to just a*. Therefore a*+b* is the same as a*b*.

Firstly, by definition R+ means RR*. Match one R and then zero or more of them. Therefore, we can rewrite (a*)+ as (a*)(a*)*.

Secondly, * is idempotent, so (a*)* is is just (a*). If we match "zero or more a", zero or more times, nothing changes; the net effect is zero or more a. Proof: R* denotes this infinite expansion: (|R|RR|RRR|RRRR|RRRRR|...): match nothing, or match one R, or match two R's, ... Therefore, (a*)* dentes this expansion: (|a*|a*a*|a*a*a*|...). These inner a*-s in turn denote individual second-level expansions: (|(|a|aa|aaa|...|)|(|a|aa|aaa|...)(a|a|aaa|...))|...). By the associative property of the branch |, we can flatten a structure like (a|(b|c)) into (a|b|c), and when we do this to the expansion, we note that there are numerous identical terms—the empty regex (), the single a, the double aa and so on. These all reduce to a single copy, because (|||) is equivalent to () and (a|a|a|a|...) is equivalent to just (a) and so on. That is to say, when we sort the terms by increasing length, and squash multiple identical terms to just one copy, we end up with (|a|aa|aaa|aaaa|...), which is recognizable as the expansion of just a*. Thus (a*)* is a*.

Lastly, (a*)(a*) just means a*. Proof: Similarly to the previous, we expand into branches: (|a|aa|aaa|...)(|a|aa|aaa|...). Next we note that the catenation of branch expressions is equivalent to a Cartesian Product set of the terms. That is to say (a|b|c|..)(i|j|k|...) means, precisely: (ai|aj|ik|...|bi|bj|bk|...|ci|cj|ck|...|...). When we apply this product to (|a|aa|aaa|...)(|a|aa|aaa|...) we get a proliferation of terms, which, when arranged in increasing length and subject to de-duplication, reduce to (|a|aa|aaa|aaaa|...), and that is just a*.

Kaz
  • 55,781
  • 9
  • 100
  • 149
1

I think it helps here to look at a formal definition of regular expressions, i.e. to look for each regular expression e which language L(e) does it produce.

So let's start simple:

(1) What about the regexp a (only the letter)? Its language is

 L(a) := {a},

just the single word/character "a".

(2) For the regexp e1 + e2, where e1 and e2 are regexps themselves,

L(e1 + e2) := L(e1) U L(e2).

So e.g. if a and b are characters, L(a+b) = {a, b}.

(3) For the regexp e1 e2 (concatenation), where e1 and e2 are regexps themselves,

L(e1 e2) := all words w such that 
we can write w = w_1w_2 with w_1 in L(e1) and w_2 in L(e2)".

(4) What about a regular expression *e**, where e might be a regular expression itself? Intuitively, a word is in L(e*) if it has the form w_1 w_2w_3w_4...w_n, with w_i in L(e) for each i. So

L(e*) := all words w such that we can write 
         w = w_1 w_2 .. w_n 
           for a n >= 0 with all w_i in L(e) (for i = 1, 2, ..., n)

So, what about L((a* + b*))?

L((a* + b*)) 
(according to rule 2)
= L(a*) U L(b*)
(according to rule 4/1)
= {eps, a, aa, aaa, aaaa, ....} U {eps, b, bb, bbb, bbbb}
= all strings that have either only a's OR only b's in it 
  (including eps, the so-called empty word)

Similarly for (a* b*):

 L((a* b*))
 (according to rule 3)
 = all words w = w_1 w_2 with w_1 in L(a*) and w_2 in L(b*)
 = {eps eps, eps b, a eps, ab, aa eps, aab, ...}
 = {eps, b, a, ab, aa, aab, aabb, ... }
 = all strings that first have zero or more a's, then zero or more b's.

For the beginning I think it helps to "deconstruct" the regular expression, as we did above - since regular expressions can also be seen as trees, just like the more known arithmetic expressions, for example:

    +
  /   \
 *     *
 |     |
 a     b
Pachelbel
  • 523
  • 3
  • 12