2

How can I write regular expression in which whenever there is 0000 there should be 1111 after this for example:

00101011000011111111001111010 -> correct
0000110                       -> incorect
11110                         -> correct

thanks for any help

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
gruber
  • 28,739
  • 35
  • 124
  • 216

3 Answers3

3

If you are using Perl, you can use a zero-width negative-lookahead assertion:

#!/usr/bin/perl

use strict; use warnings;

my @strings = qw(
    00101011000011111111001111010
    00001111000011
    0000110
    11110
);

my $re = qr/0000(?!1111)/;

for my $s ( @strings ) {
    my $result = $s =~ $re ? 'incorrect' : 'correct';
    print "$s -> $result\n";
}

The pattern matches if there is a string of 0000 not followed by at least four 1s. So, a match indicates an incorrect string.

Output:

C:\Temp> s
00101011000011111111001111010 -> correct
00001111000011 -> incorrect
0000110 -> incorrect
11110 -> correct
Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
  • @Johannes Updated my answer. Yes, `00001111000011` is marked as incorrect. – Sinan Ünür Jun 19 '10 at 23:21
  • 1
    Argh, sorry. Now I see. By negating the match condition you're saving yourself a whole lot of pain. Nice one. I was desperately trying my own RE to match repeatedly but the invalid condition matches were always eaten by the filler code :-) – Joey Jun 19 '10 at 23:22
1

While some languages' alleged "regular expressions" actually implement something quite different (generally a superset of) what are called regular expressions in computer science (including e.g. pushdown automatas or even arbitrary code execution within "regexes"), to answer in actual regex terms is, I think, best done as follows:

regular expressions are in general a good way to answer many questions of the form "is there any spot in the text in which the following pattern occurs" (with limitations on the pattern, of course -- for example, balancing of nested parentheses is beyond regexes' power, although of course it may well not be beyond the power of arbitrary supersets of regexes). "Does the whole text match this pattern" is obviously a special case of the question "does any spot in the text match this pattern", given the possibility to have special markers meaning "start of text" and "end of text" (typically ^ and $ in typical regex-pattern syntax).

However, the question "can you check that NO spot in the text matches this pattern" is not an answer which regex matching can directly answer... but, adding (outside the regex) the logical operation not obviously solves the problem in practice, because "check that no spot matches" is clearly the same "tell me if any spot matches" followed by "transform success into failure and vice versa" (the latter being the logical not part). This is the key insight in Sinan's answer, beyond the specific use of Perl's negative-lookahead (which is really just a shortcut, not an extension of regex power per se).

If your favorite language for using regexes in doesn't have negative lookahead but does have the {<number>} "count shortcut", parentheses, and the vertical bar "or" operation:

00001{0,3}([^1]|$)

i.e., "four 0s followed by zero to three 1s followed by either a non-1 character or end-of-text" is exactly a pattern such that, if the text matches it anywhere, violates your constraint (IOW, it can be seen as a slight expansion of the negative-lookahead shortcut syntax). Add a logical-not (again, in whatever language you prefer), and there you are!

Community
  • 1
  • 1
Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
0

There are several approaches that you can take here, and I will list some of them.

Checking that for all cases, the requirement is always met

In this approach, we simply look for 0000(1111)? and find all matches. Since ? is greedy, it will match the 1111 if possible, so we simply check that each match is 00001111. If it's only 0000 then we say that the input is not valid. If we didn't find any match that is only 0000 (perhaps because there's no match at all to begin with), then we say it's valid.

In pseudocode:

FUNCTION isValid(s:String) : boolean
   FOR EVERY match /0000(1111)?/ FOUND ON s
      IF match IS NOT "00001111" THEN
         RETURN false

   RETURN true

Checking that there is a case where the requirement isn't met (then oppose)

In this approach, we're using regex to try to find a violation instead, Thus, a successful match means we say the input is not valid. If there's no match, then there's no violation, so we say the input is valid (this is what is meant by "check-then-oppose").

In pseudocode

isValid := NOT (/violationPattern/ FOUND ON s)

Lookahead option

If your flavor supports it, negative lookahead is the most natural way to express this pattern. Simply look for 0000(?!1111).

No lookahead option

If your flavor doesn't support negative lookahead, you can still use this approach. Now the pattern becomes 00001{0,3}(0|$). That is, we try to match 0000, followed by 1{0,3} (that is, between 0-3 1), followed by either 0 or the end of string anchor $.

Fully spelled out option

This is equivalent to the previous option, but instead of using repetition and alternation syntax, you explicitly spell out what the violations are. They are

00000|000010|0000110|00001110|
0000$|00001$|000011$|0000111$

Checking that there ISN'T a case where the requirement isn't met

This relies on negative lookahead; it's simply taking the previous approach to the next level. Instead of:

isValid := NOT (/violationPattern/ FOUND ON s)

we can bring the NOT into the regex using negative lookahead as follows:

isValid := (/^(?!.*violationPattern)/ FOUND ON s)

That is, anchoring ourself at the beginning of the string, we negatively assert that we can match .*violationPattern. The .* allows us to "search" for the violationPattern as far ahead as necessary.


Attachments

Here are the patterns showcased on rubular:

The input used is (annotated to show which ones are valid):

+ 00101011000011111111001111010
- 000011110000
- 0000110
+ 11110
- 00000
- 00001
- 000011
- 0000111
+ 00001111

References

polygenelubricants
  • 376,812
  • 128
  • 561
  • 623