8

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.

I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.

georg
  • 211,518
  • 52
  • 313
  • 390

8 Answers8

12

Use negative lookahead and negative lookbehind.

This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken (see this link).

English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."

Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:

The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"

This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.


Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]

Community
  • 1
  • 1
Devin Jeanpierre
  • 92,913
  • 4
  • 55
  • 79
  • Okay. I think I agree with your solution, except for the case of matches at the start of the string. – Louis Wasserman Apr 25 '12 at 17:03
  • @LouisWasserman can you give an example? I couldn't think of any, but that doesn't really say much. EDIT: I suspect your regex engine doesn't let lookbehind assertions fail if they try to go before the string? Sounds flawed. Works fine in Python. – Devin Jeanpierre Apr 25 '12 at 17:06
  • 1
    The first one works in [regex](http://pypi.python.org/pypi/regex), which does support groups in assertions. The second one works in both `regex` and `re`. Thanks! – georg Apr 25 '12 at 18:51
6

I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).

But that said...I suspect the simplest cross-language solution is just write it yourself without using regexes.

UPDATE:

^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
  • This does match 'bbbb' for me (using python.regex). Could you provide a complete working example (in any language)? – georg Apr 25 '12 at 16:25
  • 3
    I think that it matches because it still catches the last 4 b's ... Same with the answer posted by Joe – mgilson Apr 25 '12 at 16:26
  • Hmmmm. `(.)(?!\\1)(.)\\2{3,3}+(?!\\2)` matches `xaaaa` and `zcccc`, which is probably a good sign. Still working on it. – Louis Wasserman Apr 25 '12 at 16:40
2

It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!

Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:

use strict;
use warnings;

my $str = 'xxaaaayyybbbbbzzccccxx';

while ( $str =~ /((.)\2*)/g ) {
  next unless length $1 == 4;
  my $substr = $1;
  print "$substr\n";
}

output

aaaa
cccc
Borodin
  • 126,100
  • 9
  • 70
  • 144
2

Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.

sub runs_of_length {
  my($n,$str) = @_;

  my $n_minus_1 = $n - 1;
  my $_run_pattern = qr/
    (?:
       # In the middle of the string, we have to force the
       # run being matched to start on a new character.
       # Otherwise, the regex engine will give a false positive
       # by starting in the middle of a run.
       (.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
       #$1 $2        $3

       # Don't forget about a potential run that starts at
       # the front of the target string.
           ^(.)      (\4{$n_minus_1}) (?!\4)
       #    $4       $5
    )
  /x;

  my @runs;
  while ($str =~ /$_run_pattern/g) {
    push @runs, defined $4 ? "$4$5" : "$2$3";
  }

  @runs;
}

A few test cases:

my @tests = (
  "xxaaaayyybbbbbzzccccxx",
    "aaaayyybbbbbzzccccxx",
  "xxaaaa",
    "aaaa",
  "",
);

$" = "][";
for (@tests) {
  my @runs = runs_of_length 4, $_;
  print qq<"$_":\n>,
        "  - [@runs]\n";
}

Output:

"xxaaaayyybbbbbzzccccxx":
  - [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
  - [aaaa][cccc]
"xxaaaa":
  - [aaaa]
"aaaa":
  - [aaaa]
"":
  - []

It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.

Greg Bacon
  • 134,834
  • 32
  • 188
  • 245
  • @Greg : I wonder if [`\K`](http://perldoc.perl.org/perlre.html) would solve the variable look-behind need. – Zaid Apr 26 '12 at 10:18
1
>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]  
>>> z  
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']

From there you can iterate through the list and check for occasions when N==4 very easily, like this:

>>> [item for item in z if len(item)==4]
['cccc', 'aaaa']
fraxel
  • 34,470
  • 11
  • 98
  • 102
1

How about this in python?

def match(string, n):
    parts = []
    current = None
    for c in string:
        if not current:
            current = c
        else:
            if c == current[-1]:
                current += c
            else:
                parts.append(current)
                current = c

    result = []
    for part in parts:
        if len(part) == n:
            result.append(part)

    return result

Testing with your string with various sizes:

match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ["bbbbb"]
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ["yyy"]
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz']

Explanation:

The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the current length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.

Anyways, I think this'll do!

Alex
  • 5,009
  • 3
  • 39
  • 73
  • Thanks, but I'm looking for a regex-based solution. +1 for the effort. – georg Apr 25 '12 at 17:19
  • You appear to have reimplemented itertools.groupby; so it looks like this'd be better written as `[c for c in ("".join(b) for a,b in itertools.groupby("xxaaaayyybbbbbzzccccxx")) if len(c) == 4]`. – Devin Jeanpierre Apr 25 '12 at 17:35
  • @DevinJeanpierre I didn't now about that, so I had to hardcode it... well! Live and learn! – Alex Apr 25 '12 at 19:59
1

Why not leave to regexp engine what it does best - finding longest string of same symbols and then check length yourself?

In Perl:

my $str = 'xxaaaayyybbbbbzzccccxx';

while($str =~ /(.)\1{3,}/g){
    if(($+[0] - $-[0]) == 4){ # insert here full match length counting specific to language
        print (($1 x 4), "\n")
    }
}
Oleg V. Volkov
  • 21,719
  • 4
  • 44
  • 68
1

In Java we can do like below code

String test ="xxaaaayyybbbbbzzccccxx  uuuuuutttttttt";

int trimLegth = 4; // length of the same characters

Pattern p = Pattern.compile("(\\w)\\1+",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE);

Matcher m = p.matcher(test);
while (m.find())
{ 
    if(m.group().length()==trimLegth) {
        System.out.println("Same Characters String " + m.group());
    }
}
georg
  • 211,518
  • 52
  • 313
  • 390
Uttesh Kumar
  • 290
  • 2
  • 10