Regex to match the longest repeating substring

Question

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.

What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.

As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.

Further examples:

56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match

What I would get using current expression (same strings used):

no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match

This question will be easier to answer if you provide a series of examples and what you want the result to be. — John Feminella, Feb 09 '12 at 18:44
@JohnFeminella Given a string, he wants to find the longest possible substring that matches the pattern `(.+)(\1)`. So for the input `yabyababyab`, he'd want to find the substring `abyababyab` (and not `yabyab`). — sepp2k, Feb 09 '12 at 18:49
`/^.*?(.+)(\1).*?$/` is identical to `/(.+)(\1)/`. ( Except that the first one runs slower ) — Brad Gilbert, Feb 09 '12 at 20:31
@BradGilbert true, but only in context of finding repeating substring. The first one will do that AND produce matching string with former string (meaning that if no pattern was found, there won't be match between those two).. But you would be true again if you told that cheking for zero-sized group 1 would be better alternative, it's just how the problem was laid down in first place. — Raven, Feb 10 '12 at 13:38
@Raven The only time my statement is not true is when there is an embedded newline, and you have neither [`/m` or `/s`](http://perldoc.perl.org/perlre.html#Modifiers) set. (Unless you are talking about setting [`$&`](http://perldoc.perl.org/perlvar.html#%24%26), which you probably shouldn't use.) — Brad Gilbert, Feb 14 '12 at 05:08

Qtax · Accepted Answer · 2012-02-09T19:58:39.800

14

In Perl you can do it with one expression with help of (??{ code }):

$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;

Output:

What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.

To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.

The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).

Thus the full expression would have the form:

(?=(.+)\1)(?!.+?(..{N,})\2}))

With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).

Usage example:

use v5.10;

sub longest_rep{
    $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}

say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';

Output:

edited Feb 09 '12 at 19:58

answered Feb 09 '12 at 19:10

Qtax

33,241
9
83
121

1

That is a stunning regex, and I'd be interested if you would explain it in more detail. However, it does not work for me for the values `2323191919191919` and `22010110100`, where there is a shorter match first. – TLP Feb 09 '12 at 19:15
@TLP, added a description. It works for me, what are you getting? – Qtax Feb 09 '12 at 19:23
`perl -lnwe '/(?=(.+)\1)(?!(??{ ".+?(.{" . (1+length $^N) . ",})\1" }))\1/ && print $1;'` running in the shell, I paste in the numbers the OP gave. Note that I changed the single quotes. – TLP Feb 09 '12 at 19:24
1

seems to work for me too, going to test some other casses but this is quite impressive! – Raven Feb 09 '12 at 19:27
@Qtax It seems to work when used in a script, so I guess my test was off. – TLP Feb 09 '12 at 19:32
@TLP one question.. that description on perlre mentions that this is only experimental. I'm not afraid it will change as you proved it does miracles already, but is there any info about what minimal Perl version is required? – Raven Feb 09 '12 at 19:36
@Raven, it's been there since version 5.6 or 5.8 at least, and still here unchanged. It's not going anywhere any time soon. ;-) (I do `use v5.12;` only for `say`, think that was added in 5.10 tho.) – Qtax Feb 09 '12 at 19:38

Tim Pietzcker · Answer 2 · 2013-07-22T20:29:20.673

2

You can do it in a single regex, you just have to pick the longest match from the list of results manually.

def longestrepeating(strg):
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)

This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):

>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

edited Jul 22 '13 at 20:29

answered Feb 09 '12 at 18:54

Tim Pietzcker

328,213
58
503
561

1

This looks like Python and not Perl. – Pavel Nov 14 '17 at 16:19
2

@Pavel: You're right, I must have overlooked the Perl tag. Funny that you're the first one to notice after over 5 years. I'll leave the answer up anyway since I think it's easy to understand and can be applied in many languages other than Perl as well. – Tim Pietzcker Nov 15 '17 at 11:43

score 0 · Answer 3 · edited Jul 10 '18 at 13:15

Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.

One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:

my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
  if($s =~ m/(.{$i})\1/)
  {
    $one = $1;
    last;
  }
}
# now $one is '101'

score 0 · Answer 4 · answered Feb 09 '12 at 19:11

Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.

It's pretty rudimentary code, but hopefully you'll get the gist of it.

use v5.10;
use strict;
use warnings;

while (<DATA>) {
    chomp;
    print "$_ : ";
    my $longest = foo($_);
    if ($longest) {
        say $longest;
    } else {
        say "No matches found";
    }
}

sub foo {
    my $num = shift;
    my @hits;
    for my $i (0 .. length($num)) {
        my $part = substr $num, $i;
        push @hits, $part =~ /(.+)(?=\1)/g;
    }
    my $long = shift @hits;
    for (@hits) {
        if (length($long) < length) {
            $long = $_;
        }
    }
    return $long;
}

__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

score 0 · Answer 5 · answered Feb 10 '12 at 21:35

Not sure if anyone's thought of this...

my $originalstring="pdxabababqababqh1234112341";

my $max=int(length($originalstring)/2);
my @result;
foreach my $n (reverse(1..$max)) {
    @result=$originalstring=~m/(.{$n})\1/g;
    last if @result;
}

print join(",",@result),"\n";

The longest doubled match cannot exceed half the length of the original string, so we count down from there.

If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.

my $n;
foreach (1..$max) {
    unless (@result=$originalstring=~m/(.{$_,})\1/g) {
        $n=--$_;
        last;
    }
}
@result=$originalstring=~m/(.{$n})\1/g;

print join(",",@result),"\n";

Regex to match the longest repeating substring

5 Answers5

Linked

Related