3

There are many questions which ask to find the longest repeating substring:

But these don't exactly match my requirements, which are:

  • The substrings may not overlap (they are disjoint).
  • The substrings are allowed to be non-adjacent.
  • Any characters are allowed.
  • I would like to match the longest pattern like this.

So far I have this:

>>> m = re.match(".*(?P<grp>.+).*(?P=grp).*", "dhiblhip")
>>> m.group('grp')
'i'

I think this is matching the last substring which repeats itself, 'i', but that's certainly not the longest one. I'd expect the following output for the following input:

  • '123abc' -> ''
  • 'hh' -> 'h'
  • 'hihi' -> 'hi'
  • 'dhiblhip' -> 'hi'
  • 'phiblhip' -> 'hi' (note how I do not return 'p' since it is not as long as 'hi' even though it is a repeating disjoint substring.)
  • 'racecaracecar' -> 'raceca' (note how I can't recycle the middle r.) In this case, 'acecar' is just as acceptable.

I am using Python's re and would like to continue to do so, but answers in another language are not unwelcome.

Community
  • 1
  • 1
2rs2ts
  • 10,662
  • 10
  • 51
  • 95
  • 1
    @2rs2ts [Here's a solution in PHP](https://eval.in/145703). Also an [online regex demo](http://regex101.com/r/kJ2bW8), it's a bit confusing but you need to take a look at the information section. – HamZa May 02 '14 at 22:59
  • [Small correction](https://eval.in/145704) in the regex – HamZa May 02 '14 at 23:09
  • 2
    I would like to add to @HamZa's answer that it is not possible to find the largest with only regex. If you want to do it in python use something like: `matches = re.findall(r'(.+)(?=.*\1)',yourstring)` `largest = '' if not matches else max(matches,key=lambda m:len(m))` – KSab May 02 '14 at 23:16
  • @KSab Write that up in an answer so I can upvote it, and chances are unless you're wrong about doing it with just regex I'll accept it! – 2rs2ts May 02 '14 at 23:37

1 Answers1

2

Credit to @HamZa for the actual regex: (.+)(?=.*\1). This basically finds a capturing group with at least one character, and then does a non-capturing forward lookahead to make sure it repeats (that way there isn't trouble with python not finding overlapping matches).

While it is not possible to find the largest with regex alone, it is pretty simple to write

matches = re.findall(r'(.+)(?=.*\1)',yourstring)
largest = '' if not matches else max(matches,key=lambda m:len(m))
KSab
  • 502
  • 2
  • 12