Python : Regex, Finding Repetitions on a string

Question

I need to find repetitions in a text string. I already found a very nice elegant solution here from @Tim Pietzcker

I am happy with the solution as is, but I would like to know whether it can be extended a little further so that it accepts a string containing whitespace.

For example, "a bcab c" would return [('abc', 2)].

I tried using the regex pattern "([^\s]+?)\1+" with no luck. Any help is much appreciated.

If you're in Python, you could simply do `no_whitespaces = input_str.replace(" ","")` and then run your regex on `no_whitespaces`. – e.s., Mar 21 '19 at 01:15
Hi e.s., that is one possibility, but my application needs to find the patterns in a bigger text structure. Whenever possible I would like to keep the spaces between them, because I am planning to highlight the found text once the match is made. – XYZ, Mar 21 '19 at 02:54
If you want to highlight the found text once the match is made, then per your example above, should the output be [('a bc', 2)]? If not, how are you going to highlight the text once the match is made? – sanooj, Mar 21 '19 at 05:00

sanooj · Accepted Answer · 2019-03-21T03:53:42.847

Consider removing the whitespace from the text first. You can do that with a regex as well.

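>>> import re
>>> def repetitions(s):
...    r = re.compile(r"(.+?)\1+")
...    # Strip all whitespace before searching for repeated substrings.
...    for match in r.finditer(re.sub(r'\s+',"",s)):
...        # Integer division keeps the repeat count an int on Python 3.
...        yield (match.group(1), len(match.group(0))//len(match.group(1)))
...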

Output:

>>> list(repetitions("a bcab c"))
[('abc', 2)]

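If you still want to retain the spaces in the original text, try this regex: r"(\s*\S+\s*?\S*?)\1+". It has limitations, though: because \1 must match the captured text exactly, the whitespace has to fall at the same position in every repeat, and the repeated unit can contain at most one internal run of whitespace.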

>>> def repetitions(s):
...    r = re.compile(r"(\s*\S+\s*?\S*?)\1+")
...    for match in r.finditer(s):
...        # Integer division keeps the repeat count an int on Python 3.
...        yield (match.group(1), len(match.group(0))//len(match.group(1)))
...

Results:

>>> list(repetitions(" abc abc "))
[(' abc', 2)]
>>> list(repetitions("abc abc "))
[('abc ', 2)]
>>> list(repetitions(" ab c ab c "))
[(' ab c', 2)]
>>> list(repetitions("ab cab c "))
[('ab c', 2)]
>>> list(repetitions("blablabla"))
[('bla', 3)]
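
To make the limitation concrete, here is a small sketch, written as a plain script rather than a REPL session. The pattern is the one from this answer; the constant name `SPACED_REPEAT` is only an illustrative choice. It suggests that repeats with identically placed whitespace are found, while the irregular example from the question produces no match.

```python
import re

# Same pattern as in the answer; SPACED_REPEAT is a name made up for this sketch.
SPACED_REPEAT = re.compile(r"(\s*\S+\s*?\S*?)\1+")

def repetitions(s):
    for match in SPACED_REPEAT.finditer(s):
        # Integer division keeps the repeat count an int on Python 3.
        yield (match.group(1), len(match.group(0)) // len(match.group(1)))

regular = list(repetitions("ab cab c "))   # whitespace in the same place in both repeats
irregular = list(repetitions("a bcab c"))  # whitespace placed differently in each repeat

print(regular)    # [('ab c', 2)]
print(irregular)  # []
```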

Thanks Sanooj, I ended up removing the spaces and then matching the group back against the original text with a newly compiled regex that allows spaces. For example, the match "abc" is fed into a new regex built with "\s*".join('abc'). Thanks heaps for your time again. – XYZ, Mar 21 '19 at 23:13
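
The approach described in this comment could look roughly like the sketch below. It is not the asker's actual code: the function name `repetitions_with_spans`, the running `pos` cursor, and the choice to rebuild the pattern for the whole repeated run (rather than for a single unit) are all assumptions made for illustration.

```python
import re

def repetitions_with_spans(text):
    """Yield (unit, count, (start, end)); the span indexes the original text."""
    stripped = re.sub(r"\s+", "", text)
    pos = 0  # resume each lookup where the previous hit ended
    for match in re.finditer(r"(.+?)\1+", stripped):
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        # Allow optional whitespace between any two characters of the whole run.
        spaced = r"\s*".join(re.escape(ch) for ch in unit * count)
        hit = re.compile(spaced).search(text, pos)
        if hit is None:
            continue  # not expected; kept so the sketch fails softly
        pos = hit.end()
        yield unit, count, hit.span()

result = list(repetitions_with_spans("a bcab c"))
print(result)  # [('abc', 2, (0, 8))]
```

The span can then be used to highlight the run in the original text. Note that it is the first place at or after the previous hit where the run occurs, which is good enough for highlighting but is not a strict index mapping from the stripped string.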

score 0 · Answer 2 · answered Mar 21 '19 at 01:58


Using (\S+ ?\S?)\1, you can make the pattern tolerant of spaces for strings like the one below, where the space sits at the same position in each occurrence of the repeated unit ab c.

ab cab c

However, if the spaces fall at different positions in the repeats, you have to replace the meaningless spaces with an empty string "" before your approach can find the repeated words.
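
As a quick check of both cases, the following sketch applies the pattern from this answer to the string above and to the question's example. The variable names are illustrative only.

```python
import re

# Pattern from this answer: one optional literal space inside the repeated unit.
pattern = re.compile(r"(\S+ ?\S?)\1")

same_place = pattern.search("ab cab c")  # space after "ab" in both repeats
different = pattern.search("a bcab c")   # space positions differ between repeats

print(same_place.group(1))  # ab c
print(different)            # None
```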


Fatih Aktaş


Hi Fatih, thanks heaps for your input, but the spaces are irregular, as shown in my example. – XYZ Mar 21 '19 at 02:53
