
I need to find repetitions in a text string. I already found a very nice, elegant solution here from @Tim Pietzcker.

I am happy with the solution as is, but I would like to know whether it's possible to extend it a little further so that it accepts a string containing whitespace.

For example, "a bcab c" would return [('abc', 2)].

I tried using the regex pattern "([^\s]+?)\1+" with no luck. Any help is much appreciated.

muru
XYZ
  • If in Python, you could simply do `no_whitespaces = input_str.replace(" ","")` and then do your regex on `no_whitespaces` – e.s. Mar 21 '19 at 01:15
  • Hi e.s, that is one possibility, but my application is to find the patterns in a bigger text structure. So whenever possible I would like to keep the spaces, because I am planning to highlight the found text once the match is made – XYZ Mar 21 '19 at 02:54
  • If you want to highlight the found text once the match is made, as per your above example the output should be [(a bc,2)] ? If not, how are you going to highlight the text once the match is made? – sanooj Mar 21 '19 at 05:00

2 Answers


Consider removing the whitespace from the text first. You can do that with a regex as well.

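>>> import re
>>> def repetitions(s):
...    r = re.compile(r"(.+?)\1+")
...    # Strip all whitespace before searching for repeated substrings.
...    for match in r.finditer(re.sub(r'\s+',"",s)):
...        # Integer division keeps the repeat count an int on Python 3.
...        yield (match.group(1), len(match.group(0))//len(match.group(1)))
... 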

Output:

>>> list(repetitions("a bcab c"))
[('abc', 2)]

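If you still want to retain the spaces in the original text, try this regex: r"(\s*\S+\s*?\S*?)\1+". It has limitations, though: because \1 must repeat the captured text exactly, the whitespace has to sit at the same position in every repetition, so irregular spacing such as "a bcab c" is not matched.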

>>> def repetitions(s):
...    r = re.compile(r"(\s*\S+\s*?\S*?)\1+")
...    for match in r.finditer(s):
...        # Integer division keeps the repeat count an int on Python 3.
...        yield (match.group(1), len(match.group(0))//len(match.group(1)))
... 

Results:

>>> list(repetitions(" abc abc "))
[(' abc', 2)]
>>> list(repetitions("abc abc "))
[('abc ', 2)]
>>> list(repetitions(" ab c ab c "))
[(' ab c', 2)]
>>> list(repetitions("ab cab c "))
[('ab c', 2)]
>>> list(repetitions("blablabla"))
[('bla', 3)]
sanooj
  • Thanks Sanooj, I ended up removing the spaces and then matching the group back in the original text with a newly compiled regex that allows spaces. For example, the match "abc" is fed into a new regex built with "\s*".join('abc'). Thanks heaps for your time again. – XYZ Mar 21 '19 at 23:13
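
The approach described in the comment above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the function name `repetitions_with_spans` is hypothetical, and joining the whole repeated run (rather than a single unit) with `\s*` is an assumption made so that the returned span covers everything to highlight.

```python
import re

def repetitions_with_spans(text):
    """Yield (unit, count, span) where span indexes into the original text."""
    # Step 1: find repeats on a whitespace-free copy of the text.
    stripped = re.sub(r"\s+", "", text)
    pos = 0  # where to resume searching in the original text
    for match in re.finditer(r"(.+?)\1+", stripped):
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        # Step 2: rebuild the run, allowing optional whitespace between
        # every pair of characters (assumption: we want the whole run).
        spaced = r"\s*".join(re.escape(ch) for ch in unit * count)
        found = re.compile(spaced).search(text, pos)
        if found:
            pos = found.end()
            yield unit, count, found.span()

print(list(repetitions_with_spans("a bcab c")))
```

The span can then be used to slice or highlight the original string with its spaces intact. Resuming the search from the end of the previous hit is a simplification and may need refinement for texts with many overlapping candidates.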

Using (\S+ ?\S?)\1, you can make the pattern tolerate spaces in strings like the one below, where the space sits at the same position in each repetition of ab c.

ab cab c 

However, if the spaces are not at the same position in each repetition, you have to replace the meaningless spaces with an empty string "" before you can find the repeated words with your approach.
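
A small check of this behaviour, using only the pattern quoted in this answer and the two example strings from this page (the variable names are illustrative):

```python
import re

# Pattern from this answer: the repeated unit may contain one optional space.
pattern = re.compile(r"(\S+ ?\S?)\1")

# Spaces at the same position in both repetitions.
same_positions = pattern.search("ab cab c ")
# Spaces at different positions (the example from the question).
different_positions = pattern.search("a bcab c")

print(same_positions.group(1))   # the repeated unit, space included
print(different_positions)       # None, since no exact repeat exists
```

The second search finds nothing because the backreference \1 only matches an exact copy of the captured text, spaces included.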

Fatih Aktaş
  • Hi Fatih, thanks heaps for your input, but the spaces are irregular, as shown in my example – XYZ Mar 21 '19 at 02:53