6

Having some arbitrary string such as

hello hello hello I am I am I am your string string string string of strings

Can I somehow find repetitive sub-strings delimited by spaces(EDIT)? In this case it would be 'hello', 'I am' and 'string'.

I have been wondering about this for some time but I still can not find any real solution. I also have read some articles concerning this topic and hit up on suffix trees but can this help me even though I need to find every repetition e.g. with repetition count higher than two?

If it is so, is there some library for python, that can handle suffix trees and perform operations on them?

Edit: I am sorry I was not clear enough. So just to make it clear - I am looking for repetitive sub-strings, that means sequences in string, that, for example, in terms of regular expressions can be substituted by + or {} wildcards. So If I would have to make regular expression from listed string, I would do

(hello ){3}(I am ){3}your (string ){4}of strings 
Jendas
  • 3,359
  • 3
  • 27
  • 55
  • possible duplicate of [Find longest repetitive sequence in a string](http://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string) – fsw Aug 31 '13 at 18:21
  • I think that so. I have actually read that question before I have posted this and I did not came up with any idea, how to convert the solution to be suitable for my problem. – Jendas Aug 31 '13 at 18:38
  • True, I was focusing only on the output I really wanted. Sorry about that. – Jendas Aug 31 '13 at 18:50
  • 1
    Not sure if you mean something like [this](http://regex101.com/r/iC3sQ4), the result is in the named group "result". Of course `of strings` is missing ... – HamZa Aug 31 '13 at 18:52
  • @Hyperboreus ``m I a`` isn't a repetitive sub-string **delimited by spaces** – eyquem Sep 01 '13 at 11:11
  • @eyquem Please look more closely: My question if from 20 hours ago, and the edit to the post is from 4 hours ago. – Hyperboreus Sep 01 '13 at 15:27
  • @Jendas It is not advisable to change the question after you get answers. Doing so leads SO ad absurdum. – Hyperboreus Sep 01 '13 at 15:29
  • @Hyperboreus OK. Excuse me. I see you then have deleted your answer and your comment. I will delete my own comments after you have read this one – eyquem Sep 01 '13 at 16:08
  • My apologies. I did not mean to cause any misunderstanding. I won't do that again. – Jendas Sep 01 '13 at 16:31
  • @Jendas Just bear in mind that a lot of people search for solutions on SO, not by asking questions, but by reading the answers to questions of other users. This value gets lost by changing the question and hence making the answers unrelated. – Hyperboreus Sep 01 '13 at 17:44

1 Answers1

3

To find two or more characters that repeat two or more times, each delimited by spaces, use:

(.{2,}?)(?:\s+\1)+

Here's a working example with your test string: http://bit.ly/17cKX62

EDIT: made quantifier in capture group reluctant by adding ? to match shortest possible match (i.e. now matches "string" and not "string string")

EDIT 2: added a required space delimiter for cleaner results

Ray Waldin
  • 3,217
  • 1
  • 16
  • 14
  • 1
    Works for his case, but I would make the .{2,} non-greedy otherwise it will match "a a " in "a a a a b". – jaytea Sep 01 '13 at 07:06
  • Right. As it is, it is matching "string string", not "string" – Ray Waldin Sep 01 '13 at 07:09
  • Wow, works like magic! Just before I accept your answer, would you mind explaining the regular expression a little bit? I understand why we have (.{2,}?), but the following bracket? "?:" means do not remember, \s+ is clear enough but \1 ? Does this say "Take what you have found from group No.1 and find it again?" – Jendas Sep 01 '13 at 10:26
  • (?: ...) is a non capturing group. It's just like (...) except the match is not remembered or accessible. In this case either (?:...) or (...) would have worked, but out of habit, I always make groups non-capturing if I don't need to capture them. The first group (.{2,}?) is captured and the backreference to it (\1) ensures that only strings that are repeated are matched. Here's a tutorial on groups and capturing and backreferences: http://www.regular-expressions.info/brackets.html – Ray Waldin Sep 01 '13 at 17:26