3

I have a sentence like so:

s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye foo"

I would like to find all the consecutive repetitions of sequences of words and the number of times each sequence is repeated. For the example above:

[('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]

I have a solution that almost works for words of only one character based on regexp but I can't extend it to the case of real words:

def count_repetitions(sentence):
    return [(list(t[0]),''.join(t).count(t[0])) for t in re.findall(r'(\w+)(\1+)', ''.join(sentence))]

 l=['x', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'g', 'h', 'i', 'i', 'i', 'i', 'a', 'b', 'c', 'd']
 count_repetitions(sentence)
 >>> [(['a', 'b', 'c'], 3), (['g', 'h'], 2), (['i', 'i'], 2)]

Note that i would like (['i'], 4) for the last element.

Each word is separated by a space character.

Thomas Reynaud
  • 966
  • 3
  • 8
  • 19
  • 6
    It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem individually. A good way to show this effort is to include a [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve). Check the [intro tour](https://stackoverflow.com/tour) you finished before posting, especially [How to Ask](http://stackoverflow.com/help/how-to-ask). – Prune Aug 13 '18 at 16:41
  • 1
    Looking forward to it. This is a non-trivial problem from foundations of computing; the simpler grammar forms (e.g. regex) *cannot* handle repetitions of arbitrary length. A Turing machine has the capability, but it's not locally straightforward. – Prune Aug 13 '18 at 16:45

2 Answers2

6

This can be accomplished with a regex with the help of capturing groups.

You can in general catch a repeated pattern with a regexp looking like this: (pattern)\1+. What this does is recursively tries to match a pattern that is followed by itself at least once.

To adapt it to your problem, we only need to take into account that you want words to be separated by a space character. Here is our new regexp: \b((.+?)(?:\s\2)+).

(        # open a group to capture the whole expression, GROUP 1
  (      # open a group to capture the repeated token, GROUP 2
    \b   # boundary metacharacters ensure the token is a whole word
    .+?  # matches anything non-greedily
    \b
  )
  (?:    # open a non-capturing group for the repeated terms
    \s   # match a space
    \2   # match anything matched by GROUP 2
  )+     # match one time or more
 )

Then using re.findall we can find all such patterns and evaluate their number of repetition.

Code

import re

def find_repeated_sequences(s):
    match = re.findall(r'((\b.+?\b)(?:\s\2)+)', s)
    return [(m[1], int((len(m[0]) + 1) / (len(m[1]) + 1))) for m in match]

Note: the formula (len(m[0]) + 1) / (len(m[1]) + 1) assumes the text is only single-spaced and comes from solving the equation:

lengthtotal = count x (lengthel + 1) - 1

Example

s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye"
print(find_repeated_sequences(s))

Output

[('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]
Olivier Melançon
  • 21,584
  • 4
  • 41
  • 73
  • That's exactly what I was looking for, thanks a lot! – Thomas Reynaud Aug 14 '18 at 10:00
  • After further testing I found this problematic behavior. For instance the string `"button on Internet Explorer"` returns `[('on', 2)]`. How can I modify your solution so it only looks for words (all separated by a single space)? Same problem for single repeating characters: `"Sorry you're having troubles"` returns `[('y', 2)]`. Sorry I didn't think about this edge case in my question.. – Thomas Reynaud Aug 16 '18 at 10:17
  • Looks like `r'((\b.+?)(?:\s\2\b)+)'` does the job, I'm not sure it's the cleanest solution. – Thomas Reynaud Aug 16 '18 at 10:37
  • @ThomasReynaud This is a very good catch, you are correct that using the \b metacharacter is the way to solve it. Although, you should use it in the group 2 to ensure that backtrackign is only done on whole words, see udpated answer – Olivier Melançon Aug 16 '18 at 12:30
-2

Assuming that every word in the string is split by space

stringList = s.split(" ")
stringOccurrence = {}
for index in range(0, len(stringList)):


    if stringList[index] not in stringOccurrence.keys():
        stringOccurrence[stringList[index]] = [index]

    else:
        val =  stringOccurrence[stringList[index]]
        val.append(index)
print(stringOccurrence)

would give:

{'I': [4],
 'a': [6, 9],
 'am': [5],
 'big': [7, 10],
 'bye': [14, 15, 16, 17],
 'foo': [0, 18],
 'hello': [1, 2, 3, 12, 13],
 'mushroom': [8, 11]}

and now you iterate through the key, list of value pairs, and look for consecutive numbers:

The below code was obtained from user:39991 (truppo) from this question What this code does is able to find the consecutive integers in a list. Since our value is a list of integers, we pass each value to this function to identify the consecutive portions of it.

def group(L):
    first = last = L[0]
    for n in L[1:]:
        if n - 1 == last: # Part of the group, bump the end
            last = n
        else: # Not part of the group, yield current group and start a new
            yield first, last
            first = last = n
    yield first, last # Yield the last group

Create a set that stores the result

resultSet = set()
for key, value in stringOccurrence.items():

    for tup in list(group(value)):

        resultSet.add((key, tup[1] - tup[0] + 1))

print(resultSet)

should give you:

{('bye', 4), 
 ('am', 1), 
 ('foo', 1), 
 ('I', 1), 
 ('hello', 3), 
 ('hello', 2), 
 ('a', 1), 
 ('mushroom', 1), 
 ('big', 1)}
ababuji
  • 1,683
  • 2
  • 14
  • 39