Finding repetitions of a string by length

Question

I have a string of letters similar to that shown below:

'ABTSOFDNSOHASAPMAPDSNFAKSGMOMAPEPTNSNTROMAPKSDFANSDHASOMAPDODDFG'

I am treating this as a cipher text and therefore want to begin to find the position of repetitions in order to find the length of the encryption key (the example above is random so no direct answers will come from it)

For now what I want to be able to do is write a code that can find repetitions of length 3 - for example 'MAP' and 'HAS' are repeated. I want the code to find these repetitions as opposed to me having to specify the substring it should look for.

Previously I have used:

text.find("MAP")

Using the answer below I have written:

substring = []
for i in range(len(Phrase)-4):
    substring.append(Phrase[i:i+4])
    
for index, value in freq.iteritems():
    if value > 1:
        for i in range(len(Phrase)-4):
            if index == Phrase[i:i+4]:
                print(index)

This gives a list of each repeated substring as many times as it appears, ideally I want this to just be a list of the substring with the positions it appears in

`'qssssssq'` - how many repetitions of length three are in that string? — wwii, Feb 22 '22 at 16:53
Does [Efficiently find repeated characters in a string](https://stackoverflow.com/questions/25706136/efficiently-find-repeated-characters-in-a-string) answer your question? — wwii, Feb 22 '22 at 16:55
@wwii I found this question and struggled to apply the logic where they are looking at individual characters and I want substrings of a specific length — thomasmaths, Feb 22 '22 at 17:19
@wwii I have also edited the question - thank you for the links, hopefully this makes more sense — thomasmaths, Feb 22 '22 at 17:26

Edoardo Berardo · Accepted Answer · 2022-02-22T19:19:06.823

0

Here what I did :)

import pandas as pd

# find frequency of each length 3 substring
Phrase    = "Maryhadalittlarymbada"
substring = []
for i in range(len(Phrase)-3):
    substring.append(Phrase[i:i+3])
Frequency  = pd.Series(substring).value_counts()

# find repetition's position in string
for index, value in Frequency.iteritems():
    positions = []
    if value > 1:
        for i in range(len(Phrase)-3):
            if index == Phrase[i:i+3]:
                positions.append(i)
        print(index, ": ", positions)
    else:
        continue

edited Feb 22 '22 at 19:19

answered Feb 22 '22 at 17:34

Edoardo Berardo

142
2
9

Thank you for this response! Would this not provide a list of all positions? Not the specific locations of the repetitions? – thomasmaths Feb 22 '22 at 17:41
Ideally I would want the substrings to be identified along with their specific positions – thomasmaths Feb 22 '22 at 17:59
I moved "positions" above the if statemen, so that it prints positions for each specific substrings. Is this what you mean? – Edoardo Berardo Feb 22 '22 at 19:21

wwii · Answer 2 · 2022-02-23T16:50:54.223

Here is a solution using only built-ins

import itertools, collections
text = 'ABTSOFDNSOHASAPMAPDSNFAKSGMOMAPEPTNSNTROMAPKSDFANSDHASOMAPDODDFG'

Make a function that will produce overlapping chunks of three - inspired by the pairwise function.

def three_at_a_time(text):
    '''Overlapping chunks of three.

    text : str
    returns generator
    '''
    a,b,c = itertools.tee(text,3)
    # advance the second and third iterators
    next(b)
    next(c)
    next(c)
    return (''.join(t) for t in zip(a,b,c))

Make a dictionary with the position(s) of each chunk.

triples = enumerate(three_at_a_time(text))
d = collections.defaultdict(list)
for i,triple in triples:
    d[triple].append(i)

Filter the dictionary for chunks that have more than one position.

# repeats = itertools.filterfalse(lambda item: len(item[1])==1,d.items())
repeats = [(k,v) for k,v in d.items() if len(v)>1]

Example:

>>> for chunk in repeats:
...     print(chunk) 
... 
('HAS', [10, 51])
('MAP', [15, 28, 40, 55])
('OMA', [27, 39, 54])
('APD', [16, 56])
>>>

Finding repetitions of a string by length

2 Answers2