Finding most sequences of specified length

Question

I'm trying to write Python code that takes a string and a length and reports which substring of that length occurs most often in the string, preferring the earliest one if there's a tie.

For example, the input "cadabra abra" 2 should return ab.

I tried:

import sys

def main():
    inputstring = str(sys.argv[1])
    length = int(sys.argv[2])
    Analyze(inputstring, length)    


def Analyze(inputstring, length):
    count = 0;
    runningcount = -1;
    sequence = ""
    substring = ""
    for i in range(0, len(inputstring)):    
        substring = inputstring[i:i+length]
        for j in range(i+length,len(inputstring)):
            #print(runningcount)
            if inputstring[j:j+2] == substring:
                print("runcount++")
                runningcount += 1
                print(runningcount)         
                if runningcount > count:
                    count = runningcount
                    sequence = substring


    print(sequence)             


main()

But I can't seem to get it to work. I know I'm doing something wrong with the counts at least, but I'm not sure what. This is also my first program in Python, though I think my problem is more with the algorithm than the syntax.

why `if inputstring[j:j+2] == substring:` ? shouldn't be `if inputstring[j:j+length] == substring:` instead? — Iron Fist, Feb 06 '16 at 04:47
Check out this answer to a similar question: http://stackoverflow.com/a/14670769/1795128 . Using the Counter class simplifies the problem a lot — Zack Graber, Feb 06 '16 at 04:47
Yea it would be j+length, thanks. I'll try taking a look at counter class, but I was trying to do this without researching too much python specific stuff yet — Austin, Feb 06 '16 at 04:50
Do you count overlapping strings? E.g., should `'aaabbcbb' 2` return `'aa'` (occurring twice counting the overlap, and beating `'bb'` by occurring earlier), or should it return `'bb'`? Since you accepted Iron Fist's answer below, it looks like you do want to count the overlap, but that's not clear from the problem description. — gil, Feb 06 '16 at 06:46
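
For reference, here is a minimal sketch of how the nested-loop approach from the question might be repaired without any Python-specific helpers. It assumes overlapping matches should be counted and that `length` is at least 1. The names `best_count` and `best_sequence` are illustrative, not from the original post. The main changes are resetting the running count for each candidate, comparing slices of `length` characters rather than 2, and scanning the whole string for each candidate.

```python
def analyze(inputstring, length):
    best_count = 0
    best_sequence = ""
    # Only consider start positions that leave room for a full window.
    last_start = len(inputstring) - length
    for i in range(last_start + 1):
        substring = inputstring[i:i + length]
        running_count = 0  # reset for every candidate substring
        for j in range(last_start + 1):
            if inputstring[j:j + length] == substring:
                running_count += 1
        # Strict ">" keeps the earliest substring when counts tie.
        if running_count > best_count:
            best_count = running_count
            best_sequence = substring
    return best_sequence


print(analyze("cadabra abra", 2))
```

This does quadratic work in the length of the input, which is fine for short strings but slower than the dictionary-based answers for long ones.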

score 2 · Accepted Answer · edited May 23 '17 at 12:31

Try using the built-in methods; they will make your life easier. For example:

>>> s = "cadabra abra"
>>> x = 2
>>> l = [s[i:i+x] for i in range(len(s)-x+1)]
>>> l
['ca', 'ad', 'da', 'ab', 'br', 'ra', 'a ', ' a', 'ab', 'br', 'ra']
>>> max(l, key=lambda m:s.count(m))
'ab'

EDIT:

Much simpler syntax, as per Stefan Pochmann's comment:

>>> max(l, key=s.count)
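
One caveat, related to gil's question about overlaps: `str.count` counts non-overlapping occurrences, so the approach above can give a different answer from an overlap-aware count when a substring overlaps itself. The sketch below contrasts the two. The helper names `count_overlapping`, `most_frequent_nonoverlapping`, and `most_frequent_overlapping` are hypothetical, and the code assumes `1 <= n <= len(s)`.

```python
def count_overlapping(s, sub):
    # Count every start position where sub matches, overlaps included.
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))


def most_frequent_nonoverlapping(s, n):
    windows = [s[i:i + n] for i in range(len(s) - n + 1)]
    # str.count skips past each match, so "aaa".count("aa") is 1.
    return max(windows, key=s.count)


def most_frequent_overlapping(s, n):
    windows = [s[i:i + n] for i in range(len(s) - n + 1)]
    return max(windows, key=lambda w: count_overlapping(s, w))


print(most_frequent_nonoverlapping("aaabbcbb", 2))  # 'bb'
print(most_frequent_overlapping("aaabbcbb", 2))     # 'aa'
```

In both functions `max` returns the first window with the highest key, so ties still go to the earliest substring. For "cadabra abra" the two versions agree.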

edited May 23 '17 at 12:31

Community


answered Feb 06 '16 at 05:00

Iron Fist


`max(l, key=s.count)` – Stefan Pochmann Feb 06 '16 at 05:34
Wow this way was much more simple – Austin Feb 06 '16 at 05:38
Third line should be: `l = [s[i:i+x] for i in range(len(s) - x + 1)]` to adjust the range when `x` changes. – RootTwo Feb 06 '16 at 06:32
You cannot omit the `key=` in `key=s.count`; it's a keyword-only argument. – gil Feb 06 '16 at 06:32

John Gordon · Answer 2 · 2016-02-06T05:23:15.783

1

import sys
from collections import OrderedDict

def main():
    inputstring = sys.argv[1]
    length = int(sys.argv[2])
    analyze(inputstring, length)

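def analyze(inputstring, length):
    # An OrderedDict remembers insertion order, so the substrings stay in
    # the order in which they first appear in the input.
    d = OrderedDict()
    # Slide a window of the given length over every valid start position.
    for i in range(0, len(inputstring) - length + 1):
        substring = inputstring[i:i+length]
        if substring in d:
            d[substring] += 1
        else:
            d[substring] = 1
    # Highest number of occurrences of any substring.
    maxcount = max(d.values())
    # Print the first substring that reaches that count, which resolves
    # ties in favor of the earliest one.
    for k, v in d.items():
        if v == maxcount:
            print(k)
            break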

main()

edited Feb 06 '16 at 05:23

answered Feb 06 '16 at 05:06

John Gordon


for my sample input this returns the length not the sequence. also I think the print is missing a ) – Austin Feb 06 '16 at 05:17
Maybe I'm doing something wrong, but this isn't returning anything for me when I use my sample input – Austin Feb 06 '16 at 05:34
I used it with your sample input at the top and it works for me. What input are you using? – John Gordon Feb 06 '16 at 05:40
Hmm I'm using CodeRunner with my arguments set to: `"cadabra abra" 2` – Austin Feb 06 '16 at 05:44
Ok it's working, don't know what I did wrong the first time. Thanks a lot – Austin Feb 06 '16 at 05:45

score 0 · Answer 3 · answered Feb 06 '16 at 06:51

Pretty good stab at a solution for a first Python program. As you learn the language, spend some time reading the excellent documentation. It is full of examples and tips.

For example, the standard library includes a Counter class for counting things (obviously) and an OrderedDict class which remembers the order in which keys are entered. But the documentation includes an example that combines the two to make an OrderedCounter, which can be used to solve your problem like this:

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    pass

def analyze(s, n):
    substrings = (s[i:i+n] for i in range(len(s)-n+1))
    counts = OrderedCounter(substrings)
    return max(counts.keys(), key=counts.__getitem__)

analyze("cadabra abra", 2)
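
As a side note, on Python 3.7 and later regular dictionaries preserve insertion order, and `Counter` is a `dict` subclass, so a plain `Counter` may be enough there. The following is a sketch under that version assumption, not a replacement for the recipe above. Returning `None` when `n` is longer than the string is a choice made here for illustration.

```python
from collections import Counter


def analyze(s, n):
    # Count every window of length n; keys keep first-seen order (3.7+).
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    if not counts:
        return None  # no window of length n fits in s
    # most_common orders equal counts by first occurrence.
    substring, _count = counts.most_common(1)[0]
    return substring


print(analyze("cadabra abra", 2))
```

On older Python versions the ordering of a plain `Counter` is not guaranteed, which is where the OrderedCounter combination is still needed.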
