1

I can't work out how to count the occurrences of every substring of length n in a string. For example, if the string is

CCCATGGTtaGGTaTGCCCGAGGT

and n is

3

The output should be something like:

'CCC': 2, 'GGT': 3

The input is a list of lists, so I iterate over every string in each list, but I can't get any further. The output should be a dict covering all the strings.

Code:

def get_all_n_repeats(n,sq_list):
    reps={}
    for i in sq_list:
        if not i:
            continue
        else:
            for j in i:
                ........#Here the code I want to do#......
    return reps
Teshtek
  • Why is it `GGT` and not `GTt`? – Burhan Khalid May 29 '16 at 19:52
  • You need to at least show something you have tried. – totoro May 29 '16 at 19:55
  • Your output and your input don't make sense. If you split your input string into three letter strings, you get `['CCC', 'ATG', 'GTt', 'aGG', 'TaT', 'GCC', 'CGA', 'GGT']` so I don't know where you got `GGT` in your output. – Burhan Khalid May 29 '16 at 19:57
  • What is so unclear about this question? It makes perfect sense. – Jivan May 29 '16 at 20:03
  • @BurhanKhalid I think his candidates are `['CCC', 'CCA', 'CAT', 'ATG', 'TGG', 'GGT', 'GTt', 'Tta', 'taG', 'aGG', 'GGT', 'GTa', 'TaT', 'aTG', 'TGC', 'GCC', 'CCC', 'CCG', 'CGA', 'GAG', 'AGG', 'GGT']`. – totoro May 29 '16 at 20:04
  • @BurhanKhalid It is case-sensitive, so `GGT` and not `GTt`. You don't split the string into three-letter chunks and check whether they are equal; you have to look for repeated substrings, so overlapping combinations count. @Nightcrawler, you are right. – Teshtek May 29 '16 at 21:20

3 Answers

2

A really simple solution:

from collections import Counter

st = "CCCATGGTtaGGTaTGCCCGAGGT"
n = 3

tokens = Counter(st[i:i+n] for i in range(len(st) - n + 1))
print(tokens.most_common(2))

After that, it is up to you to turn it into a helper function.
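One way such a helper might look, as a minimal sketch rather than a definitive implementation. It assumes the input is a list of lists of strings (as the question describes), that counts are pooled across all strings, and that only substrings seen at least `min_count` times should be kept; the `min_count` parameter is hypothetical and not part of the question.

```python
from collections import Counter


def get_all_n_repeats(n, sq_list, min_count=2):
    """Count overlapping n-length substrings across a list of lists of strings."""
    counts = Counter()
    for group in sq_list:      # empty inner lists simply contribute nothing
        for seq in group:
            # Slide a window of width n over the string, one character at a time.
            counts.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Assumption: keep only substrings that repeat (min_count is hypothetical).
    return {sub: c for sub, c in counts.items() if c >= min_count}


result = get_all_n_repeats(3, [["CCCATGGTtaGGTaTGCCCGAGGT"], []])
print(result)  # {'CCC': 2, 'GGT': 3}
```

Matching is case-sensitive, so `GTt` and `GTa` are counted separately from `GGT`. Strings shorter than `n` produce no windows and are skipped.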

DevLounge
1

A very explicit solution:

s = 'CCCATGGTtaGGTaTGCCCGAGGT'
n = 3
# All possible n-length strings
l = [s[i:i + n] for i in range(len(s) - (n - 1))]
# Count their distribution
d = {}
for e in l:
    d[e] = d.get(e, 0) + 1
print(d)
totoro
0

Use Counter

from collections import Counter

def count_occurrences(text, n):
    # Collect every overlapping n-length substring
    candidates = [text[i:i + n] for i in range(len(text) - n + 1)]

    # Keep only the substrings that occur more than once
    output = {}
    for k, v in Counter(candidates).items():
        if v > 1:
            output[k] = v
    return output

st = "CCCATGGTtaGGTaTGCCCGAGGT"
n = 3

print(count_occurrences(st, n))
# {'CCC': 2, 'GGT': 3}
Jivan