Count specific groups of letters in python without excluding them

Question

I'm trying to find a way to have Python count occurrences of a substring in a string in a way that differs from the usual str.count("X").

Here is my general example. My variable is dna = "AAAAA". My goal is to recognize each occurrence of "AA" in the string. When I run dna.count("AA") and print the result, I get the predictable result of 2.

However, the result I am looking for is an output of 4. Here is an image to show visually what I am saying. (I would insert the image, but I do not have the 10 reputation required, so I must post a link) https://docs.google.com/drawings/d/16IGo3hIstcNEqVid8BI6uj09KX4MWWAzSuQcu8AjSu0/edit?usp=sharing

I have been unable to find a satisfactory solution to this problem elsewhere, probably because I'm not sure what to call my problem. EDIT: I was informed this is counting overlapping substrings.

The matter becomes more complicated, as the full program will not have a single repeated letter in the string, but rather 4 letters (ATCG) repeated at random for undetermined lengths. Here is an example: dna = "AtcGTgaTgctagcg". I would need the script to output how many pairs of AT, TC, CG, TG, etc. exist, while moving one letter at a time to the right.

Thank you.

This might help: http://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp — Azee, May 17 '15 at 18:02

Stefan Pochmann · Answer 1 · 2015-05-17T18:37:24.283

For the easiest case, pairs:

dna = 'AtcGTgaTgctagcg'.upper()
from collections import Counter
for (a, b), ctr in Counter(zip(dna, dna[1:])).items():
    print(a + b, ctr)

Prints:

CT 1
TC 1
TA 1
GA 1
AG 1
CG 2
GC 2
GT 1
AT 2
TG 2

For the more general case of an arbitrary chosen length:

dna = 'AtcGTgaTgctagcg'.upper()
length = 2

from collections import Counter
counts = Counter(dna[i:i+length] for i in range(len(dna) - length + 1))

for string, count in counts.items():
    print(string, count)

And one that counts every substring, since you said "undetermined lengths":

dna = 'AtcGTgaTgctagcg'.upper()

from collections import Counter
counts = Counter(dna[i:j+1] for j in range(len(dna)) for i in range(j+1))

for string, count in counts.items():
    print(string, count)
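
For the single-pattern case from the question (counting "AA" in "AAAAA"), a minimal sketch could use str.find and step forward one character after each hit, so overlapping matches are counted too. The helper name count_overlapping is hypothetical and not part of the answer above.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps (hypothetical helper)."""
    if not sub:
        raise ValueError("sub must be non-empty")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are still found.
        start = text.find(sub, start + 1)
    return count


print(count_overlapping("AAAAA", "AA"))  # 4, whereas "AAAAA".count("AA") gives 2
print(count_overlapping("AtcGTgaTgctagcg".upper(), "TG"))  # 2
```

Unlike the Counter versions, this looks at one pattern at a time, which may be enough when only a few specific pairs matter.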

score 0 · Answer 2 · answered May 17 '15 at 18:12

The simplest algorithm using slices:

checks = ['AA']
string = 'AAAAA'
before = string[0:1]  # previous character, starting with the first one
for letter in string[1:]:
    # Join the previous and the current character and test the pair.
    if before + letter in checks:
        print("found = " + before + letter)
    before = letter

output:

found = AA
found = AA
found = AA
found = AA

score 0 · Answer 3 · edited May 23 '17 at 10:26

The answer linked to above in the comments (How to find overlapping matches with a regexp?) is probably the most efficient way to do this.
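
The regexp approach from that linked question can be sketched roughly as follows, assuming the usual zero-width lookahead trick. The function name count_with_lookahead is hypothetical, and the sketch makes no claim about speed.

```python
import re

dna = "AAAAA"
# The lookahead (?=AA) matches at every position where "AA" starts without
# consuming any characters, so overlapping occurrences are all reported.
overlapping = len(re.findall(r"(?=AA)", dna))
print(overlapping)  # 4


def count_with_lookahead(text, sub):
    """Count overlapping occurrences of a literal substring (hypothetical helper)."""
    # re.escape keeps characters such as "." or "*" in sub from acting as regex syntax.
    return len(re.findall("(?=" + re.escape(sub) + ")", text))


print(count_with_lookahead("AtcGTgaTgctagcg".upper(), "CG"))  # 2
```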

That said, there's also nothing wrong with a generator expression and a collections.Counter in this situation, in my opinion.

import collections

seq_length = 2
string = "atctatcta"
# Slide a window of seq_length over the string; the + 1 includes the final window.
counts = collections.Counter(string[i:i + seq_length] for i in range(len(string) - seq_length + 1))

print(counts)  # Counter({'at': 2, 'tc': 2, 'ct': 2, 'ta': 2})
print(counts["ct"])  # 2
print(counts["ta"])  # 2

collections.Counter takes an iterable (e.g. a generator expression) and returns the frequency of each of its items as a dict-like mapping. The generator expression here uses slicing to produce each substring of length seq_length lazily, one at a time.
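
As a small illustration of that dict-like behaviour, here is a sketch using the DNA string from the question. The variable names are chosen for this example only.

```python
from collections import Counter

dna = "AtcGTgaTgctagcg".upper()
length = 2
# One window per start position; the + 1 makes sure the last window is included.
pair_counts = Counter(dna[i:i + length] for i in range(len(dna) - length + 1))

print(pair_counts["AT"])          # 2
print(pair_counts["AA"])          # 0: a missing key counts as zero, no KeyError
print(sum(pair_counts.values()))  # 14 windows in total
```

Because missing keys return 0, pairs that never occur can be looked up without first checking whether they are present.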

Shashank · Answer 4 · 2015-05-17T20:12:44.377

from collections import Counter

dnaseq = 'AtcGTgaTgctagcg'
dnaseq_upper = dnaseq.upper()
it1 = iter(dnaseq_upper)
it2 = iter(dnaseq_upper)
next(it2)
adjacent_pairs_iterator = (''.join(pair) for pair in zip(it1, it2))
cntr = Counter(adjacent_pairs_iterator)
output_generator = ('{}: {}'.format(ss, cnt) for ss, cnt in cntr.items())
print(*output_generator, sep='\n')

This is a low-memory solution using iterators. The memory bottleneck here will be the Counter object, which is hard to avoid.

This outputs:

GT: 1
TA: 1
TC: 1
CT: 1
GA: 1
TG: 2
CG: 2
AG: 1
AT: 2
GC: 2

This is written for Python 3.x, by the way, so if you're on Python 2.x, you'll have to be careful: zip would have to be changed to itertools.izip, cntr.items to cntr.iteritems, the print function will not work without from __future__ import print_function, and perhaps other things as well.
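
The same iterator idea could be extended from pairs to windows of any length. The sketch below is one possible way to do that with a bounded deque. The helper name sliding_windows is hypothetical and is not taken from the answer above.

```python
from collections import Counter, deque
from itertools import islice


def sliding_windows(iterable, k):
    """Yield each run of k consecutive items as a string (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    it = iter(iterable)
    # Fill the first window; maxlen=k keeps only the k most recent items.
    window = deque(islice(it, k), maxlen=k)
    if len(window) == k:
        yield "".join(window)
    for item in it:
        window.append(item)  # the oldest item drops off the left automatically
        yield "".join(window)


dna = "AtcGTgaTgctagcg".upper()
pairs = Counter(sliding_windows(dna, 2))
triples = Counter(sliding_windows(dna, 3))
print(pairs["TG"])            # 2
print(sum(triples.values()))  # 13 windows of length 3
```

Only k characters are held at a time besides the Counter itself, so this also works when the sequence comes from a stream rather than a string already in memory.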
