count the number of a certain triplet in a file (DNA codon analysis)

Question

This question is actually for DNA codon analysis. To put it simply, let's say I have a file like this:
atgaaaccaaag...
and I want to count the number of 'aaa' triplets present in this file. Importantly, the triplets start from the very beginning (which means atg, aaa, cca, aag, ...), so the result should be 1 instead of 2 'aaa' in this example.
Are there any Python or shell script methods to do this? Thanks!

Can it be assumed that there are no errors in the file, i.e. that each set of three letters will always indicate a valid set? — Michael Todd, Sep 26 '12 at 20:57
I'm guessing he is getting FASTA files... they are typically accepted as valid ... — Joran Beasley, Sep 26 '12 at 21:03
Related to / possible duplicate of [Split string by count of characters](http://stackoverflow.com/questions/7111068/split-string-by-count-of-characters) — senderle, Sep 26 '12 at 21:07
Also related: [What is the most “pythonic” way to iterate over a list in chunks?](http://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks) — senderle, Sep 26 '12 at 21:09
Moved this to a comment... You will find BioPython very helpful: http://biopython.org/wiki/Biopython — Bitwise, Sep 26 '12 at 22:18

Joran Beasley · Accepted Answer · 2012-09-26T21:00:37.023

7

First read in the file:

with open("some.txt") as f:
    file_data = f.read()

then split it into 3's

codons = [file_data[i:i+3] for i in range(0,len(file_data),3)]

then count em

print codons.count('aaa')

like so

>>> my_codons = 'atgaaaccaaag'
>>> codons = [my_codons[i:i+3] for i in range(0,len(my_codons),3)]
>>> codons
['atg', 'aaa', 'cca', 'aag']
>>> codons.count('aaa')
1
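
One thing to watch with f.read(): any newline or other whitespace in the file ends up in the string and shifts the reading frame. Below is a minimal Python 3 sketch that strips whitespace before counting. The names clean_sequence and count_in_frame are hypothetical, and skipping lines that start with '>' assumes a FASTA-style header, which the question does not confirm.

```python
def clean_sequence(text):
    """Return the bases in `text` as one string, without whitespace.

    Assumption: lines starting with '>' are FASTA-style headers and are skipped.
    """
    lines = (line for line in text.splitlines() if not line.startswith(">"))
    return "".join("".join(line.split()) for line in lines)


def count_in_frame(seq, codon="aaa"):
    """Count `codon` only at offsets 0, 3, 6, ... of `seq`."""
    return sum(seq[i:i + 3] == codon for i in range(0, len(seq) - 2, 3))


# Hypothetical file content, wrapped across two lines
text = "atgaaacc\naaag\n"
seq = clean_sequence(text)
print(count_in_frame(seq))  # 1: only the in-frame 'aaa' is counted
print(seq.count("aaa"))     # 2: str.count ignores the reading frame
```

The last two lines show why a plain str.count is not enough here: it also finds the 'aaa' that straddles the cca/aag boundary.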

edited Sep 26 '12 at 21:00

answered Sep 26 '12 at 20:55

Joran Beasley


score 2 · Answer 2 · answered Sep 26 '12 at 20:58

The obvious solution is to split the string into 3-character pieces and then count the number of occurrences of "aaa":

>>> s = 'atgaaaccaaag'
>>> [s[i : i + 3] for i in xrange(0, len(s), 3)].count('aaa')
1

If the string is really long, this solution will use memory unnecessarily by creating the full list of substrings. The following version avoids that:

>>> s = 'atgaaaccaaag'
>>> sum(s[i : i + 3] == 'aaa' for i in xrange(0, len(s), 3))
1
>>> s = 'aaatttaaacaaagg'
>>> sum(s[i : i + 3] == 'aaa' for i in xrange(0, len(s), 3))
2

This uses a generator expression instead of creating a temporary list, so it will be more memory efficient. It takes advantage of the fact that True == 1, i.e. True + True == 2.
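
The same idea can be pushed one step further so the whole file never has to sit in memory as a single string. This is a rough Python 3 sketch, not part of the answer above: iter_bases and count_codon_streaming are hypothetical names, and treating lines that start with '>' as headers is an assumption about the input.

```python
def iter_bases(lines):
    """Yield one base at a time from an iterable of lines.

    Assumption: lines starting with '>' are headers and are skipped.
    Only leading and trailing whitespace is stripped from each line.
    """
    for line in lines:
        if line.startswith(">"):
            continue
        for base in line.strip():
            yield base


def count_codon_streaming(lines, codon="aaa"):
    """Count in-frame occurrences of a 3-letter `codon` without building a list."""
    bases = iter_bases(lines)
    target = tuple(codon)
    # zip pulls from the same iterator three times per step, so it yields
    # consecutive, non-overlapping triplets; an incomplete tail is dropped.
    return sum(triplet == target for triplet in zip(bases, bases, bases))


# A codon may be split across lines without affecting the count
print(count_codon_streaming(["at\n", "gaaac\n", "caaag\n"]))  # 1
```

Since an open file object iterates over its lines, passing one in directly (for example count_codon_streaming(f) inside a with open(...) block) should work the same way as the list of strings used here.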

score 1 · Answer 3 · answered Sep 26 '12 at 20:58

You could first break the string into triples, using something like:

def split_by_size(seq, length):
    # Slice seq into consecutive, non-overlapping pieces of the given length
    return [seq[i:i+length] for i in range(0, len(seq), length)]

tripleList = split_by_size('atgaaaccaaag', 3)

Then check each triple against "aaa" and sum up the matches:

print(sum(x == "aaa" for x in tripleList))

score 0 · Answer 4 · answered Sep 26 '12 at 21:56

Using simple shell tools, assuming your FASTA file contains only one sequence. This prints a count for every codon, so look for the line of the codon you want:

grep -v ">"  < input.fa |
tr -d '\n' |
sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' |
tr "#" "\n" |
awk '(length($1)==3)' |
sort |
uniq -c
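
For comparison, roughly the same tally can be written in Python 3 with collections.Counter. This is only a sketch under the same single-sequence assumption; codon_counts is a hypothetical name, and unlike the sed pattern above it does not restrict the alphabet to A, T, G, C and N.

```python
from collections import Counter


def codon_counts(fasta_text):
    """Tally every in-frame codon in a single-sequence FASTA string."""
    # Drop header lines (starting with '>') and join the sequence lines
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    # Step through the sequence three letters at a time; the stop value
    # leaves out an incomplete triplet at the end
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


# Hypothetical FASTA content based on the example in the question
counts = codon_counts(">seq1\natgaaa\nccaaag\n")
print(counts["aaa"])  # 1
```

Like the shell pipeline, this is case sensitive, so 'AAA' and 'aaa' would be tallied separately unless the sequence is lowercased first.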

answered Sep 26 '12 at 21:56

Pierre

