This question is about DNA codon analysis. To put it simply, say I have a file like this:
atgaaaccaaag...
and I want to count the number of 'aaa' triplets in this file. Importantly, the triplets are read from the very beginning (that is, atg, aaa, cca, aag, ...), so the result for this example should be 1, not 2.
Is there a Python or shell-script way to do this? Thanks!
2
-
Can it be assumed that there are no errors in the file, i.e. that each set of three letters will always indicate a valid set? – Michael Todd Sep 26 '12 at 20:57
-
I'm guessing he is getting FASTA files... they are typically accepted as valid ... – Joran Beasley Sep 26 '12 at 21:03
-
Related to / possible duplicate of [Split string by count of characters](http://stackoverflow.com/questions/7111068/split-string-by-count-of-characters) – senderle Sep 26 '12 at 21:07
-
Also related: [What is the most “pythonic” way to iterate over a list in chunks?](http://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks) – senderle Sep 26 '12 at 21:09
-
Yes, I am going to process FASTA files. – Runner Sep 26 '12 at 21:19
-
Moved this to a comment... You will find BioPython very helpful: http://biopython.org/wiki/Biopython – Bitwise Sep 26 '12 at 22:18
4 Answers
7
First, read in the file:
with open("some.txt") as f:
    file_data = f.read()
Then split it into chunks of three:
codons = [file_data[i:i+3] for i in range(0, len(file_data), 3)]
Then count them:
print(codons.count('aaa'))
For example:
>>> my_codons = 'atgaaaccaaag'
>>> codons = [my_codons[i:i+3] for i in range(0,len(my_codons),3)]
>>> codons
['atg', 'aaa', 'cca', 'aag']
>>> codons.count('aaa')
1
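Since the comments mention FASTA input, here is a minimal sketch of the same idea that first drops the header line and the line breaks, which would otherwise shift the reading frame. The helper names `sequence_from_fasta_lines` and `count_codon` are hypothetical, and the sketch assumes a single-record file read in frame from the first base.

```python
def sequence_from_fasta_lines(lines):
    """Join the sequence lines of a single-record FASTA into one lowercase string.

    `lines` can be any iterable of text lines, e.g. an open file handle.
    Assumption: only one record, so every '>' line is simply skipped.
    """
    parts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '>' header lines.
        if not line or line.startswith(">"):
            continue
        parts.append(line.lower())
    return "".join(parts)


def count_codon(sequence, codon):
    """Count occurrences of `codon` among the triplets starting at position 0."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return codons.count(codon)


# Demo with in-memory lines standing in for a file.
demo_lines = [">example\n", "atgaaa\n", "ccaaag\n"]
demo_sequence = sequence_from_fasta_lines(demo_lines)
print(count_codon(demo_sequence, "aaa"))  # prints 1
```

With a real file, something like `with open(path) as f: seq = sequence_from_fasta_lines(f)` should work the same way.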

Joran Beasley
2
The obvious solution is to split the string into 3-character pieces and then count the number of occurrences of "aaa":
>>> s = 'atgaaaccaaag'
>>> [s[i : i + 3] for i in range(0, len(s), 3)].count('aaa')
1
If the string is really long, this solution uses memory unnecessarily by building the full list of substrings. An alternative:
>>> s = 'atgaaaccaaag'
>>> sum(s[i : i + 3] == 'aaa' for i in range(0, len(s), 3))
1
>>> s = 'aaatttaaacaaagg'
>>> sum(s[i : i + 3] == 'aaa' for i in range(0, len(s), 3))
2
This uses a generator expression instead of creating a temporary list, so it is more memory efficient. It takes advantage of the fact that True == 1, i.e. True + True == 2.
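To illustrate why the slicing step matters, the following sketch compares a plain `str.count`, which finds matches at any offset, with the in-frame generator sum. The function name `count_in_frame` is hypothetical and only wraps the expression used above.

```python
def count_in_frame(sequence, codon):
    """Sum booleans (True == 1) over triplets that start at 0, 3, 6, ..."""
    return sum(sequence[i:i + 3] == codon for i in range(0, len(sequence), 3))


s = "atgaaaccaaag"

# str.count matches at any position, so it also finds the 'aaa'
# that straddles the codons 'cca' and 'aag'.
print(s.count("aaa"))            # prints 2

# The in-frame count only compares whole codons.
print(count_in_frame(s, "aaa"))  # prints 1
```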

John Kugelman
1
You could first break the string into triples, using something like:
def split_by_size(sequence, length):
    return [sequence[i:i+length] for i in range(0, len(sequence), length)]
tripleList = split_by_size(sequence, 3)  # sequence holds the DNA string
Then keep only the "aaa" entries and count them:
print(len(list(filter(lambda x: x == "aaa", tripleList))))

dckrooney
0
Using a simple shell pipeline, assuming your FASTA file contains only one sequence:
grep -v ">" < input.fa |
tr -d '\n' |
sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' |
tr "#" "\n" |
awk '(length($1)==3)' |
sort |
uniq -c
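For comparison, a rough Python counterpart of the `sort | uniq -c` tally can be sketched with `collections.Counter`. The name `codon_counts` is hypothetical; the sketch assumes the header and line breaks have already been removed, and unlike the `sed` step it does not filter on the allowed letters.

```python
from collections import Counter


def codon_counts(sequence):
    """Tally every complete in-frame codon; a trailing 1- or 2-base remainder is ignored."""
    usable = len(sequence) - len(sequence) % 3
    return Counter(sequence[i:i + 3] for i in range(0, usable, 3))


counts = codon_counts("atgaaaccaaag")

# Print in sorted order, count first, similar in spirit to `uniq -c`.
for codon, n in sorted(counts.items()):
    print(n, codon)
```

Looking up a single codon is then `counts["aaa"]`, which returns 0 for codons that never occur.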

Pierre