2

I'm working on the problem "Consensus nd Profile" on the Rosalind Bioinformatics website (http://rosalind.info/problems/cons/). I tried my code using the sample input on the website and my output matches the sample output. But when I tried the larger dataset the website said my output is wrong. Could someone help me identify where my problem is? Thank you!

Sample input:

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

I've extracted the dna strings and stored them in a list called strings (my trial with the larger dataset is correct at this step so I omitted my code here):

['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']

My code afterwards:

#convert strings into matrix
matrix = []
for i in strings:
    matrix.append([j for j in i])
M = np.array(matrix).reshape(len(matrix),len(matrix[0]))

M looks like this for sample input:

[['A' 'T' 'C' 'C' 'A' 'G' 'C' 'T']
 ['G' 'G' 'G' 'C' 'A' 'A' 'C' 'T']
 ['A' 'T' 'G' 'G' 'A' 'T' 'C' 'T']
 ['A' 'A' 'G' 'C' 'A' 'A' 'C' 'C']
 ['T' 'T' 'G' 'G' 'A' 'A' 'C' 'T']
 ['A' 'T' 'G' 'C' 'C' 'A' 'T' 'T']
 ['A' 'T' 'G' 'G' 'C' 'A' 'C' 'T']]

My code afterwards:

#convert string matrix into profile matrix
A = []
C = []
G = []
T = []
for i in range(len(matrix[0])):
    A_count = 0
    C_count = 0
    G_count = 0
    T_count = 0
    for j in M[:,i]:
        if j == "A":
            A_count += 1
        elif j == "C":
            C_count += 1
        elif j == "G":
            G_count += 1
        elif j == "T":
            T_count += 1
    A.append(A_count)
    C.append(C_count)
    G.append(G_count)
    T.append(T_count)

profile_matrix = {"A": A, "C": C, "G": G, "T": T}
for k, v in profile_matrix.items():
    print k + ":" + " ".join(str(x) for x in v)

#get consensus string
P = []
P.append(A)
P.append(C)
P.append(G)
P.append(T)
profile = np.array(P).reshape(4, len(A))
consensus = []
for i in range(len(A)):
    if max(profile[:,i]) == profile[0,i]:
        consensus.append("A")
    elif max(profile[:,i]) == profile[1,i]:
        consensus.append("C")
    elif max(profile[:,i]) == profile[2,i]:
        consensus.append("G")
    elif max(profile[:,i]) == profile[3,i]:
        consensus.append("T")
print "".join(consensus)

These codes give the correct sample output:

A:5 1 0 0 5 5 0 0
C:0 0 1 4 2 0 6 1
T:1 5 0 0 0 1 1 6
G:1 1 6 3 0 1 0 0
ATGCAACT

But when I tried the larger dataset the website said my answer was wrong...Could someone point out where I'm wrong? (I'm a beginner, thank you for your patience!)

largecats
  • 195
  • 1
  • 14
  • Make sure your format matches the sample output exactly. That means add a space after the colon on each line, put the consensus string at the top, and have the nucleotides in order of "ACGT". – C_Z_ Jul 26 '16 at 15:27
  • Have you seen this? http://stackoverflow.com/questions/38586800/python-multiple-consensus-sequences/ – Chris_Rands Jul 26 '16 at 15:30
  • @C_Z_ Oh my I didn't notice that before...Thank you for the tip! – largecats Jul 29 '16 at 06:40
  • @Chris_Rands Yes but our problems seem different so I decided to open up a new question...Thank you for the note and I'll keep making sure that I checked previous Q&As before asking my own question in the future. – largecats Jul 29 '16 at 06:42

2 Answers2

1

Your algorithm is totally fine. As @C_Z_ pointed out "make sure your format matches the sample output exactly" which is unfortunately not the case.

print k + ":" + " ".join(str(x) for x in v)

should be

print k + ": " + " ".join(str(x) for x in v)

and come after, not before, the consensus sequence. If you change the order and add the space your answer will get get accepted by rosalind.


Since that's a trivial answer to your question, here is an alternative solution for the same problem without using numpy: Instead of using variable for each nucleotide, use a dictionary. It's not fun to do the same thing with 23 amino acids, e.g.

from collections import defaultdict
for i in range(len(strings[0])):
    counter.append(defaultdict(int))
    for seq in seqs:
        counter[i][seq[i]] += 1
    consensus += max(counter[i], key=counter[i].get)

counter stores a dictionary for each position with all the counts for all bases. The key for the dictionary is the current base.

Maximilian Peters
  • 30,348
  • 12
  • 86
  • 99
  • thank you very much for the answer (and the code)! I tried matching my format but rosalind still says it's wrong...I'm wondering if it could be that my output somehow contains \n and as a result, instead of going all the way till the end of the line (i.e. maximum characters Rosalind allows within a line), my output becomes a block...What do you think? Thank you! – largecats Jul 29 '16 at 06:38
  • Dictionaries are unsorted, so if you iterate through it, the key order is random. You could try: `for n in 'ACGT': print k + ": " + " ".join(str(x) for x in profile_matrix[n])`and thereby enforce the correct order. – Maximilian Peters Jul 29 '16 at 08:28
0

I encountered the same problem. It's because the browser automatically adds "\n" when you paste a long sequence into the text frame. Simply submit a .txt file instead.