How to create a dataset from a sequence file in Python

Question

I have a protein sequence file that looks like this:

>102L:A       MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL       -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX

The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether any coordinates are missing. In this case, notice the two "X" characters at the end. They mean that the last two residues of the sequence, "NL" in this case, are missing coordinates.

Using Python, I would like to generate a table with the following columns:

1) the name of the sequence
2) the total number of missing coordinates (the number of X characters)
3) the range of the missing coordinates (the positions of those X characters)
4) the length of the sequence
5) the actual sequence

So the final result should look like this:

>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

And my code looks like this so far:

total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()

        # Assign the list number
        header = split_list[0]                                # 1
        seq = split_list[1]                                   # 5
        disorder = split_list[2]

        # count sequence length and total residue of missing coordinates
        sequence_length = len(seq)                            # 4

        for x in disorder:
            counts = 0
            if x == 'X':
                counts = counts + 1

        total_seq.append([header, seq, str(counts)])   # obviously I haven't finish coding 2 & 3

with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))

I'm new to Python. Could anyone help, please?

You are doing this to create a table which you can load into R? Why can't you load the sequences into R? Check out the SeqinR package. — wflynny, Jul 11 '14 at 17:03
@jonrsharpe My question is how to create a file that looks like my final result from the sequence file I have in the first place. Sorry for the confusion. — Jlod888, Jul 11 '14 at 17:21
@Bill Not necessarily to load into R. I just want to create a file that looks like my final result. Sorry for the confusion. — Jlod888, Jul 11 '14 at 17:22
@Jlod888 the answer is "write code"; this question is too broad for SO — jonrsharpe, Jul 11 '14 at 17:23

score 0 · Accepted Answer · answered Jul 11 '14 at 17:56

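Here's your modified code. It produces output in your desired layout. Note that find() and rfind() return 0-based indices, so the range for your example comes out as 162-163; add 1 to each position if you want 163-164.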

with open("sample.txt") as infile:
    # Each non-empty line has three whitespace-separated fields:
    # header, sequence, disorder string
    matrix = [line.split() for line in infile if line.strip()]

header_list = [row[0] for row in matrix]
seq_list = [row[1] for row in matrix]
disorder_list = [row[2] for row in matrix]

# 'a' appends to new_sample.txt if it already exists
with open('new_sample.txt', 'a') as f:
    for header, seq, disorder in zip(header_list, seq_list, disorder_list):
        # length of the sequence
        sequence_length = len(seq)

        # total number of missing coordinates
        num_missing = disorder.count('X')

        # range of the missing coordinates (0-based; -1 if there is no 'X')
        first_X_pos = disorder.find('X')
        last_X_pos = disorder.rfind('X')
        range_missing = '-'.join([str(first_X_pos), str(last_X_pos)])

        reformat_seq = " ".join([header, str(num_missing), range_missing, str(sequence_length), seq])
        f.write(reformat_seq + '\n')

Some more tips:

Don't forget about Python's string methods. They will solve a lot of your problems directly. The documentation is very good.

If you had searched for how to do just part 2 or just part 3 of your question, you would have found answers elsewhere.
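
As a small illustration of the string-method tip above, here is how count(), find(), and rfind() behave on a short disorder string. The string below is a made-up example, not a line from the original file, and the variable names are only for illustration.

```python
# Hypothetical disorder string: '-' = coordinates present, 'X' = missing
disorder = "-----XX"

num_missing = disorder.count("X")  # number of 'X' characters
first_x = disorder.find("X")       # 0-based index of the first 'X', or -1 if none
last_x = disorder.rfind("X")       # 0-based index of the last 'X', or -1 if none

print(num_missing, first_x, last_x)  # prints: 2 5 6
```

Because the indices are 0-based, add 1 to each if you want residue numbers that start at 1.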

Thank you again. But I have another question. Your code works perfectly when there is only one segment of missing coordinates. In the example I provided above, there is only one such segment, located at the last two positions. But what if there are multiple segments, such as "--------XXXXX------XX-----X"? — Jlod888, Jul 11 '14 at 19:44
The count should still be fine. How would you want the range formatted in that case? There's some advice about getting all indices where x appears here: http://stackoverflow.com/questions/13009675/find-all-the-occurrences-of-a-character-in-a-string — hbuchman, Jul 11 '14 at 20:38
Thanks again! So in your code, the range for "-----XX" would be 5-6 (because it starts at 0). But with your code the range for "-XX--XX" would be 1-6, whereas I would like the range to show up as 1-2, 5-6 in this case. Sorry to bother you again! — Jlod888, Jul 11 '14 at 21:12
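
One possible way to report each segment separately, as asked in the last comment, is to find every run of consecutive X characters with a regular expression. This is only a sketch, not part of the accepted answer: the function name missing_ranges and its offset parameter are hypothetical, and it assumes the disorder string contains only '-' and 'X'.

```python
import re

def missing_ranges(disorder, offset=0):
    """Return a string such as '1-2, 5-6', one entry per run of 'X'.

    offset=0 reports 0-based positions; offset=1 reports 1-based
    residue numbers. An empty string is returned if there is no 'X'.
    """
    segments = []
    for match in re.finditer(r"X+", disorder):
        first = match.start() + offset
        last = match.end() - 1 + offset  # match.end() is exclusive
        segments.append(f"{first}-{last}")
    return ", ".join(segments)

print(missing_ranges("-XX--XX"))  # prints: 1-2, 5-6
print(missing_ranges("--------XXXXX------XX-----X", offset=1))
```

A segment that is a single residue long is reported with the same start and end position (for example 27-27); whether to collapse that to a single number is a formatting choice.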
