1

Suppose I have a file with n DNA sequences, one per line. I need to read them into a list, then calculate each sequence's length and the total length of all of them together. I am not sure how to do that once they are in a list.

# open the file and print each sequence's length
f= open('seq.txt' , 'r')
for line in f:
    line= line.strip()
    print (line)
    print ('this is the length of the given sequence', len(line))

# turning into a list:  
lines = [line.strip() for line in open('seq.txt')]
print (lines)

How can I do calculations on the list, for example the total length of all sequences together, or the standard deviation of their lengths?

Lafexlos
  • I gave the answer that shows how to build a list of the lengths of the individual sequences as well as the total. Once you have the list of lengths you can do the statistics on it. – sabbahillel Sep 01 '16 at 15:26

6 Answers

2

Try this to output the individual length and calculate the total length:

    lines = [line.strip() for line in open('seq.txt')]
    total = 0
    for line in lines:
        print('this is the length of the given sequence: {}'.format(len(line)))
        total += len(line)
    print('this is the total length: {}'.format(total))
gus27
1

Look into the statistics module. You'll find all kinds of measures of averages and spreads.

You'll get the length of any sequence using len.

In your case, you'll want to map the sequences to their lengths:

from statistics import stdev

with open("seq.txt") as f:
    lengths = [len(line.strip()) for line in f]

print("Number of sequences:", len(lengths))
print("Standard deviation:", stdev(lengths))

Edit: as asked in the comments, here's how to split the sequences into different files depending on their lengths:

from statistics import stdev, mean
with open("seq.txt") as f:
    sequences = [line.strip() for line in f]
lengths = [len(sequence) for sequence in sequences]

mean_ = mean(lengths)
stdev_ = stdev(lengths)

with open("below.txt", "w") as below, open("above.txt", "w") as above, open("normal.txt", "w") as normal:
    for sequence in sequences:
        if len(sequence) > mean_ + stdev_:
            above.write(sequence + "\n")
        elif len(sequence) >= mean_ - stdev_:  # within one standard deviation of the mean
            normal.write(sequence + "\n")
        else:
            below.write(sequence + "\n")
L3viathan
  • I don't have numerical data on my files. Only strings of DNA. When I do len(lines), it gives me the number of strings I have, but not the total length of all string together. – Marina Mitie Monobe Sep 01 '16 at 15:26
  • I tried to use float or int, to see if the length would work, not it didn't. – Marina Mitie Monobe Sep 01 '16 at 15:26
  • If you want the total length of all sequences, use `sum(lengths)`. – L3viathan Sep 01 '16 at 15:27
  • That worked! Thanks! Why it's not accepting print ("Mean of the length of my sequences:)", mean(lengths))? is there another code for mean? – Marina Mitie Monobe Sep 01 '16 at 15:58
  • You'll also need to import mean, by writing `from statistics import stdev, mean`. Or just do `import statistics` and do `statistics.mean(lengths)`. – L3viathan Sep 01 '16 at 16:00
  • gotcha! so either I tell each thing I want to import or I just say import statistics? Thanks you are awesome! I was doing all by hand the codes like: mean = (line[0] + ...) you know? I knew how to do the python operations from the keyboard, but now that I need to use an interpreter I got little lost. – Marina Mitie Monobe Sep 01 '16 at 16:03
  • Exactly! You could even do `from statistics import *` to import everything, but I don't recommend it, as it gets confusing quickly. If you'd want to do it manually, mean would be `sum(lengths) / len(lengths)`, no need to sum every line manually. – L3viathan Sep 01 '16 at 16:05
  • once I have the stdev, how can I classify my sequences by size.. I was trying to include: for line in f: if len(line) > (mean(lenghts)) + (stdv(lengths)) print (line)....... you know what I mean? I am trying to export my sequences in different files, where some will be > mean+stdv and other will be – Marina Mitie Monobe Sep 01 '16 at 16:31
  • I'll edit it in the answer. You want three clusters (within stdv, below, above)? – L3viathan Sep 01 '16 at 16:33
  • Yes. I need to organize my terrible file into 3 files clustering by their size. So I added outfile1 = 'seq_high.txt' outfile2 = 'seq_low.txt' outfile3 = 'seq_normal.txt' ouf1 = open( outfile1,'w') ouf2 = open (outfile2, 'w') ouf3 = open (outfile3, 'w'), in which within is normal, low = below and high = above – Marina Mitie Monobe Sep 01 '16 at 16:41
  • oo I see, I needed to open again the file! Ok, so I can do if.. for > , if .. for < and else for "within"? – Marina Mitie Monobe Sep 01 '16 at 16:45
1

The map and reduce functions can be useful to work on collections.

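import operator
from functools import reduce  # reduce is not a builtin in Python 3

# print each sequence and its length
with open('seq.txt', 'r') as f:
    for line in f:
        line = line.strip()
        print(line)
        print('this is the length of the given sequence', len(line))

# turning into a list:
with open('seq.txt') as f:
    lines = [line.strip() for line in f]
print(lines)

# map each sequence to its length, then add the lengths together (0 if the list is empty)
print('The total length is', reduce(operator.add, map(len, lines), 0))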
Jonas
0

Just a couple of remarks. Use `with` to handle files so you don't have to worry about closing or flushing them after you are done reading or writing. Also, since you are already looping through the file once, why not create the list in the same loop? You don't need to go through the file again.

# read the file once, collecting the sequences and their total length
with open('seq.txt', 'r') as f:
    sequences = []
    total_len = 0
    for line in f:
        new_seq = line.strip()
        sequences.append(new_seq)
        new_seq_len = len(new_seq)
        total_len += new_seq_len

print('number of sequences: {}'.format(len(sequences)))
print('total length: {}'.format(total_len))
longest = max(sequences, key=len)    # longest sequence (first one on ties)
shortest = min(sequences, key=len)   # shortest sequence (first one on ties)
print('biggest sequence: {}'.format(longest))
print('\t with length {}'.format(len(longest)))
print('smallest sequence: {}'.format(shortest))
print('\t with length {}'.format(len(shortest)))

I have included some post-processing info to give you an idea of how to go about it. If you have any questions just ask.

Ma0
  • thanks! So, yeah I created a list using: lines = [line.strip() for line in open('seq.txt')]. Then now my list is ['AGATAAGATAGTAGAT', 'GTAAGTGATGATAGTAGTA', etc]. However, once I try to do the length len(lines), it gives me only the total number of strings, but not the total length of all strings together. – Marina Mitie Monobe Sep 01 '16 at 15:21
  • @MarinaMitieMonobe this happens here `total_len += new_seq_len` – Ma0 Sep 01 '16 at 15:25
  • print('biggest sequence: {}'.format(max(sequences, key=lambda x: len(x), reversed=True))) TypeError: 'reversed' is an invalid keyword argument for this function – Marina Mitie Monobe Sep 01 '16 at 15:30
  • @MarinaMitieMonobe I had some mistakes which are now corrected. Sorry but i was doing it by memory. Please copy it again – Ma0 Sep 01 '16 at 15:31
  • At the same time you calculate the total create a list `sizes` and append `new_seq_len` to it to allow you to do various calculations later on that list.. – sabbahillel Sep 01 '16 at 15:31
  • @sabbahillel this can be done at any time if one has the strings. For example like `sizes = [len(x) for x in sequences]` and then maybe even `sequences_aug = list(zip(sequences, sizes))` – Ma0 Sep 01 '16 at 15:39
0

You have already seen how to get the list of sequences and a list of the lengths using append.

    lines = [line.strip() for line in open('seq.txt')]
    total = 0
    sizes = []
    for line in lines:
        mysize = len(line)
        total += mysize
        sizes.append(mysize)

Note that you can also use a for loop to read each line and append to the two lists rather than read every line into lists and then loop through lists. It is a matter of which you would prefer.
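A minimal sketch of that single-pass alternative. The function name `read_sequences` and the sample data are made up for illustration, and `io.StringIO` stands in for the open file so the snippet runs without `seq.txt`. Skipping blank lines is an assumption about the file format.

```python
import io

def read_sequences(handle):
    """Read one sequence per line, returning (sequences, sizes) built in one pass."""
    sequences = []
    sizes = []
    for line in handle:
        seq = line.strip()
        if not seq:  # skip blank lines (an assumption about the file format)
            continue
        sequences.append(seq)
        sizes.append(len(seq))
    return sequences, sizes

# In-memory stand-in for open('seq.txt'); the sequences are invented sample data
sample = io.StringIO("AGATAAGATAGTAGAT\nGTAAGTGATGATAGTAGTA\nACGT\n")
sequences, sizes = read_sequences(sample)
print(sizes)       # per-sequence lengths
print(sum(sizes))  # total length of all sequences
```

With a real file, the same function can be called as `read_sequences(f)` inside a `with open('seq.txt') as f:` block.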

You can use the statistics library (as of Python 3.4) for the statistics on the list of lengths.

statistics — Mathematical statistics functions

mean(): Arithmetic mean (“average”) of data.
median(): Median (middle value) of data.
median_low(): Low median of data.
median_high(): High median of data.
median_grouped(): Median, or 50th percentile, of grouped data.
mode(): Mode (most common value) of discrete data.
pstdev(): Population standard deviation of data.
pvariance(): Population variance of data.
stdev(): Sample standard deviation of data.
variance(): Sample variance of data.
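As a rough illustration of a few of these functions applied to a list of lengths (the numbers are invented sample data, standing in for the `sizes` list built earlier):

```python
from statistics import mean, median, pstdev, stdev

# Invented sample lengths; in practice this would be the `sizes` list
sizes = [16, 19, 4, 12, 9]

print("mean:", mean(sizes))
print("median:", median(sizes))
print("sample stdev:", stdev(sizes))       # divides by n - 1
print("population stdev:", pstdev(sizes))  # divides by n
```

Use `stdev` when the sequences are a sample from a larger set, and `pstdev` when the file holds the whole population you care about.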

You can also use the answers at Standard deviation of a list

Note that there is an answer that actually shows the code that was added to Python 3.4 for the statistics module. If you have an older version, you can use that code or get the statistics module code for your own system.
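If neither option is convenient, the sample standard deviation can also be computed by hand. This is only a minimal sketch of the textbook formula, not the code from the `statistics` module; the name `sample_stdev` is made up here.

```python
import math

def sample_stdev(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    avg = sum(values) / float(n)  # float() avoids integer division on Python 2
    squared_diffs = sum((x - avg) ** 2 for x in values)
    return math.sqrt(squared_diffs / (n - 1))

sizes = [16, 19, 4, 12, 9]  # invented sample lengths
print(sample_stdev(sizes))
```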

Community
sabbahillel
0

This will do what you require. To do additional calculations you may want to save your results from the text file into a list or set so you won't need to read from a file again.

total_length = 0  # Create a variable that will save our total length of lines read

with open('filename.txt', 'r') as f:
    for line in f:
        line = line.strip()
        total_length += len(line)  # Add the length to our total
        print("Line Length: {}".format(len(line)))

print("Total Length: {}".format(total_length))
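One possible way to keep the results around for later calculations, as suggested above. This is a sketch under assumptions: `io.StringIO` stands in for the opened file, and the sequences are invented sample data.

```python
import io

# Stand-in for open('filename.txt'); the sequences are invented sample data
handle = io.StringIO("ACGT\nACGTACGT\nTTGA\n")

lengths = [len(line.strip()) for line in handle]  # read once, keep the results

total_length = sum(lengths)
distinct_lengths = set(lengths)  # a set drops duplicate lengths
print("Total Length: {}".format(total_length))
print("Longest: {}".format(max(lengths)))
print("Distinct lengths: {}".format(sorted(distinct_lengths)))
```

A list keeps one entry per line in file order, so it suits sums and statistics; a set is only useful when you care about which lengths occur, not how often.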
A Magoon