
I am writing a program that is supposed to return the minimum sequence alignment score (smaller = better), and it worked with the Coursera sample inputs, but for the dataset we're given, I can't manually input the sequences, so I have to resort to using a textfile. There are a few things which I found weird. First things first,

pattern = 'AAA'
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
        line=lines.strip().strip('\n')
        empty.append(line)
print(empty)
print(smallest_distance(pattern, DNA))    

If I run this, my program outputs 0. If I comment out the for loop, my program outputs 2. I didn't change DNA, so why should my program behave differently? Also, my strip('\n') is working (and for some reason, strip('n') works just as well), but my strip() is not working. Once I figure this out, I can test out empty in my smallest_distance function.

Here is what my data looks like:

ACTAG
CTTAGTATCACTCTGAAAAGAGATTCCGTATCGATGACCGCCAGTTAATACGTGCGAGAAGTGGACACGGCCGCCGACGGCTTCTACACGCTATTACGATG AACCAACAATTGCTCGAATCCTTCCTCAAAATCGCACACGTCTCTCTGGTCGTAGCACGGATCGGCGACCCACGCGTGACAGCCATCACCTATGATTGCCG 
TTAAGGTACTGCTTCATTGATCAACACCCCTCAGCCGGCAATCACTCTGGGTGCGGGCTGGGTTTACAGGGGTATACGGAAACCGCTGCTTGCCCAATAAT

etc...
DrJessop
  • please give practice_data.txt . You can post on gist.github.com and give us the link here. – Haha TTpro Aug 12 '17 at 15:30
  • The `for` loop consumes `DNA`. If you comment it out, it doesn't. This is likely to make a difference to the `smallest_distance(pattern, DNA)` call. – janos Aug 12 '17 at 15:30
  • [You may be interested in this CodeReview question.](https://codereview.stackexchange.com/questions/135217/matlab-implementation-of-needleman-wunsch-algorithm) – Joseph Farah Aug 12 '17 at 15:59

3 Answers

1

potential errors:

print(smallest_distance(pattern, DNA))  

DNA is a file object, not a list of strings, because DNA = open('practice_data.txt').

The for loop consumes DNA. So if smallest_distance iterates over it again with for lines in DNA:, it gets nothing.

Update: the for loop reads from the beginning of the file to the end, and the file object does not start over the way a list does. To read it again, either rewind it with DNA.seek(0), or call DNA.close() and open the file again with DNA = open('practice_data.txt').

A simple example you can try:

DNA = open('text.txt')
for lines in DNA:
        line=lines.strip().strip('\n')
        print (line) # print everything in the file here

print ('try again')
for lines in DNA:
        line=lines.strip().strip('\n')
        print (line) # will not print anything at all 

print ('done')

See "For loop not working twice on the same file descriptor" for more detail.
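As a complement, here is a minimal sketch of rewinding a file object with seek(0) instead of reopening it. The temporary file and the two short sequences are made up for illustration; they are not the asker's practice_data.txt.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The two sequences are illustrative only.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACTAG\nCTTAG\n")
    path = tmp.name

with open(path) as dna:
    first_pass = [line.strip() for line in dna]   # reads to the end of the file
    second_pass = [line.strip() for line in dna]  # already at the end: nothing left
    dna.seek(0)                                   # move back to the start of the file
    third_pass = [line.strip() for line in dna]   # reads everything again

os.remove(path)

print(first_pass)   # ['ACTAG', 'CTTAG']
print(second_pass)  # []
print(third_pass)   # ['ACTAG', 'CTTAG']
```

In practice it is usually simpler to read the lines into a list once and pass that list around, since a list can be iterated any number of times.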

Haha TTpro
1

Solution:

pattern = 'AAA'
with open('practice_data.txt') as f_dna:
    dna_list = [sequence for line in f_dna for sequence in line.split()]
print(smallest_distance(pattern, dna_list))

Explanation:

You were close to the solution, but you needed to replace strip() with split().

-> strip() removes extra characters from both ends of a string, so your strip('\n') was a good guess. But since \n is at the end of the line, split() gets rid of it automatically, because with no arguments it treats any whitespace (including \n) as a delimiter.

e.g.

>>> 'test\ntest'.split()
['test', 'test']

>>> 'test\n'.split()
['test']
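To make the difference concrete, here is a small sketch comparing strip(), strip('\n') and split() on a line shaped like the sample data (two sequences separated by a space, with a trailing space and newline). The short sequences are made up. It also suggests a likely reason why strip('n') seemed to "work" in the question: strip() had already removed the newline, and the argument of strip is a set of characters to trim, so neither strip('\n') nor strip('n') had anything left to remove.

```python
line = "ACGT AAGC \n"  # illustrative line: two sequences, trailing space and newline

stripped = line.strip()            # trims all leading/trailing whitespace
newline_only = line.strip("\n")    # trims only newline characters at the ends
pieces = line.split()              # splits on any run of whitespace

print(repr(stripped))      # 'ACGT AAGC'   (the inner space is still there)
print(repr(newline_only))  # 'ACGT AAGC '  (the trailing space survives)
print(pieces)              # ['ACGT', 'AAGC']

# strip(chars) removes any of the given characters from both ends.
print("nACGTn".strip("n"))              # ACGT
print("ACGT\n".strip("n") == "ACGT\n")  # True: 'n' is not the newline character
```

So strip() can never separate two sequences that share a line; only split() does that.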

Now you have to replace .append() with list concatenation (+=), since split() returns a list.

DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.split()
    empty += line

But there are still some problems in your code:

It is better to use the with statement when opening a file, because it closes the file automatically at the end of the block, even if an exception is raised:

empty = []
with open('practice_data.txt') as DNA:  
    for lines in DNA:
        line = lines.split()
        empty += line

Your code is now fine. You can still refactor it using a list comprehension (very common in Python):

with open('practice_data.txt') as DNA:
    empty = [sequence for line in DNA for sequence in line.split()]

If you struggle to understand this, try rewriting it with for loops:

empty = []
with open('practice_data.txt') as DNA:
    for line in DNA:
        for sequence in line.split():
            empty.append(sequence)

Note: @MrGeek's solution works, but it has two major drawbacks:

  • since it does not use a with statement, the file is never explicitly closed, so the file handle can stay open longer than needed,
  • .read().splitlines() loads ALL the content of the file into memory, which could lead to a MemoryError exception if the file is too big.

Going further, handling huge files:

Now imagine that you have a 1 GB file filled with DNA sequences. Even if you don't load the whole file into memory at once, you still end up with a huge list. A better practice is to create another file for the results and process your DNA on the fly:

e.g.

pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
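    for line in f_dna:  # iterate over the open file (the name DNA is not defined here)
        for sequence in line.split():
            result = smallest_distance(pattern, sequence)
            f_result.write(str(result) + '\n')  # write() needs a string; one result per line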

Warning: you will have to make sure your smallest_distance function accepts a single string rather than a list.

If that is not possible, you may need to process the sequences in batches instead, but since that is a little more complicated I will not cover it here.
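For readers who do need batches, one possible sketch is below. It is not part of the original answer's method: iter_sequences and batched are hypothetical helper names, io.StringIO stands in for an open file, and the sequences are made up. Each batch is a small list that could be passed to a list-based smallest_distance; how the per-batch results should be combined depends on what that function computes, which the question does not show.

```python
import io
from itertools import islice


def iter_sequences(file):
    """Yield whitespace-separated sequences one at a time (hypothetical helper)."""
    for line in file:
        yield from line.split()


def batched(iterable, size):
    """Yield lists of at most `size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# io.StringIO stands in for an open file; the content is illustrative only.
fake_file = io.StringIO("ACTAG CTTAG\nTTAAG\nGGCAT AACCA\n")
all_batches = list(batched(iter_sequences(fake_file), 2))

print(all_batches)
# [['ACTAG', 'CTTAG'], ['TTAAG', 'GGCAT'], ['AACCA']]
```

Only one batch is held in memory at a time when the batches are consumed in a loop rather than collected with list() as done here for display.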

Now you can refactor a bit, for example by using a generator function to improve readability:

def extract_sequence(file, pattern):
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for result in extract_sequence(f_dna, pattern):
        f_result.write(str(result) + '\n')  # write() needs a string; one result per line
Kruupös
  • Awesome explanation! Thank you for taking the time to explain why even though certain methods work in getting the answer, they shouldn't be used for other reasons. And yes, I do understand list comprehension, but thanks anyway for being thorough! – DrJessop Aug 12 '17 at 18:20
  • @DrJessop, it was for me also, before writing nested list-comprehension, I always do the `for loop`s first and refactor afterward ^^' anyway Im glad you enjoy the explanation! – Kruupös Aug 12 '17 at 18:28
0

Write:

pattern = 'AAA'
DNA = open('practice_data.txt').read().splitlines()
newDNA = []
for line in DNA:
  newDNA += line.split() # create an array with strings then concatenate it with the newDNA array
print(smallest_distance(pattern, newDNA))
DjaouadNM