Weird Value Change without Changing a Text File Python

Question

I am writing a program that is supposed to return the minimum sequence alignment score (smaller = better). It worked with the Coursera sample inputs, but for the dataset we're given I can't input the sequences manually, so I have to read them from a text file. There are a few things I found weird. First things first,

pattern = 'AAA'
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
        line=lines.strip().strip('\n')
        empty.append(line)
print(empty)
print(smallest_distance(pattern, DNA))

If I run this, my program outputs 0. If I comment out the for loop, my program outputs 2. I didn't change DNA, so why should my program behave differently? Also, my strip('\n') is working (and for some reason, strip('n') works just as well), but my strip() is not working. Once I figure this out, I can test empty in my smallest_distance function.

Here is what my data looks like:

ACTAG
CTTAGTATCACTCTGAAAAGAGATTCCGTATCGATGACCGCCAGTTAATACGTGCGAGAAGTGGACACGGCCGCCGACGGCTTCTACACGCTATTACGATG AACCAACAATTGCTCGAATCCTTCCTCAAAATCGCACACGTCTCTCTGGTCGTAGCACGGATCGGCGACCCACGCGTGACAGCCATCACCTATGATTGCCG 
TTAAGGTACTGCTTCATTGATCAACACCCCTCAGCCGGCAATCACTCTGGGTGCGGGCTGGGTTTACAGGGGTATACGGAAACCGCTGCTTGCCCAATAAT

etc...

please give practice_data.txt . You can post on gist.github.com and give us the link here. — Haha TTpro, Aug 12 '17 at 15:30
The `for` loop consumes `DNA`. If you comment it out, it doesn't. This is likely to make a difference to the `smallest_distance(pattern, DNA)` call. — janos, Aug 12 '17 at 15:30
[You may be interested in this CodeReview question.](https://codereview.stackexchange.com/questions/135217/matlab-implementation-of-needleman-wunsch-algorithm) — Joseph Farah, Aug 12 '17 at 15:59

Haha TTpro · Answer 1 · 2017-08-12T15:56:19.020

1

potential errors:

print(smallest_distance(pattern, DNA))

DNA is a file object, not a list of strings, because DNA = open('practice_data.txt').

The for loop consumes DNA. So if smallest_distance runs another for lines in DNA: loop, that loop finds nothing left to read.

Update: In this case, the for loop goes from the beginning of the file to the end. It does not start over the way iterating a list does, unless you call DNA.close() and re-initialize the file object with DNA = open('practice_data.txt').

A simple example you can try:

DNA = open('text.txt')
for lines in DNA:
        line=lines.strip().strip('\n')
        print (line) # print everything in the file here

print ('try again')
for lines in DNA:
        line=lines.strip().strip('\n')
        print (line) # will not print anything at all 

print ('done')

Read "For loop not working twice on the same file descriptor" for more detail.
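As a small sketch of the same point, the snippet below writes a throwaway file (the contents are made up for illustration), iterates over it twice, and then rewinds with seek(0). Rewinding is offered here as an alternative to closing and reopening, not as something the original answer proposed.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
# The sequences are illustrative, not the asker's real data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACTAG\nCTTAG AACCA\n")
    path = tmp.name

with open(path) as dna:
    first_pass = [line.strip() for line in dna]   # reads through to end of file
    second_pass = [line.strip() for line in dna]  # already at the end: nothing left
    dna.seek(0)                                   # move back to the start of the file
    third_pass = [line.strip() for line in dna]   # the lines are available again

os.remove(path)

print(first_pass)   # ['ACTAG', 'CTTAG AACCA']
print(second_pass)  # []
print(third_pass)   # ['ACTAG', 'CTTAG AACCA']
```

In practice it is usually simpler to read the file into a list once and pass that list around, since a list can be iterated as many times as needed.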

edited Aug 12 '17 at 15:56

answered Aug 12 '17 at 15:36

Haha TTpro


I keep hearing this word "consumes". What do you mean by that? Consumes system memory? – DrJessop Aug 12 '17 at 15:42
Thanks for the resource – DrJessop Aug 12 '17 at 16:31
In this context, "consumes" just means "reads lines from the file, one by one, until the entire file has been read". – John Gordon Aug 12 '17 at 19:37

score 1 · Accepted Answer · answered Aug 12 '17 at 18:13

Solution:

pattern = 'AAA'
with open('practice_data.txt') as f_dna:
    dna_list = [sequence for line in f_dna for sequence in line.split()]
print(smallest_distance(pattern, dna_list))

Explanation:

You were close to the solution, but you needed to replace strip() with split().

-> strip() removes extra characters from the ends of the string, so your strip('\n') was a good guess. But since \n is at the end of the line, split() gets rid of it automatically because it counts as a delimiter, and it also separates sequences that share a line.
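A quick sketch of the difference, using a made-up line shaped like the data in the question (two sequences, a trailing space, and a newline). It also hints at why strip('n') appeared to "work" in the question: by then strip() had already removed the newline, and strip('n') only removes literal n characters from the ends, of which there are none.

```python
line = "CTTAG AACCA \n"  # illustrative line: two sequences, trailing space, newline

stripped = line.strip()  # trims whitespace at both ends only; the inner space stays
pieces = line.split()    # splits on any run of whitespace, dropping the newline too

# strip('n') removes the letter 'n' from the ends, not the newline character.
after_n = "ACTAG\n".strip().strip("n")

print(repr(stripped))  # 'CTTAG AACCA'
print(pieces)          # ['CTTAG', 'AACCA']
print(repr(after_n))   # 'ACTAG'
```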

e.g.

>>> 'test\ntest'.split()
['test', 'test']

>>> 'test\n'.split()
['test']

Now you have to replace .append() with list concatenation, since split() returns a list.

DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.split()
    empty += line

But there are still some problems in your code:

It is better to use the with statement when opening a file, because it closes the file at the end of the block, even if an exception is raised:

empty = []
with open('practice_data.txt') as DNA:  
    for lines in DNA:
        line = lines.split()
        empty += line

Your code is now fine, but you can still refactor it with a list comprehension (very common in Python):

with open('practice_data.txt') as DNA:
    empty = [sequence for line in DNA for sequence in line.split()]

If you struggle to understand this, try rewriting it with for loops:

empty = []
with open('practice_data.txt') as DNA:
    for line in DNA:
        for sequence in line.split():
            empty.append(sequence)

Note: @MrGeek's solution works, but it has two major drawbacks:

as it does not use a with statement, the file is never explicitly closed, which can leak the open file handle,
using .read().splitlines() loads ALL the content of the file into memory, which could lead to a MemoryError exception if the file is too big.
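A minimal sketch of the first drawback, using a throwaway file with made-up contents: a handle opened without with stays open until close() is called, while the with block closes it on exit. The variable names here are illustrative only.

```python
import os
import tempfile

# Throwaway file so the example is self-contained (contents are illustrative).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACTAG\nCTTAG AACCA\n")
    path = tmp.name

# Without `with`: read() loads the whole file at once, and the handle
# stays open until close() is called explicitly.
handle = open(path)
sequences_eager = handle.read().split()
open_after_read = not handle.closed
handle.close()

# With `with`: lines are read one at a time, and the file is closed on exit.
with open(path) as f_dna:
    sequences_lazy = [seq for line in f_dna for seq in line.split()]
closed_after_with = f_dna.closed

os.remove(path)

print(open_after_read)    # True
print(closed_after_with)  # True
print(sequences_eager == sequences_lazy)  # True
```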

Go further, handle huge files:

Now imagine that you have a 1 GB file filled with DNA sequences. Even if you don't load the whole file into memory, you still end up with a huge list. A better practice is to create another file for the results and process your DNA on the fly:

e.g.

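pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for line in f_dna:  # iterate over the opened file, one line at a time
        for sequence in line.split():
            result = smallest_distance(pattern, sequence)
            # write() needs a string; put each score on its own line
            f_result.write(f'{result}\n')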

Warning: You will have to make sure your smallest_distance function accepts a string rather than a list.

If that is not possible, you may need to process the sequences in batches instead, but since that is a little complicated I will not cover it here.
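For readers curious about the batch idea, here is one possible sketch, not the answerer's method. The helpers iter_sequences and batches are hypothetical names introduced for illustration; each batch is a small list that could be handed to a smallest_distance that expects a list.

```python
from itertools import islice

def iter_sequences(lines):
    """Yield each whitespace-separated sequence from an iterable of lines."""
    for line in lines:
        yield from line.split()

def batches(items, size):
    """Yield lists of at most `size` items from any iterable (hypothetical helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))  # take up to `size` items
        if not batch:
            return  # the iterator is exhausted
        yield batch

# Illustrative lines standing in for an open file object.
lines = ["ACTAG\n", "CTTAG AACCA \n", "TTAAG\n"]
result = list(batches(iter_sequences(lines), 2))
print(result)  # [['ACTAG', 'CTTAG'], ['AACCA', 'TTAAG']]
```

Only one batch is held in memory at a time when the batches are consumed in a loop instead of being collected with list().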

Now you can refactor a bit, for example using a generator function to improve readability:

def extract_sequence(file, pattern):
    # Lazily yield one score per whitespace-separated sequence in the file
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for result in extract_sequence(f_dna, pattern):
        f_result.write(f'{result}\n')  # write() needs a string
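To try the generator without the real data, the sketch below swaps the files for in-memory io.StringIO objects. The thread never shows smallest_distance, so the version here is purely an assumed stand-in (fewest mismatches between the pattern and any window of the same length), not the asker's function.

```python
import io

def smallest_distance(pattern, sequence):
    """ASSUMED stand-in: fewest mismatches between pattern and any same-length window.

    Raises ValueError if the sequence is shorter than the pattern.
    """
    k = len(pattern)
    return min(
        sum(a != b for a, b in zip(pattern, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    )

def extract_sequence(file, pattern):
    # Lazily yield one score per whitespace-separated sequence
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

f_dna = io.StringIO("ACTAG\nCTTAG AAACC \n")  # illustrative input
f_result = io.StringIO()                       # stands in for result.txt
for result in extract_sequence(f_dna, "AAA"):
    f_result.write(f"{result}\n")

output = f_result.getvalue()
print(output)  # prints 2, 2 and 0 on separate lines
```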

Awesome explanation! Thank you for taking the time to explain why even though certain methods work in getting the answer, they shouldn't be used for other reasons. And yes, I do understand list comprehension, but thanks anyway for being thorough! — DrJessop, Aug 12 '17 at 18:20
@DrJessop, it was for me also, before writing nested list-comprehension, I always do the `for loop`s first and refactor afterward ^^' anyway Im glad you enjoy the explanation! — Kruupös, Aug 12 '17 at 18:28

DjaouadNM · Answer 3 · 2017-08-12T16:02:28.437

0

Write :

pattern = 'AAA'
DNA = open('practice_data.txt').read().splitlines()
newDNA = []
for line in DNA:
  newDNA += line.split() # create an array with strings then concatenate it with the newDNA array
print(smallest_distance(pattern, newDNA))

edited Aug 12 '17 at 16:02

answered Aug 12 '17 at 15:33

DjaouadNM


I attempted this, and for some reason I only get two strings in my list. One for the first line, and one for the rest – DrJessop Aug 12 '17 at 15:41
What role does empty play in here? – DjaouadNM Aug 12 '17 at 15:43
I wanted to create a list of all of the strings in the text file without having to manually input them. – DrJessop Aug 12 '17 at 15:44
There may be multiple strings on one line – DrJessop Aug 12 '17 at 15:46
how are the strings separated if they are on one line? – DjaouadNM Aug 12 '17 at 15:47
They are separated by a space. I tried using .strip() to get rid of this issue, but it isn't working – DrJessop Aug 12 '17 at 15:48
I figured out I need to use the replace method – DrJessop Aug 12 '17 at 15:58
glad to help :). – DjaouadNM Aug 12 '17 at 16:02
