Solution:
pattern = 'AAA'
with open('practice_data.txt') as f_dna:
    dna_list = [sequence for line in f_dna for sequence in line.split()]
print(smallest_distance(pattern, dna_list))
Explanation:
You were close to the solution, but you needed to replace strip() with split().
strip() only removes leading and trailing characters, so your strip('\n') was a good guess for dropping the newline.
But split() called without arguments splits on any whitespace, and \n counts as whitespace, so the trailing newline is discarded automatically.
e.g.
>>> 'test\ntest'.split()
['test', 'test']
>>> 'test\n'.split()
['test']
Now you have to replace .append() with list concatenation (+=), since split() returns a list.
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.split()
    empty += line
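To see why .append() has to go, here is a minimal sketch comparing the two operations on the list returned by split(). The sample line 'AAAT GGCA\n' is made up for illustration and is not taken from practice_data.txt.

```python
line = 'AAAT GGCA\n'  # made-up sample line, for illustration only

nested = []
nested.append(line.split())  # adds the whole list as ONE element

flat = []
flat += line.split()         # adds each sequence as its own element

print(nested)  # [['AAAT', 'GGCA']]
print(flat)    # ['AAAT', 'GGCA']
```

With .append() you end up with a list of lists, which is probably not what smallest_distance expects.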
However, there are still some problems in your code:
It is better to use the with statement when opening a file, because it closes the file automatically at the end of the block, even if an exception is raised:
empty = []
with open('practice_data.txt') as DNA:
    for lines in DNA:
        line = lines.split()
        empty += line
Your code is now fine, but you can still refactor it using a list comprehension (very common in Python):
with open('practice_data.txt') as DNA:
    empty = [sequence for line in DNA for sequence in line.split()]
If you struggle to understand this, try rewriting it with for loops:
empty = []
with open('practice_data.txt') as DNA:
    for line in DNA:
        for sequence in line.split():
            empty.append(sequence)
Note: @MrGeek's solution works, but has two major flaws:
- as it does not use a with statement, the file is never explicitly closed, which can leak the file descriptor,
- using .read().splitlines() loads ALL the content of the file in memory, which could lead to a MemoryError exception if the file is too big.
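The difference between the two reading styles can be sketched as follows. The helper names sequences_eager and sequences_lazy are hypothetical, and io.StringIO stands in for a real file so that the snippet runs on its own.

```python
import io

def sequences_eager(f):
    # read() pulls the entire content into one string before splitting.
    return [seq for line in f.read().splitlines() for seq in line.split()]

def sequences_lazy(f):
    # Iterating over the file object yields one line at a time.
    for line in f:
        yield from line.split()

data = 'AAAT GGCA\nTTTA\n'  # made-up sample content
eager = sequences_eager(io.StringIO(data))
lazy = list(sequences_lazy(io.StringIO(data)))

print(eager)  # ['AAAT', 'GGCA', 'TTTA']
print(lazy)   # ['AAAT', 'GGCA', 'TTTA']
```

Both give the same sequences, but only the eager version needs the whole file content in memory at once.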
Go further, handle huge files:
Now imagine that you have a 1 GB file filled with DNA sequences. Even if you don't load the whole file in memory, you still end up with a huge list of sequences. A better practice is to create another file for the results and process your DNA on the fly:
e.g.
pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for line in f_dna:
        for sequence in line.split():
            result = smallest_distance(pattern, sequence)
            # write() only accepts a string; one result per line
            f_result.write(f'{result}\n')
Warning: You will have to make sure your function smallest_distance accepts a string rather than a list.
If that is not possible, you may need to process the sequences in batches instead, but since that is a little more complicated I will not cover it here.
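As a rough pointer only, batching could look something like the sketch below. The names iter_sequences and iter_batches and the batch size of 2 are assumptions made for illustration, and io.StringIO stands in for the real file.

```python
import io
from itertools import islice

def iter_sequences(file):
    # Yield sequences one by one without loading the whole file.
    for line in file:
        yield from line.split()

def iter_batches(sequences, batch_size):
    # Group an iterable into lists of at most batch_size items.
    iterator = iter(sequences)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

f_dna = io.StringIO('AAAT GGCA\nTTTA CCCG\nAAAA\n')  # made-up sample content
batches = list(iter_batches(iter_sequences(f_dna), 2))
print(batches)  # [['AAAT', 'GGCA'], ['TTTA', 'CCCG'], ['AAAA']]
```

Each batch is a small list, so it could be passed to a smallest_distance that only accepts lists while keeping memory usage bounded.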
Now you can refactor a bit, using for example a generator function to improve readability:
def extract_sequence(file, pattern):
    # Lazily yield one result per sequence found in the file
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for result in extract_sequence(f_dna, pattern):
        # write() only accepts a string; one result per line
        f_result.write(f'{result}\n')
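If you want to try the streaming version without your real data, here is a self-contained sketch. The smallest_distance below is only a hypothetical stand-in (your own function from the question is not reproduced here), and io.StringIO objects replace the two files.

```python
import io

def smallest_distance(pattern, sequence):
    # Hypothetical stand-in, NOT the function from the question:
    # smallest gap between the start positions of two consecutive
    # matches of pattern, or -1 if there are fewer than two matches.
    positions = [i for i in range(len(sequence) - len(pattern) + 1)
                 if sequence.startswith(pattern, i)]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return min(gaps) if gaps else -1

def extract_sequence(file, pattern):
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

f_dna = io.StringIO('AAATTAAA GGCA\nAAAA\n')  # made-up sample content
f_result = io.StringIO()
for result in extract_sequence(f_dna, 'AAA'):
    f_result.write(f'{result}\n')

output = f_result.getvalue()
print(output)
```

Only one line of input and one result are held in memory at any time, whatever the size of the input.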