Hi guys, I am working with a huge gzip-compressed FASTA file. I have a nice FASTA parser, but I would like to make it more general so that it checks for compression and can parse either a .gz file or an uncompressed one.
I tried this code:
import gzip
import itertools

def is_header(line):
    return line[0] == '>'

def parse_multi_fasta_file_compressed_or_not(filename):
    if filename.endswith('.gz'):
        with gzip.open(filename, 'rt') as f:
            fasta_iter = (it[1] for it in itertools.groupby(f, is_header))
    else:
        with open(filename, 'r') as f:
            fasta_iter = (it[1] for it in itertools.groupby(f, is_header))
    for name in fasta_iter:
        name = name.__next__()[1:].strip()
        sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
        yield name, sequences
References:
https://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/
https://www.biostars.org/p/710/
I tried changing the indentation. Python doesn't raise any error, but it doesn't print or show any results either. I am using a toy file with 5 sequences.
Just as a reminder, a FASTA file looks something like this:
>header_1
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>header_2
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC
I would like to use try: ... except: ... instead of if.
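A minimal sketch of how the try/except version could look, assuming Python 3. The helper names `open_maybe_gzip` and `parse_fasta` are hypothetical, and the sketch assumes every header is followed by at least one sequence line. It tries to read the file as gzip first and falls back to plain text when gzip rejects it.

```python
import gzip
import itertools


def is_header(line):
    """Return True for FASTA header lines, which start with '>'."""
    return line.startswith('>')


def open_maybe_gzip(filename):
    """Open filename as text: try gzip first, fall back to a plain open.

    Hypothetical helper. gzip.open() does not validate the file until
    the first read, so one character is read to trigger the check.
    """
    f = gzip.open(filename, 'rt')
    try:
        f.read(1)   # raises OSError (gzip.BadGzipFile on 3.8+) if not gzip
        f.seek(0)   # rewind so the parser sees the whole file
        return f
    except OSError:
        f.close()
        return open(filename, 'r')


def parse_fasta(filename):
    """Yield (name, sequence) tuples from a plain or gzipped FASTA file."""
    # The loop lives inside the with block, so the file stays open
    # for as long as the generator is being consumed.
    with open_maybe_gzip(filename) as f:
        # groupby alternates between header groups and sequence groups.
        groups = (group for _, group in itertools.groupby(f, is_header))
        for header_group in groups:
            name = next(header_group)[1:].strip()
            # Default to no lines if the file ends right after a header.
            sequence = ''.join(line.strip() for line in next(groups, ()))
            yield name, sequence
```

Since `parse_fasta` is a generator, calling it alone produces no output; it has to be iterated, for example with `for name, seq in parse_fasta('toy.fasta.gz'): print(name, len(seq))`. Keeping the loop inside the `with` block also matters: once the `with` block has exited, the file is closed and the lazy `groupby` iterator can no longer read from it.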
If any of you have a tip to help me figure this out, I would appreciate it a lot (this is not a course exercise!).
Thank you for your time.
Paulo