I need to improve a function that parses a multi-FASTA file, checking for compression with try/except handling

Question

Hi guys, I am working with a huge gz-compressed FASTA file. I have a nice FASTA parser, but I would like to make it more general, so that it checks for compression and can parse either a gzipped or an uncompressed file.

I tried this code:

def is_header(line):
    return line[0] == '>'

def parse_multi_fasta_file_compressed_or_not(filename):
    if filename.endswith('.gz'):
        with gzip.open(filename, 'rt') as f:
            fasta_iter = (it[1] for it in itertools.groupby(f, is_header))
    else:
        with open(filename, 'r') as f:
            fasta_iter = (it[1] for it in itertools.groupby(f, is_header))
            for name in fasta_iter:
                name = name.__next__()[1:].strip()
                sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
                yield name, sequences

ref: https://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/
https://www.biostars.org/p/710/

I tried changing the indentation. Python doesn't raise any error, but it doesn't print or show any results either. I am using a toy file with 5 sequences.

As a reminder, a FASTA file looks something like this:

>header_1
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>header_2
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC

I would like to use try: ... except: ... instead of if.
If any of you have a tip to help me figure this out, I would appreciate it a lot (it's not a course exercise at all!).

Thank you for your time.

Paulo

I wrote a general solution using biopython https://stackoverflow.com/a/52839332/6260170 — Chris_Rands, Jan 27 '20 at 14:34
That is nice, Chris. Thank you for your time and attention. :) — Paulo Sergio Schlogl, Jan 27 '20 at 14:49
I tend to use `fastx_read` from the mappy package: it parses fastq and fasta, gzipped or not, transparently. See https://github.com/lh3/minimap2/blob/master/python/README.rst#miscellaneous-functions — bli, Jan 28 '20 at 10:35

score 1 · Accepted Answer · answered Jan 27 '20 at 14:40

It looks like you have over-indented your `for` loop: it sits inside the `else` branch, so it never runs for a `.gz` file. Try the following:

import gzip
import itertools

def is_header(line):
    # FASTA header lines start with '>'
    return line[0] == '>'

def parse_multi_fasta_file_compressed_or_not(filename):
    # Pick the opener once, based on the file extension
    if filename.endswith('.gz'):
        opener = lambda filename: gzip.open(filename, 'rt')
    else:
        opener = lambda filename: open(filename, 'r')

    with opener(filename) as f:
        # groupby alternates between header groups and sequence-line groups
        fasta_iter = (it[1] for it in itertools.groupby(f, is_header))
        for name in fasta_iter:
            name = next(name)[1:].strip()
            sequences = ''.join(seq.strip() for seq in next(fasta_iter))
            yield name, sequences

I've also rearranged things a little so you can use the `with` block as you did before. The conditional at the beginning assigns to `opener` a function which can open the given file depending on whether it is gzipped or not.
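Since the question asks for try/except rather than an extension check, here is a minimal sketch of that variant. It is not the code above: the helper name `open_maybe_gzip` and the function name `parse_fasta` are made up for illustration. It relies on the fact that reading from a non-gzip file through `gzip.open` raises `OSError` (`gzip.BadGzipFile` on Python 3.8+, which is a subclass of `OSError`), so the file name no longer has to end in `.gz`.

```python
import gzip
import itertools


def is_header(line):
    # FASTA header lines start with '>'
    return line.startswith('>')


def open_maybe_gzip(filename):
    """Hypothetical helper: return a text-mode handle, gzipped or not."""
    try:
        # Reading one byte forces gzip to validate the file's magic header.
        with gzip.open(filename, 'rb') as probe:
            probe.read(1)
    except OSError:
        # Not gzip data, so fall back to a plain text file.
        # (A missing file also lands here and fails again in open().)
        return open(filename, 'r')
    return gzip.open(filename, 'rt')


def parse_fasta(filename):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    with open_maybe_gzip(filename) as f:
        name = None
        # groupby alternates between header groups and sequence-line groups
        for header_group, lines in itertools.groupby(f, is_header):
            if header_group:
                name = next(lines)[1:].strip()
            elif name is not None:
                yield name, ''.join(line.strip() for line in lines)
```

Note that `parse_fasta` is a generator, so nothing is read or printed until you iterate over it, for example with `for name, seq in parse_fasta('toy.fasta'): print(name, len(seq))` (the file name here is a placeholder).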

Hi Michael thank you for your time and attention. I will check it out. 8) — Paulo Sergio Schlogl, Jan 27 '20 at 14:44
