How can you analyse fna.gz in python?

Question

I want to return the nth basepair given my fna.gz genome input. Theoretically it would work like this:

allele = genome[14325]
print(allele)
#: G

This is the code I have now:

from Bio import SeqIO
import gzip
from Bio.Alphabet import generic_dna

input_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\GCF_000001405.38_GRCh38.p12_genomic.fna.gz"
output_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\Probabilities"

with gzip.open(input_file, "rt") as handle:
    for record in SeqIO.parse(input_file, "fasta", generic_dna):
        fasta_sequences = SeqIO.parse(open(input_file), 'fasta')
        print("seq parsed")
        with open(output_file) as out_file:
            for fasta in fasta_sequences:
                name, sequence = fasta.id, str(fasta.seq)
                new_allele = tell_basepair(sequence)
                write_fasta(out_file)

def tell_basepair(n, seq):
  bp = seq[n-1]
  return bp

but it doesn't work and I get an error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 386: character maps to <undefined>

does it work if you replace "some_function(sequence)" with just "sequence"? — Oka, Apr 02 '19 at 22:26
Possible duplicate of [SeqIO.parse on a fasta.gz](https://stackoverflow.com/questions/42757283/seqio-parse-on-a-fasta-gz) — Chris_Rands, Apr 03 '19 at 07:42
@Blake Young : is the issue solved or do you still get the error? — Oka, Apr 05 '19 at 12:40
@Oka Thanks for your responses. I've tried what you said and am still getting the error I had before. I'm now using the code editted above — Blake Young, Apr 11 '19 at 03:16
@BlakeYoung While in your current code you do open the .gz file as handle, you don´t use the opened file in your `SeqIO.parse` call, but instead you are trying to open the .gz file _again_ with `SeqIO.parse`. Try to modify that line into: " for record in SeqIO.parse(handle, "fasta", generic_dna)..." — Oka, Apr 11 '19 at 14:54
@BlakeYoung Also, what are you trying to do - e.g. what is the desired output? Cause if you just need to print the n-th allele from sequence, you don´t need the output as fasta (which is in your code now) — Oka, Apr 11 '19 at 22:07

Oka · Answer 1 · 2019-04-04T11:21:15.397

0

You could

try to open .gz file first before reading it as fasta [1]:

    with gzip.open("practicezip.fasta.gz", "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            #your code

if the error persists, specify the encoding in your parse-function. As BioPython SeqIO manual states: "For file formats like FASTA where the alphabet cannot be determined, it may be useful to specify the alphabet explicitly". So:

    from Bio import SeqIO
    from Bio.Alphabet import generic_dna 
    filename = "yourfastafilename"
    for record in SeqIO.parse(filename, "fasta", generic_dna):
        # your code

In addition to the UnicodeDecodeError error, you may also want to define your function some_function(sequence), otherwise Python will not know what to do when you call it. For example:

def tell_basepair(n, seq):
  bp = seq[n-1]
  return bp

edited Apr 04 '19 at 11:21

answered Apr 02 '19 at 22:46

Oka

1,318
6
11

this is wrong, they need to use `gzip.open()` see the duplicate – Chris_Rands Apr 03 '19 at 07:43
can you explain how a `UnicodeDecodeError` could be fixed by adding the `generic_dna` alphabet? i can't imagine how myself – Chris_Rands Apr 03 '19 at 14:15
In this case the error is most likely caused by the .gz file (as you correctly pointed out). Sometimes, however, encoding error can be due to the fact that encoding couldn´t be inferred from the file (fasta). If that is the case, this hopefully would help. – Oka Apr 03 '19 at 14:50
the alphabet is just a property of the `Seq` class, it is not the same as the encoding – Chris_Rands Apr 03 '19 at 15:18
You´re right, its not. And encoding would be a third potential cause of the error (at least according to bugreports). – Oka Apr 03 '19 at 15:46

How can you analyse fna.gz in python?

1 Answers1