0

I want to return the nth basepair given my fna.gz genome input. Theoretically it would work like this:

allele = genome[14325]
print(allele)
#: G

This is the code I have now:

from Bio import SeqIO
import gzip
from Bio.Alphabet import generic_dna

input_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\GCF_000001405.38_GRCh38.p12_genomic.fna.gz"
output_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\Probabilities"

with gzip.open(input_file, "rt") as handle:
    for record in SeqIO.parse(input_file, "fasta", generic_dna):
        fasta_sequences = SeqIO.parse(open(input_file), 'fasta')
        print("seq parsed")
        with open(output_file) as out_file:
            for fasta in fasta_sequences:
                name, sequence = fasta.id, str(fasta.seq)
                new_allele = tell_basepair(sequence)
                write_fasta(out_file)

def tell_basepair(n, seq):
  bp = seq[n-1]
  return bp

but it doesn't work and I get an error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 386: character maps to <undefined>
  • does it work if you replace "some_function(sequence)" with just "sequence"? – Oka Apr 02 '19 at 22:26
  • 1
    Possible duplicate of [SeqIO.parse on a fasta.gz](https://stackoverflow.com/questions/42757283/seqio-parse-on-a-fasta-gz) – Chris_Rands Apr 03 '19 at 07:42
  • @Blake Young : is the issue solved or do you still get the error? – Oka Apr 05 '19 at 12:40
  • @Oka Thanks for your responses. I've tried what you said and am still getting the error I had before. I'm now using the code editted above – Blake Young Apr 11 '19 at 03:16
  • @BlakeYoung While in your current code you do open the .gz file as handle, you don´t use the opened file in your `SeqIO.parse` call, but instead you are trying to open the .gz file _again_ with `SeqIO.parse`. Try to modify that line into: " for record in SeqIO.parse(handle, "fasta", generic_dna)..." – Oka Apr 11 '19 at 14:54
  • @BlakeYoung Also, what are you trying to do - e.g. what is the desired output? Cause if you just need to print the n-th allele from sequence, you don´t need the output as fasta (which is in your code now) – Oka Apr 11 '19 at 22:07

1 Answers1

0

You could

  • try to open .gz file first before reading it as fasta [1]:
    with gzip.open("practicezip.fasta.gz", "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            #your code
  • if the error persists, specify the encoding in your parse-function. As BioPython SeqIO manual states: "For file formats like FASTA where the alphabet cannot be determined, it may be useful to specify the alphabet explicitly". So:
    from Bio import SeqIO
    from Bio.Alphabet import generic_dna 
    filename = "yourfastafilename"
    for record in SeqIO.parse(filename, "fasta", generic_dna):
        # your code

In addition to the UnicodeDecodeError error, you may also want to define your function some_function(sequence), otherwise Python will not know what to do when you call it. For example:

def tell_basepair(n, seq):
  bp = seq[n-1]
  return bp
Oka
  • 1,318
  • 6
  • 11
  • this is wrong, they need to use `gzip.open()` see the duplicate – Chris_Rands Apr 03 '19 at 07:43
  • can you explain how a `UnicodeDecodeError` could be fixed by adding the `generic_dna` alphabet? i can't imagine how myself – Chris_Rands Apr 03 '19 at 14:15
  • In this case the error is most likely caused by the .gz file (as you correctly pointed out). Sometimes, however, encoding error can be due to the fact that encoding couldn´t be inferred from the file (fasta). If that is the case, this hopefully would help. – Oka Apr 03 '19 at 14:50
  • the alphabet is just a property of the `Seq` class, it is not the same as the encoding – Chris_Rands Apr 03 '19 at 15:18
  • You´re right, its not. And encoding would be a third potential cause of the error (at least according to bugreports). – Oka Apr 03 '19 at 15:46