I have a GenBank file containing a number of sequences. A second file, a TSV, holds the names of these sequences along with some other information about them; I read it in as a pandas DataFrame. I used the .sample function to randomly select a name from this data and assigned it to the variable n_name, as shown in the block of code below.

n = df_bp_pos_2.sample(n = 1)
n_value = n.iloc[:2]
n_name = n.iloc[:1]

n_name matches the LOCUS name in the GenBank file, including case. I am trying to parse the GenBank file, which is named all.gb, and extract the sequence whose locus equals n_name. So far I have:

from Bio import SeqIO
for seq_record in SeqIO.parse("all.gb", "genbank"):

But I am not sure what the next line or two should be in order to select a record by locus. Any ideas?

Stoner
  • Check out the relevant Biopython tutorial section: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc36 – Chris_Rands Oct 31 '19 at 08:59
  • You are looking for `seq_record.features`; iterate through the features and look at `feature.qualifiers['locus_tag']`. Be aware that `locus_tag` is optional, and that its value is a list of strings. – Marek Schwarz Oct 31 '19 at 09:01

1 Answer

You could also use a list of locus tags instead of just one locus tag.

from Bio import SeqIO

locus_tags = ["b0001", "b0002"] # Example list of locus tags
records = []

for record in SeqIO.parse('all.gb', 'genbank'):
    for feature in record.features:
        tag = feature.qualifiers.get('locus_tag')
        if tag:
            if tag[0] in locus_tags:
                # Decide what to keep for each matching feature:
                #   - the feature's DNA sequence, via the extract method (used below as one example);
                #   - the protein, by accessing the 'translation' field in the qualifiers;
                #   - or a translation of the feature made on the fly.
                # Afterwards, append the result to `records`.
                records.append(feature.extract(record.seq))
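
The options named in the comment above could be sketched as two small helpers. This is only an illustration, not necessarily what the answer's author had in mind: the function names `feature_dna` and `feature_protein` are made up here, and the on-the-fly translation assumes the standard genetic code and stops at the first stop codon.

```python
def feature_dna(feature, record):
    """Return the feature's nucleotide sequence from its parent record.

    SeqFeature.extract follows the feature's location, including the
    reverse strand and joined locations.
    """
    return feature.extract(record.seq)


def feature_protein(feature, record):
    """Return the protein sequence for a feature as a plain string.

    Prefers the annotated 'translation' qualifier (a list of strings);
    otherwise translates the extracted DNA on the fly.
    Assumption: standard genetic code, translation stops at the first stop codon.
    """
    annotated = feature.qualifiers.get("translation")
    if annotated:
        return annotated[0]
    return str(feature.extract(record.seq).translate(to_stop=True))
```

Inside the loop, either `records.append(feature_dna(feature, record))` or `records.append(feature_protein(feature, record))` would then replace the placeholder comment. Gene and CDS features often share a locus tag, so checking `feature.type == "CDS"` first may avoid collecting the same locus twice.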

You can find more about the extract method and the feature qualifiers in the Biopython Cookbook.
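
If "locus" in the question means the name on the LOCUS line rather than a `locus_tag` qualifier, Biopython stores that name in `record.name`, so no feature loop is needed. A minimal sketch under that assumption follows; the helper names `find_record_by_locus` and `load_record_by_locus` are hypothetical.

```python
def find_record_by_locus(records, locus_name):
    """Return the first record whose LOCUS name equals locus_name, else None."""
    for record in records:
        # For GenBank input, SeqRecord.name holds the name from the LOCUS line.
        if record.name == locus_name:
            return record
    return None


def load_record_by_locus(path, locus_name):
    """Parse a GenBank file and return the matching record, or None."""
    # Imported here so find_record_by_locus works on any iterable of records.
    from Bio import SeqIO
    return find_record_by_locus(SeqIO.parse(path, "genbank"), locus_name)
```

With the question's file this would be called as `load_record_by_locus("all.gb", n_name)`, and the sequence read from the result's `.seq`. Note that `n_name` has to be a plain string for the comparison to work: `n.iloc[:1]` on the sampled row is still a one-row DataFrame, whereas something like `n.iloc[0, 0]` gives a scalar, assuming the name sits in the first column.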

seth-1