2

Edit : I know feature.type will give gene/CDS and feature.qualifiers will then give "db_xref"/"locus_tag"/"inference" etc. Is there a feature. object which will allow me to access the location (eg: [5240:7267](+) ) directly?

This URL give a bit more info, though I can't figure out how to use it for my purpose... http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html#location_operator

Original Post:

I am trying to modify the location of features within a GenBank file. Essentially, I want to modify the following bit of a GenBank file:

 gene            5240..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5240..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 ...........................

to

 gene            5357..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5357..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 .............................

Note the changes from 5240 to 5357

So far, from scouring the internet and Stackoverflow, I have:

from Bio import SeqIO
gb_file = "mtbtomod.gb"
gb_record = SeqIO.parse(open(gb_file, "r+"), "genbank")
rvnumber = 'Rv0005'
newstart = 5357

final_features = []

for record in gb_record:
  for feature in record.features:
    if feature.type == "gene":
        if feature.qualifiers["locus_tag"][0] == rvnumber:
            if feature.location.strand == 1:
                feature.qualifiers["amend_position"] = "%s:%s" % (newstart, feature.location.end+1)
            else:
                # do the reverse for the complementary strand
    final_features.append(feature)
  record.features = final_features
  with open("testest.gb","w") as testest:
    SeqIO.write(record, testest, "genbank")

This basically creates a new qualifier called "amend_position".. however, what I would like to do is modify the location directly (with or without creating a new file...)

Rv0005 is just an example of a locus_tag I need to update. I have about 600 more locations to update, which explains the need for a script.. Help!

PyKa
  • 113
  • 1
  • 7
  • cant we use direct file operations – sundar nataraj Jul 08 '14 at 16:09
  • That would be ideal but I couldn't find the correct syntax – PyKa Jul 08 '14 at 16:12
  • dou want to replace all the 5240 to 5357 right? – sundar nataraj Jul 08 '14 at 16:14
  • Yes - although I have about 600 genes to look for, then edit the start or end location of each one - I have an excel sheet with the gene names and new locations. Do you mean an RE search/replace while processing the genbank file as a text file? – PyKa Jul 08 '14 at 16:16
  • i reallly dint understood ? can you explain clearly what you want – sundar nataraj Jul 08 '14 at 16:19
  • "Rv0005" is a gene. That gene has a location within the GenBank file, that is "5240..7267". I need to change the 5240 to 5357, that is the new location will be "5357..7267". "Rv0075" is another gene, with location "83996..85168" to be changed to "83901..85168". There are in total about 4000 genes (ie, 4000 Rv numbers) within the GenBank file, and I need to modify about 600 of them. – PyKa Jul 08 '14 at 16:28

3 Answers3

1

Ok, I have something which now fully works. I'll post the code in case anyone ever needs something similar

__author__ = 'Kavin'

from Bio import SeqIO
from Bio import SeqFeature
import xlrd
import sys
import re

workbook = xlrd.open_workbook(sys.argv[2])
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

# Create dicts to store TSS data
TSS = {}
row = {}
# For each entry (row), store the startcodon and strand information
for i in range(2, sheet.nrows - 1):
    if data[i][5] < -0.7:   # Ensures BASS score is within significant range
        Gene = data[i][0]
        row['Direction'] = str(data[i][3])
        row['StartCodon'] = int(data[i][4])
        TSS[str(Gene)] = row
        row = {}
    else:
        i += 1

# Create an output filename based on input filename
outfile_init = re.search('(.*)\.(\w*)', sys.argv[1])
outfile = str(outfile_init.group(1)) + '_modified.' + str(outfile_init.group(2))

final_features = []
for record in SeqIO.parse(open(sys.argv[1], "r"), "genbank"):
    for feature in record.features:
        if feature.type == "gene" or feature.type == "CDS":
            if TSS.has_key(feature.qualifiers["locus_tag"][0]):
                newstart = TSS[feature.qualifiers["locus_tag"][0]]['StartCodon']
                if feature.location.strand == 1:
                    feature.location = SeqFeature.FeatureLocation(SeqFeature.ExactPosition(newstart - 1),
                                                                  SeqFeature.ExactPosition(
                                                                      feature.location.end.position),
                                                                  feature.location.strand)
                else:
                    feature.location = SeqFeature.FeatureLocation(
                        SeqFeature.ExactPosition(feature.location.start.position),
                        SeqFeature.ExactPosition(newstart), feature.location.strand)
        final_features.append(feature)  # Append final features
    record.features = final_features
    with open(outfile, "w") as new_gb:
        SeqIO.write(record, new_gb, "genbank")

This assumes usage such as python program.py <genbankfile> <excel spreadsheet>

This also assumes a spreadsheet of the following format:

Gene Synonym Tuberculist_annotated_start Orientation Re-annotated_start BASS_score

Rv0005 gyrB 5240 + 5357 -1.782

Rv0012 Rv0012 14089 + 14134 -1.553

Rv0018c pstP 23181 - 23172 -2.077

Rv0032 bioF2 34295 + 34307 -0.842

Rv0037c Rv0037c 41202 - 41163 -0.554

PyKa
  • 113
  • 1
  • 7
0

So, you can try something like below. As the number of change will be equal to the number of CDS/genes found in the file. You can read the locations/positions from csv/text file and make a list like I manually made change_values.

import re
f = open("test.txt")
change_values=["1111", "2222"]
flag = True
next = 0
for i in f.readlines():
    if i.startswith(' CDS') or i.startswith(' gene'):
        out = re.sub(r"\d+", str(change_values[next]), i)
                #Instead of print write
        print out
        flag = not flag
        if flag==True:
            next += 1
    else:
                #Instead of print write
        print i

Amy sample test.txt file looks like this:

 gene            5240..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5240..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 ...........................

 gene            5240..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5240..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 ...........................

Hope, this will solve your issue. Cheers!

Sharif Mamun
  • 3,508
  • 5
  • 32
  • 51
  • Thanks, that's an interesting take on the problem. The only issues are that I don't need to change all locations (only some genes need location changing) and I need a way of dealing with reverse strand (where the second number of the XXXX..YYYY entry needs changing). – PyKa Jul 09 '14 at 10:52
  • haha, you have so many rules for such a small file. You are not going get an answer from SO, I am pretty much sure about that :) – Sharif Mamun Jul 09 '14 at 17:00
0

I think this can be done with native biopython synthax, no regex needed, minimal working example here:

from Bio import SeqIO
from Bio import SeqFeature
import copy

gbk = SeqIO.read('./test_gbk', 'gb')

index = 1

feature_to_change = copy.deepcopy(gbk.features[index]) #depends what feature you want to change, 
#can also be done with loop if you want to change them all, or by some function...

new_start = 0
new_end = 100

new_feature_location = SeqFeature.FeatureLocation(new_start, new_end, feature.location.strand) #create a new feature location object

feature_to_change.location = new_feature_location #change old feature location

del gbk.features[index] #remove changed feature

gkb.features.append(feature_to_change) #add identical feature with new location

gbk.features = sorted(gbk.features, key = lambda feature: feature.location.start) # if you want them sorted by the start of the location, which is the usual case

SeqIO.write(gbk, './test_gbk_with_new_feature', 'gb')
Felix Z.
  • 317
  • 2
  • 12