I am writing a CSV reader that generates GenBank files, so that annotations are stored together with the sequence.

First I used Bio.SeqRecord and got correctly formatted output, but the SeqRecord class lacks fields that I need.

    FEATURES             Location/Qualifiers
         HCDR1           27..35
         HCDR2           50..66
         HCDR3           99..109

I switched to Bio.GenBank.Record, which has the fields I need, but now the feature formatting is wrong. The output must not contain the extra "type:", "location:" and "qualifiers:" labels, and each feature should be on a single line.

FEATURES Location/Qualifiers
type: HCDR1
location: [26:35]
qualifiers:
type: HCDR2
location: [49:66]
qualifiers:
type: HCDR3
location: [98:109]
qualifiers:

The code for pulling annotations is the same for both versions. Only the class changed.

    import datetime
    import getpass

    from Bio.GenBank.Record import Record
    from Bio.SeqFeature import FeatureLocation, SeqFeature

    # `row` is one CSV entry, a dict-like object keyed by column name.
    # Create a container with the data from this entry.
    container = Record()
    container.locus = row['Sample']
    container.size = len(row['Seq'])
    container.residue_type = "PROTEIN"
    container.data_file_division = "PRI"
    container.date = datetime.date.today().strftime("%d-%b-%Y")  # today's date
    container.definition = row['FullCloneName']
    container.accession = [row['Vgene'], row['HCDR3']]
    container.version = getpass.getuser()
    container.keywords = [row['ProjectName']]
    container.source = "test"
    container.organism = "Homo Sapiens"
    container.sequence = row['Seq']

    # Locate each CDR in the sequence and attach it as a feature.
    CDRS = ["HCDR1", "HCDR2", "HCDR3"]
    for CDR in CDRS:
        start = row['Seq'].find(row[CDR])  # 0-based start; -1 if the CDR is not found
        end = start + len(row[CDR])
        feature = SeqFeature(FeatureLocation(start=start, end=end), type=CDR)
        container.features.append(feature)

I have looked at the source code for Bio.GenBank.Record but can't figure out why a SeqFeature is formatted differently when it is written from a Bio.GenBank.Record than when it is written from a Bio.SeqRecord.

Is there an elegant fix, or do I need to write a separate tool to reformat the annotations in the GenBank file?

JoeT

1 Answer

After reading the source code again, I discovered that Bio.GenBank.Record has its own Feature class, whose key and location attributes are plain strings. Features built this way are formatted correctly in the output GenBank file.

    from Bio.GenBank.Record import Feature

    CDRS = ["HCDR1", "HCDR2", "HCDR3"]
    for CDR in CDRS:
        start = row['Seq'].find(row[CDR])
        end = start + len(row[CDR])
        feature = Feature()
        feature.key = CDR
        # GenBank locations are 1-based and inclusive, so shift the 0-based start by one.
        feature.location = "{}..{}".format(start + 1, end)
        container.features.append(feature)
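
For reference, the coordinate conversion can be sketched on its own in plain Python. Python string positions are 0-based and end-exclusive, while GenBank locations are 1-based and inclusive, which is why [26:35] in the question corresponds to 27..35. The helper names `to_genbank_location` and `cdr_location` are hypothetical (they are not part of Biopython), and raising an error when the CDR is not found is an assumption about the desired behaviour.

```python
def to_genbank_location(start, end):
    """Convert a 0-based, end-exclusive span to a GenBank 'a..b' string.

    Hypothetical helper, not part of Biopython.
    """
    if start < 0 or end <= start:
        raise ValueError("invalid span: [{}:{}]".format(start, end))
    # Only the start shifts: the exclusive end already equals the
    # 1-based inclusive end position.
    return "{}..{}".format(start + 1, end)


def cdr_location(seq, cdr):
    """Find cdr in seq and return its GenBank location string.

    str.find returns -1 when the substring is missing, so that case is
    turned into an explicit error instead of a bogus location.
    """
    start = seq.find(cdr)
    if start == -1:
        raise ValueError("CDR not found in sequence")
    return to_genbank_location(start, start + len(cdr))
```

The resulting string could then be assigned to `feature.location` in the loop above, so that a missing CDR fails loudly rather than writing a wrong location to the file.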
JoeT