I want to strip part of the header annotations from a FASTA genome file so that only the locus tag and the protein description remain.

E.g., convert:

lcl|CP000438.1_cds_ABJ14958.1_2 [gene=dnaN] [locus_tag=PA14_00020] [protein=DNA polymerase III, beta chain] [protein_id=ABJ14958.1] [location=2056..3159] [gbkey=CDS] ATGCATTTCACCATTCAACGCGAAGCCCTGTTGAAACCGCTGCAACTGGTCGCCGGCGTCGTGGAACGCC GCCAGACATTGCCGGTTCTCTCCAACGTCCTGCTGGTGGTCGAAGGCCAGCAACTGTCGCTGACCGGCAC

to:

[locus_tag=PA14_00020] [protein=DNA polymerase III, beta chain] ATGCATTTCACCATTCAACGCGAAGCCCTGTTGAAACCGCTGCAACTGGTCGCCGGCGTCGTGGAACGCC GCCAGACATTGCCGGTTCTCTCCAACGTCCTGCTGGTGGTCGAAGGCCAGCAACTGTCGCTGACCGGCAC

I would like to modify all the headers in my FASTA file in this manner. I only recently started learning Python, so I'm not yet good at writing code for such tasks. I would greatly appreciate it if anyone could help.

  • Instead of trying to reinvent the wheel, take a look at the [`fasta`](https://pypi.org/project/fasta/) module. – accdias Oct 26 '22 at 16:29

1 Answer

Supposing that your header is a string, you can use a regular expression to isolate the components that look like [key=value], then filter them to keep only locus_tag and protein. Finally, you can build the target header string using join().

import re

PATTERN = re.compile(r"\[(\w*)=([^\]]*)\]")

header = "lcl|CP000438.1_cds_ABJ14958.1_2 [gene=dnaN] [locus_tag=PA14_00020] [protein=DNA polymerase III, beta chain] [protein_id=ABJ14958.1] [location=2056..3159] [gbkey=CDS]"
# obtain a list of (key, value) tuples
keyvalues: list[tuple[str, str]] = PATTERN.findall(header)
# keep only the wanted keys and format each one back as "[key=value]"
kept: list[str] = [f"[{k}={v}]" for k, v in keyvalues if k in ("locus_tag", "protein")]
# rebuild the header string
header = " ".join(kept)  # [locus_tag=PA14_00020] [protein=DNA polymerase III, beta chain]
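To apply the same idea to every header in a file, something along these lines might work. It is only a sketch: it assumes header lines start with ">" as in standard FASTA (the example in the question does not show that character), and the names `shorten_header`, `rewrite_fasta`, `rewrite_file`, and the file paths are made up for illustration.

```python
import re
from typing import Iterable, Iterator

PATTERN = re.compile(r"\[(\w*)=([^\]]*)\]")
KEEP = ("locus_tag", "protein")  # keys to retain in each header


def shorten_header(header: str) -> str:
    """Return only the [key=value] fields whose key is listed in KEEP."""
    fields = PATTERN.findall(header)
    return " ".join(f"[{k}={v}]" for k, v in fields if k in KEEP)


def rewrite_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines unchanged, except header lines, which are shortened."""
    for line in lines:
        if line.startswith(">"):  # assumption: ">" marks a header line
            yield ">" + shorten_header(line[1:].rstrip("\n")) + "\n"
        else:
            yield line  # sequence lines pass through untouched


def rewrite_file(src: str, dst: str) -> None:
    """Stream src to dst line by line, so large genomes are not loaded at once."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(rewrite_fasta(fin))
```

You would then call, for example, `rewrite_file("genome.fasta", "genome_short.fasta")` with your own paths. Note that a value containing a "]" character would be cut short by this pattern, so it is worth checking a few rewritten headers by eye.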
0x0fba