I'm looking for a good R library to extract information from a GenBank (gbk) file.
This is the typical structure of the feature section of a gbk file:
     gene            complement(1..1002)
                     /gene="bla"
                     /locus_tag="VV1_RS00005"
                     /old_locus_tag="VV1_0001"
     CDS             complement(1..1002)
                     /gene="bla"
                     /locus_tag="VV1_RS00005"
                     /old_locus_tag="VV1_0001"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_011078129.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="class A beta-lactamase"
                     /protein_id="WP_011078129.1"
                     /translation="MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETIS
                     QRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID"
     gene            complement(1131..2111)
                     /locus_tag="VV1_RS00010"
                     /old_locus_tag="VV1_0002"
     CDS             complement(1131..2111)
                     /locus_tag="VV1_RS00010"
                     /old_locus_tag="VV1_0002"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_017029542.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="GTP-binding protein"
                     /protein_id="WP_043920887.1"
                     /translation="MSKKPIPVTILAGFLGAGKTTLLNHILTNANGMRMAVIVNDFGS
                     INVDAELVKSESDNMISLENGCVCCNLAEGLVVSVMRLLALEQRPDHIVVETSGISEP"
I want to extract the information associated with each CDS, in a FASTA-like format such as:
>gene|product|locus_tag|old_locus_tag|sequence:RefSeq|protein_id|complement
translation
For the first CDS, the result would be:
>bla|class A beta-lactamase|VV1_RS00005|VV1_0001|WP_011078129.1|WP_011078129.1|1:1002
MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETISQRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID
I need to do the same for all the other CDS features, of which there could be thousands.
Sorry, I have no idea how to do this in R.
Thanks
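
For reference, the transformation described above could be sketched in base R roughly as follows. This is only a minimal sketch under several assumptions, not a tested solution for complete files: the function name `gbk_cds_to_fasta` is made up for illustration, the input is assumed to contain feature-table lines like the excerpt above, each location is assumed to be a simple `start..end` range (optionally wrapped in `complement(...)`) on one line, and quoted values are assumed not to contain embedded quotes. Qualifiers missing from a CDS (such as `/gene` in the second one) are left as empty fields.

```r
# Turn GenBank feature-table lines into FASTA-style records, one per CDS.
# Assumption: `lines` looks like the excerpt in the question (feature lines
# followed by /qualifier lines, possibly wrapped over several lines).
gbk_cds_to_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  feats <- list()  # finished features, each a named list of strings
  cur <- NULL      # feature currently being read
  open <- NULL     # name of a quoted qualifier still missing its closing quote

  for (ln in lines) {
    if (!is.null(open)) {
      # Continuation of a wrapped value: sequence chunks are joined directly,
      # free text is joined with a space.
      sep <- if (open == "translation") "" else " "
      cur[[open]] <- paste(cur[[open]], ln, sep = sep)
      if (endsWith(ln, "\"")) open <- NULL
    } else if (startsWith(ln, "/")) {
      # A qualifier line such as /product="..." or /codon_start=1
      if (is.null(cur) || !grepl("=", ln, fixed = TRUE)) next
      name <- sub("^/([^=]+)=.*$", "\\1", ln)
      value <- sub("^/[^=]+=", "", ln)
      cur[[name]] <- value
      closed <- nchar(value) > 1 && endsWith(value, "\"")
      if (startsWith(value, "\"") && !closed) open <- name
    } else if (grepl("^\\S+\\s+\\S+$", ln)) {
      # "<feature key> <location>" starts a new feature.
      if (!is.null(cur)) feats[[length(feats) + 1]] <- cur
      parts <- strsplit(ln, "\\s+")[[1]]
      cur <- list(key = parts[1], location = parts[2])
    }
  }
  if (!is.null(cur)) feats[[length(feats) + 1]] <- cur

  # Value of one qualifier without its surrounding quotes ("" if absent).
  q <- function(f, name) {
    v <- f[[name]]
    if (is.null(v)) "" else gsub("^\"|\"$", "", v)
  }

  out <- character(0)
  for (f in feats) {
    if (f$key != "CDS") next
    # RefSeq accession taken from the /inference qualifier, if present.
    inf <- q(f, "inference")
    refseq <- if (grepl("RefSeq:", inf, fixed = TRUE)) {
      sub("^.*RefSeq:(\\S+).*$", "\\1", inf)
    } else {
      ""
    }
    # complement(1..1002) -> 1:1002 (simple ranges only)
    loc <- gsub("complement\\(|\\)", "", f$location)
    loc <- sub("..", ":", loc, fixed = TRUE)
    header <- paste(q(f, "gene"), q(f, "product"), q(f, "locus_tag"),
                    q(f, "old_locus_tag"), refseq, q(f, "protein_id"),
                    loc, sep = "|")
    out <- c(out, paste0(">", header), q(f, "translation"))
  }
  out
}
```

With hypothetical file names, it might be used as `writeLines(gbk_cds_to_fasta(readLines("genome.gbk")), "cds.faa")`. The header follows the seven-field template above (the RefSeq accession from `/inference`, then `protein_id`). Multi-part locations such as `join(...)` and repeated qualifiers on the same feature are not handled.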