How to match string pattern in R

Question

I'm looking for a good library to extract information from a GenBank (gbk) file using R.

This is the typical structure of a gbk file:

 gene            complement(1..1002)
                 /gene="bla"
                 /locus_tag="VV1_RS00005"
                 /old_locus_tag="VV1_0001"
 CDS             complement(1..1002)
                 /gene="bla"
                 /locus_tag="VV1_RS00005"
                 /old_locus_tag="VV1_0001"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_011078129.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=11
                 /product="class A beta-lactamase"
                 /protein_id="WP_011078129.1"
                 /translation="MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETIS
                 QRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID"
 gene            complement(1131..2111)
                 /locus_tag="VV1_RS00010"
                 /old_locus_tag="VV1_0002"
 CDS             complement(1131..2111)
                 /locus_tag="VV1_RS00010"
                 /old_locus_tag="VV1_0002"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_017029542.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=11
                 /product="GTP-binding protein"
                 /protein_id="WP_043920887.1"
                 /translation="MSKKPIPVTILAGFLGAGKTTLLNHILTNANGMRMAVIVNDFGS
                 INVDAELVKSESDNMISLENGCVCCNLAEGLVVSVMRLLALEQRPDHIVVETSGISEP"

I want to extract the information associated with each CDS, in a format like:

>gene|product|locus_tag|old_locus_tag|sequence:RefSeq|protein_id|complement
translation

For the first CDS, it would look like:

>bla|class A beta-lactamase|VV1_RS00005|VV1_0001|WP_011078129.1|1:1002
MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETISQRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID

and the same for the rest of the CDS entries, of which there could be thousands.

Sorry, I have no idea how to do this in R.

Thanks

You might want to try https://bioconductor.org/packages/release/bioc/html/genbankr.html — jay.sf, Nov 24 '21 at 06:28
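
If a package is not an option, the extraction described in the question can be sketched in base R with regular expressions. This is only a starting point under several assumptions, not a full GenBank parser: the function names (`first_capture`, `get_qualifier`, `cds_to_fasta`) are made up for illustration; feature key lines such as `gene` or `CDS` are assumed to be indented by at most 8 spaces while qualifier lines are indented more deeply, as in the excerpt above; qualifier values are assumed not to contain escaped double quotes; and only the first `start..end` range of a location is kept, so `join(...)` locations are not handled. The header follows the seven-field template from the question, so the RefSeq accession from `/inference` and the `protein_id` both appear, and missing qualifiers are written as `NA`.

```r
# Return the first capture group of `pattern` in `text`, or NA if no match.
first_capture <- function(pattern, text) {
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Return the value of a quoted qualifier such as /product="...", or NA.
get_qualifier <- function(text, name) {
  first_capture(paste0("/", name, "=\"([^\"]*)\""), text)
}

# Turn the lines of a GenBank feature table into FASTA-style lines,
# one header line and one sequence line per CDS feature.
cds_to_fasta <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]
  # Assumption: a feature starts on a shallowly indented "key location" line.
  is_start <- grepl("^ {0,8}\\S+ +\\S+ *$", lines, perl = TRUE)
  blocks <- split(lines, cumsum(is_start))

  records <- lapply(blocks, function(block) {
    first <- strsplit(trimws(block[1]), " +")[[1]]
    if (first[1] != "CDS") return(NULL)

    # Join wrapped qualifier lines with a space so values can be matched whole.
    text <- paste(trimws(block[-1]), collapse = " ")

    fields <- c(
      get_qualifier(text, "gene"),
      get_qualifier(text, "product"),
      get_qualifier(text, "locus_tag"),
      get_qualifier(text, "old_locus_tag"),
      first_capture("sequence:RefSeq:([^\" ]+)", text),
      get_qualifier(text, "protein_id"),
      # "complement(1..1002)" becomes "1:1002"; strand information is dropped.
      sub(".*?([0-9]+)\\.\\.([0-9]+).*", "\\1:\\2", first[2], perl = TRUE)
    )
    fields[is.na(fields)] <- "NA"

    # Wrapped translation lines were joined with spaces; remove them again.
    translation <- get_qualifier(text, "translation")
    if (is.na(translation)) translation <- ""

    c(paste0(">", paste(fields, collapse = "|")), gsub(" ", "", translation))
  })

  unlist(records, use.names = FALSE)
}

# The first gene/CDS pair from the question, used as sample input.
example <- c(
  ' gene            complement(1..1002)',
  '                 /gene="bla"',
  '                 /locus_tag="VV1_RS00005"',
  '                 /old_locus_tag="VV1_0001"',
  ' CDS             complement(1..1002)',
  '                 /gene="bla"',
  '                 /locus_tag="VV1_RS00005"',
  '                 /old_locus_tag="VV1_0001"',
  '                 /inference="COORDINATES: similar to AA',
  '                 sequence:RefSeq:WP_011078129.1"',
  '                 /note="Derived by automated computational analysis using',
  '                 gene prediction method: Protein Homology."',
  '                 /codon_start=1',
  '                 /transl_table=11',
  '                 /product="class A beta-lactamase"',
  '                 /protein_id="WP_011078129.1"',
  '                 /translation="MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETIS',
  '                 QRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID"'
)

fasta <- cds_to_fasta(example)
writeLines(fasta)
```

For a whole file, something like `writeLines(cds_to_fasta(readLines("genome.gbk")), "cds.faa")` should work, where both file names are placeholders. Because the matching depends on indentation, it is worth checking the output for a few records by hand before trusting it on thousands of CDS entries.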

