I'm using the following regex to match a paragraph starting with the word "Summary":
([^\']*(?=Summary)[^\']*)
But it's matching all of the text: regex101a
I also tried:
(?<=Summary).*?(?=]\.)
This does not match anything: regex101b
I believe this has to do with the formatting of the text file, since the paragraph spans several lines.
Here is an example:
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AC105339.9 and FJ695193.1.
This sequence is a reference standard in the RefSeqGene project.
Summary: Adaptor protein complex 3 (AP-3 complex) is a
heterotrimeric protein complex involved in the formation of
clathrin-coated synaptic vesicles. The protein encoded by this gene
represents the beta subunit of the neuron-specific AP-3 complex and
was first identified as the target antigen in human paraneoplastic
neurologic disorders. The encoded subunit binds clathrin and is
phosphorylated by a casein kinase-like protein, which mediates
synaptic vesicle coat assembly. Defects in this gene are a cause of
early-onset epileptic encephalopathy. [provided by RefSeq, Feb
2017].
PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP
1-35060 AC105339.9 88079-123138
35061-35259 FJ695193.1 1-199 c
35260-57628 AC105339.9 123337-145705
And this is what I am aiming to match:
Summary: Adaptor protein complex 3 (AP-3 complex) is a
heterotrimeric protein complex involved in the formation of
clathrin-coated synaptic vesicles. The protein encoded by this gene
represents the beta subunit of the neuron-specific AP-3 complex and
was first identified as the target antigen in human paraneoplastic
neurologic disorders. The encoded subunit binds clathrin and is
phosphorylated by a casein kinase-like protein, which mediates
synaptic vesicle coat assembly. Defects in this gene are a cause of
early-onset epileptic encephalopathy. [provided by RefSeq, Feb
2017].
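
For reference, here is a minimal sketch of both behaviours in Python. Python is an assumption, since the regex flavor is not stated above, and `text` is a shortened stand-in for the record, not the real file. The first pattern matches everything because `[^\']*` also consumes newlines and the text contains no apostrophe. The second matches nothing because `.` does not cross line breaks by default, and `].` sits on a later line than `Summary`. Letting `.` span lines and including both delimiters in the match gives the target paragraph:

```python
import re

# Shortened stand-in for the record above (line breaks preserved).
text = (
    "This sequence is a reference standard in the RefSeqGene project.\n"
    "Summary: Adaptor protein complex 3 (AP-3 complex) is a\n"
    "heterotrimeric protein complex. [provided by RefSeq, Feb\n"
    "2017].\n"
    "PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP\n"
)

# Second attempt as written: "." stops at each newline, so the lazy
# ".*?" can never reach the "]." on a later line and the search fails.
no_flag = re.search(r"(?<=Summary).*?(?=]\.)", text)

# With re.DOTALL, "." also matches newlines. The lazy ".*?" stops at the
# first "].", and both "Summary:" and "]." are kept in the match.
SUMMARY_RE = re.compile(r"Summary:.*?\]\.", re.DOTALL)
match = SUMMARY_RE.search(text)
summary = match.group(0) if match else None

print(no_flag)
print(summary)
```

Note that this stops at the first `].` after `Summary:`, so it would cut the paragraph short if a summary ever contained `].` before its final line.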