I'm using the following regex to match a paragraph starting with the word "Summary":

([^\']*(?=Summary)[^\']*)

But it's matching all the text: regex101a

I also tried:

(?<=Summary).*?(?=]\.)

This does not match anything: regex101b

I believe this has to do with the formatting of the text file.

Here is an example:

COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AC105339.9 and FJ695193.1.
            This sequence is a reference standard in the RefSeqGene project.

        Summary: Adaptor protein complex 3 (AP-3 complex) is a
        heterotrimeric protein complex involved in the formation of
        clathrin-coated synaptic vesicles. The protein encoded by this gene
        represents the beta subunit of the neuron-specific AP-3 complex and
        was first identified as the target antigen in human paraneoplastic
        neurologic disorders. The encoded subunit binds clathrin and is
        phosphorylated by a casein kinase-like protein, which mediates
        synaptic vesicle coat assembly. Defects in this gene are a cause of
        early-onset epileptic encephalopathy. [provided by RefSeq, Feb
        2017].
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-35060             AC105339.9         88079-123138
            35061-35259         FJ695193.1         1-199               c
            35260-57628         AC105339.9         123337-145705

And this is what I am aiming to match:

    Summary: Adaptor protein complex 3 (AP-3 complex) is a
    heterotrimeric protein complex involved in the formation of
    clathrin-coated synaptic vesicles. The protein encoded by this gene
    represents the beta subunit of the neuron-specific AP-3 complex and
    was first identified as the target antigen in human paraneoplastic
    neurologic disorders. The encoded subunit binds clathrin and is
    phosphorylated by a casein kinase-like protein, which mediates
    synaptic vesicle coat assembly. Defects in this gene are a cause of
    early-onset epileptic encephalopathy. [provided by RefSeq, Feb
    2017].
haz
    Your second attempt is something like [`Summary.*?\]\.`](https://regex101.com/r/1cKirB/7). That would match from the first word "Summary" (wherever it is) until the next `].`, so it can fail in many cases. – Kobi Aug 31 '17 at 05:18

1 Answer

I think this is a robust pattern to match your paragraph (using the Multiline flag):

^\s+$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+

Working example: https://regex101.com/r/P6KlBa/2

  • "Summary" may also appear as the first word of a line in the middle of a paragraph, so we start by matching a blank line to make sure "Summary" is at the beginning of a paragraph.
  • ([ \t]+) captures the indentation (spaces or tabs) at the beginning of the line, so that \1 can require the same indentation on the following lines. Some flavors have \h for horizontal whitespace.
  • Summary.* - The first line starts with "Summary".
  • (?:\n\1[ \t]*\S.*)+ - Match one or more further non-empty lines with the same indentation.
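
For anyone applying this outside regex101, here is a minimal sketch in Python (my choice for illustration; the question does not say which language is used). The helper name `extract_summary` and the shortened sample text are made up for the example. The sample assumes the separator line before "Summary" contains spaces: since the pattern uses `^\s+$`, a single completely empty line there would probably need `^\s*$` instead.

```python
import re

# Pattern from the answer. re.MULTILINE makes ^ and $ match at line boundaries.
SUMMARY_RE = re.compile(
    r"^\s+$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+",
    re.MULTILINE,
)

# Shortened, hypothetical sample. The third line holds only spaces (assumption).
sample = (
    "COMMENT     REVIEWED REFSEQ: This record has been curated.\n"
    "            This sequence is a reference standard.\n"
    "            \n"
    "        Summary: Adaptor protein complex 3 is a\n"
    "        heterotrimeric protein complex. [provided by RefSeq, Feb\n"
    "        2017].\n"
    "PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER\n"
)


def extract_summary(text):
    """Return the Summary paragraph as a single line, or None if not found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    # The match includes the blank separator line; drop it and the indentation.
    lines = [line.strip() for line in match.group(0).splitlines()]
    return " ".join(line for line in lines if line)


print(extract_summary(sample))
```

The match stops before the PRIMARY line because that line does not begin with the indentation captured in group 1.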
Kobi