
I am trying to write a regex pattern to grab part of a string. The file contains certain headers, and all of the headers have the same format. I'm currently using Python and would like to keep it that way.

Here is an example file that I came across:

TI TEST TEST TEST TEST TEST TEST TEST TEST AJSAOISJAO SOAI
   ASASPAOS
SO EITCHA EITCHA EITCHA EITCHA EITCHA EITCHA EITCHA EITCHA 
AB Purpose
   To examine the evidence supporting the use of simulation-based assessments as surrogates for patient-related outcomes assessed in the workplace.
   Method
   The authors systematically searched MEDLINE, EMBASE, Scopus, and key journals through February 26, 2013. They included original studies that assessed health professionals and trainees using simulation and then linked those scores with patient-related outcomes assessed in the workplace. Two reviewers independently extracted information on participants, tasks, validity evidence, study quality, patent-related and simulation-based outcomes, and magnitude of correlation. All correlations were pooled using random-effects meta-analysis.
   Results
   Of 11,628 potentially relevant articles, the 33 included studies enrolled 1,203 participants, including postgraduate physicians (n = 24 studies), practicing physicians (n = 8), medical students (n = 6), dentists (n = 2), and nurses (n = 1). The pooled correlation for provider behaviors was 0.51 (95% confidence interval [Cl], 0.38 to 0.62; n = 27 studies); for time behaviors, 0.44 (95% Cl, 0.15 to 0.66; n = 7); and for patient outcomes, 0.24(95% Cl, 0.02 to 0.47; n = 5). Most reported validity evidence was favorable, though studies often included only correlational evidence. Validity evidence of internal structure (n = 13 studies), content (n = 12), response process (n = 2), and consequences (n = 1) were reported less often. Three tools showed large pooled correlations and favorable (albeit incomplete) validity evidence.
   Conclusions
   Simulation-based assessments often correlate positively with patient-related outcomes. Although these surrogates are imperfect, tools with established validity evidence may replace workplace-based assessments for evaluating select procedural skills.
OI MANEIRAO MANEIRAOMANEIRAOMANEIRAO MANEIRAO
SN 6516516516
EI 849819981981
PD FEB
PY 2015

My current objective is to capture the entire text of the 'AB' header. Note that the length and format of the AB contents don't change much: it's pretty much always paragraphs, or a single line of text, until the next header.

I've tried a bunch of different regex patterns; the one that got me closest to what I want is:

\nAB ((.*?\n)+)(\n[A-Z]{2}\s)?

However, it runs to the end of the file, consuming every header it finds. I would like the pattern to stop matching at the next header after AB, whatever it may be.
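One likely reason for the overrun: the `+` on `(.*?\n)+` is greedy, so it repeats over every remaining line, and the trailing `(\n[A-Z]{2}\s)?` is optional, so it can match nothing and never forces a stop. The sketch below reproduces this on a shortened stand-in for the file (the `sample` text and variable names are illustrative, not part of the original data).

```python
import re

# Shortened, illustrative stand-in for the file shown above.
sample = (
    "TI TEST TEST TEST\n"
    "   ASASPAOS\n"
    "SO EITCHA EITCHA\n"
    "AB Purpose\n"
    "   To examine the evidence.\n"
    "   Method\n"
    "   The authors searched MEDLINE.\n"
    "OI MANEIRAO\n"
    "SN 6516516516\n"
    "PY 2015\n"
)

# The pattern from the question, unchanged.
original = re.compile(r"\nAB ((.*?\n)+)(\n[A-Z]{2}\s)?")

m = original.search(sample)
overrun = m.group(1)

# The greedy "+" keeps adding lines, so group 1 swallows the later headers too.
print("OI MANEIRAO" in overrun)        # True
print(overrun.endswith("PY 2015\n"))   # True
```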

The headers always follow the same pattern: a line break, then two uppercase letters and a space, or:

\n[A-Z]{2}\s

Thanks to whoever helps in any way.

My question is different from the usual greedy/non-greedy questions because the match should not stop at a single character made non-greedy, but at an entire "stop" group.

Ruggi

1 Answer


Is this what you're looking for?

^AB ([\w\W]*?)(?=\n[A-Z]{2}\s)

Demo: https://regex101.com/r/2j38TJ/5

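`(?=...)` is a positive lookahead. It asserts that the given subpattern can be matched at this position, without consuming any characters, so the lazy `[\w\W]*?` stops right before the next header. Note that in Python, `^` only matches at the start of each line (rather than only at the start of the string) when the pattern is used with `re.MULTILINE`.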
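A minimal sketch of using the pattern from Python, assuming the file contents have been read into one string. The `sample` text is a shortened stand-in for the file in the question, and the names `AB_PATTERN` and `abstract` are illustrative.

```python
import re

# Shortened, illustrative stand-in for the file in the question.
sample = (
    "TI TEST TEST TEST\n"
    "   ASASPAOS\n"
    "SO EITCHA EITCHA\n"
    "AB Purpose\n"
    "   To examine the evidence.\n"
    "   Method\n"
    "   The authors searched MEDLINE.\n"
    "OI MANEIRAO\n"
    "SN 6516516516\n"
    "PY 2015\n"
)

# re.MULTILINE makes ^ match at the start of every line.
# [\w\W]*? lazily matches any character, newlines included,
# and the lookahead stops it just before the next header line.
AB_PATTERN = re.compile(r"^AB ([\w\W]*?)(?=\n[A-Z]{2}\s)", re.MULTILINE)

match = AB_PATTERN.search(sample)
abstract = match.group(1) if match else None
print(abstract)
```

Continuation lines are indented, so they never look like `\n[A-Z]{2}\s` and stay inside the capture. One caveat: if AB were the last header in a file, the lookahead would find no following header and the pattern would not match at all.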

SanV
  • 1
    Updated as such, to include any two upper case characters – SanV May 01 '19 at 02:29
  • This really worked! THANKS! But I didn't know that `?=` had this effect. Can you explain, SanV? – Ruggi May 01 '19 at 02:43
  • 1
    added a note above. Positive Lookahead matches but does not consume any characters _(and stops further matching)_ – SanV May 01 '19 at 02:47
  • 1
    Just a relevant comment: when I woke up I tested the suggested regex, and it was missing a non-greedy quantifier to catch just the content I needed. Here's the resulting regex: ``` ^AB ([\w\W]*?)(?=\n[A-Z]{2}\s) ``` – Ruggi May 01 '19 at 14:42
  • 1
    Great, thanks for the update. In fact, I had that in the [Demo](https://regex101.com/r/2j38TJ/5) but missed updating it in the answer above. I'll fix that as well. Note that you could also wrap "AB " in a positive lookbehind if you don't want it to be part of the match, like so: `(?<=^AB )`. Not essential, but FYI. – SanV May 01 '19 at 17:05
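The lookbehind variant from the last comment could be sketched as follows. With `(?<=^AB )` the header itself is asserted but not consumed, so the whole match (group 0) is the AB text and no capture group is needed. The `sample` string and variable names are illustrative.

```python
import re

# Short, illustrative input with an AB header followed by other headers.
sample = (
    "SO EITCHA\n"
    "AB Purpose\n"
    "   To examine the evidence.\n"
    "OI MANEIRAO\n"
    "PY 2015\n"
)

# (?<=^AB ) requires "AB " at the start of a line immediately before the match;
# (?=\n[A-Z]{2}\s) requires a header line immediately after it.
pattern = re.compile(r"(?<=^AB )[\w\W]*?(?=\n[A-Z]{2}\s)", re.MULTILINE)

m = pattern.search(sample)
abstract = m.group(0) if m else None
print(abstract)
```

Python's `re` module only accepts fixed-width lookbehinds; `^AB ` qualifies because `^` is zero-width and `AB ` is always three characters.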