Regex to match a paragraph ending with a period

Question

I have a series of documents that can have this format:

Diagnosis of one of the following: A) Neovascular (wet) age-related
macular degeneration OR B) Macular edema following retinal vein
occlusion, OR C) Diabetic macular edema OR D) Diabetic retinopathy in
patients with diabetic macular edema. More text here.

PA Criteria

Criteria Details


Eylea (s)

Products Affected
 EYLEA

Exclusion
Criteria

Required
Medical
Information

Age Restrictions

Prescriber
Restrictions

Coverage
Duration

Other Criteria

Off Label Uses











12 months

Indications

All Medically-accepted Indications.

Formulary ID 20276, Version 12

101

I would like to match (and then remove) all text that is in a paragraph ending with a period. So, I would like to remove

Diagnosis of one of the following: A) Neovascular (wet) age-related
macular degeneration OR B) Macular edema following retinal vein
occlusion, OR C) Diabetic macular edema OR D) Diabetic retinopathy in
patients with diabetic macular edema.

and

All Medically-accepted Indications.

I've tried something like this:

\n\n[\s\S]*?[.][\n\n]

but I would somehow need to say that \n\n CANNOT exist in the capture of

[\s\S]*?

How would I do this?

Thanks!

@Anwarvic With `re.DOTALL` flag, I think that would probably work. — Asocia, Jun 26 '20 at 18:26
You likely want the regex `(?:\A|\n{2})(?:(?!\n{2}).)+\.(?=\n{2}|\Z)` with `re.DOTALL` - see it [here](https://regex101.com/r/XORRzN/1) — ctwheels, Jun 26 '20 at 18:30
Do you wish to keep all the empty spaces after removing, or would you like each paragraph stripped? — Red, Jun 26 '20 at 18:41
Are all periods in the text sentence terminators? Nothing like "Dr. Jones recommends figs as part of a balanced diet."? — Cary Swoveland, Jun 26 '20 at 19:17

ctwheels · Answer 1 · 2020-06-26T20:16:41.677

You can use either of the following regular expressions to accomplish this.

Option 1

This option uses re.DOTALL.

(?:\A|\n{2})(?:(?!\n{2}).)+\.(?=\n{2}|\Z)

How it works:

(?:\A|\n{2}) match either of the following:
- \A assert position at the start of the string (different than ^ - which asserts position at the start of the line)
- \n{2} match two consecutive newline characters
(?:(?!\n{2}).)+ tempered greedy token matching any character, but failing to match two consecutive newline characters
\. match . literally
(?=\n{2}|\Z) lookahead matching either of the following (asserts what follows matches, without including the match in the result):
- \n{2} match two consecutive newline characters
- \Z opposite of \A - assert position at the end of the string (different than $ - which asserts position at the end of the line)
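
As a minimal sketch of how Option 1 could be applied in Python (the sample string and the variable names below are made up for illustration and are not taken from the question's real documents):

```python
import re

# Option 1 pattern; re.DOTALL lets "." match newline characters as well.
PARAGRAPH_ENDING_IN_PERIOD = re.compile(
    r"(?:\A|\n{2})(?:(?!\n{2}).)+\.(?=\n{2}|\Z)",
    re.DOTALL,
)

# Hypothetical sample: three paragraphs, only the middle one ends with a period.
sample = "PA Criteria\n\nDiagnosis of one of\nthe following.\n\n12 months"

# The match includes the blank-line separator in front of the paragraph,
# so removing it does not leave an extra gap behind.
cleaned = PARAGRAPH_ENDING_IN_PERIOD.sub("", sample)
print(repr(cleaned))  # 'PA Criteria\n\n12 months'
```

Note that the single \n inside the middle paragraph does not stop the match; only two consecutive newlines do.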

Option 2

This option is more efficient than Option 1, using approximately 22% fewer steps.

(?:\A|\n{2})(?:.|\n(?!\n))+\.(?=\n{2}|\Z)

How it works (most of this is the same as the previous, so I'll only explain the difference):

(?:.|\n(?!\n))+ matches any character (except \n since . doesn't match newline characters) or \n if it's not followed by another \n
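
A short sketch of Option 2 in Python, this time listing the matched paragraphs instead of removing them. The sample text and names are hypothetical; no flags are passed because the pattern handles newlines itself:

```python
import re

# Option 2 pattern; "." keeps its default meaning (no newlines).
pattern = re.compile(r"(?:\A|\n{2})(?:.|\n(?!\n))+\.(?=\n{2}|\Z)")

# Hypothetical sample: the first and last paragraphs end with a period.
sample = "First paragraph\nends here.\n\nHeading\n\nLast one."

# A match may carry the leading blank-line separator, so strip it for display.
matches = [m.group(0).strip() for m in pattern.finditer(sample)]
print(matches)  # ['First paragraph\nends here.', 'Last one.']
```

The first paragraph is found through the \A branch and the last one through the \Z branch of the lookahead.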

Option 3

This only works in PCRE or with the PyPI regex package. It is more efficient than the other options above: 21% fewer steps than Option 2 and 39% fewer steps than Option 1. This regex uses the re.DOTALL option.

(?:\A|\n{2})(?:\n{2}(*SKIP)(*FAIL)|.)+?\.(?=\n{2}|\Z)

How it works (again, mostly the same, just explaining the difference):

(?:\n{2}(*SKIP)(*FAIL)|.)+? match either of the following one or more times, but as few as possible (+? - lazy quantifier):
- \n{2}(*SKIP)(*FAIL) match two consecutive newline characters, then fail the match. (*SKIP)(*FAIL) prevents the regex from backtracking past its current position and then fails the current match attempt. Simply put, this skips all the characters matched to the left of (*SKIP) (up to and including the \n\n), then continues pattern matching after that position (see this question for more info).
- . match any character (including newlines, since re.DOTALL is set)

Red · Answer 2 · 2020-06-26T18:45:17.927

Here's a simple solution that doesn't require any modules:

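d = '...'  # the document text

# Keep only the paragraphs (blocks separated by a blank line) that do not end with a period.
ps = '\n\n'.join([p for p in d.split('\n\n') if not p.endswith('.')])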

This will result in the exact same format as the original.

If you wish to have it more tidy:

ps = '\n\n'.join([p for p in d.split('\n\n') if not p.endswith('.') and p.strip()])
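
As a rough sketch, the same idea can be wrapped in a function. The function name is hypothetical, and the rstrip() call is an added assumption: it treats a paragraph as ending with a period even when trailing spaces follow the period, which the one-liners above do not do.

```python
def drop_period_paragraphs(text: str) -> str:
    """Remove blank-line-separated paragraphs whose last non-space character is a period."""
    kept = [
        p for p in text.split("\n\n")
        # Skip empty chunks and paragraphs that end with "." (ignoring trailing whitespace).
        if p.strip() and not p.rstrip().endswith(".")
    ]
    return "\n\n".join(kept)


# Sample borrowed from the question; the trailing space after the period is made up.
sample = "Indications\n\nAll Medically-accepted Indications. \n\nFormulary ID 20276, Version 12"
result = drop_period_paragraphs(sample)
print(repr(result))  # 'Indications\n\nFormulary ID 20276, Version 12'
```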

score 0 · Answer 3 · answered Jun 26 '20 at 18:40

((.+\n)*(.*\.\n)) should do the trick.

(.+\n) Capture a line (including newline) that include 1 or more characters

(.+\n)* Do it zero or more times

((.+\n)*(.*\.\n)) And also include a following line of zero or more characters that ends in a period then newline
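
A small sketch of this pattern with re.sub in Python (hypothetical sample text). One thing to keep in mind: the pattern requires a newline right after the period, so a final paragraph with no trailing newline would not be matched.

```python
import re

# No flags: "." does not match newlines, so each (.+\n) consumes exactly one non-empty line.
pattern = re.compile(r"((.+\n)*(.*\.\n))")

# Hypothetical sample: the two-line middle paragraph ends with a period.
sample = "Header\n\nLine one\nline two.\n\nFooter\n"

cleaned = pattern.sub("", sample)
print(repr(cleaned))  # 'Header\n\n\nFooter\n'
```

The blank lines around the removed paragraph are left in place, so the result contains an extra empty line.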
