
I have a series of documents that can have this format:

Diagnosis of one of the following: A) Neovascular (wet) age-related
macular degeneration OR B) Macular edema following retinal vein
occlusion, OR C) Diabetic macular edema OR D) Diabetic retinopathy in
patients with diabetic macular edema. More text here.

PA Criteria

Criteria Details


Eylea (s)

Products Affected
 EYLEA

Exclusion
Criteria

Required
Medical
Information

Age Restrictions

Prescriber
Restrictions

Coverage
Duration

Other Criteria

Off Label Uses











12 months

Indications

All Medically-accepted Indications.

Formulary ID 20276, Version 12

101

I would like to match (and then remove) all text that is in a paragraph ending with a period. So, I would like to remove

Diagnosis of one of the following: A) Neovascular (wet) age-related
macular degeneration OR B) Macular edema following retinal vein
occlusion, OR C) Diabetic macular edema OR D) Diabetic retinopathy in
patients with diabetic macular edema.

and

All Medically-accepted Indications.

I've tried something like this:

\n\n[\s\S]*?[.][\n\n]

but I would somehow need to say that \n\n CANNOT exist in the capture of

[\s\S]*?

How would I do this?

Thanks!

Arthur Lee

3 Answers


You can use either of the following regular expressions to accomplish this.

Option 1

This option uses re.DOTALL.


(?:\A|\n{2})(?:(?!\n{2}).)+\.(?=\n{2}|\Z)

How it works:

  • (?:\A|\n{2}) match either of the following:
    • \A assert position at the start of the string (unlike ^, which also matches at the start of every line when re.MULTILINE is set)
    • \n{2} match two consecutive newline characters
  • (?:(?!\n{2}).)+ tempered greedy token: match any character, one at a time, as long as that position is not the start of two consecutive newline characters
  • \. match . literally
  • (?=\n{2}|\Z) lookahead matching either of the following (asserts what follows matches, without including the match in the result):
    • \n{2} match two consecutive newline characters
    • \Z opposite of \A - assert position at the end of the string (unlike $, which also matches at the end of every line when re.MULTILINE is set)
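
As a rough illustration of Option 1 with Python's built-in re module, the sketch below removes the matching paragraphs with sub(). The sample string is a shortened, made-up stand-in for the question's document, and the names OPTION_1, sample and cleaned are only illustrative.

```python
import re

# Option 1's pattern. re.DOTALL makes "." match newlines too, so the
# negative lookahead is what keeps each match inside a single paragraph.
OPTION_1 = re.compile(r"(?:\A|\n{2})(?:(?!\n{2}).)+\.(?=\n{2}|\Z)", re.DOTALL)

# Shortened stand-in for the question's document (an assumption, not the real file).
sample = (
    "Diagnosis of one of the following: A) Neovascular (wet)\n"
    "macular degeneration.\n"
    "\n"
    "PA Criteria\n"
    "\n"
    "All Medically-accepted Indications.\n"
    "\n"
    "Formulary ID 20276, Version 12"
)

# Replace every paragraph that ends with a period by the empty string.
cleaned = OPTION_1.sub("", sample)
print(repr(cleaned))
```

A match that starts at \n{2} also consumes that blank-line separator, while a match that starts at \A has none to consume, so the result may begin with a leftover blank line; calling strip() on it is one way to tidy that up.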

Option 2

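This option is more efficient than Option 1, using approximately 22% fewer steps. Unlike Option 1, it must be run without re.DOTALL, because it relies on . not matching newline characters.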


(?:\A|\n{2})(?:.|\n(?!\n))+\.(?=\n{2}|\Z)

How it works (most of this is the same as the previous, so I'll only explain the difference):

  • (?:.|\n(?!\n))+ matches any character (except \n since . doesn't match newline characters) or \n if it's not followed by another \n
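
A small sketch of Option 2 in Python follows, listing the paragraphs it matches. It also shows why the pattern depends on . not matching newlines: with re.DOTALL switched on, the same pattern spans the whole text. The text variable and the other names are made up for the example.

```python
import re

# Option 2's pattern, compiled with no flags: "." must NOT match "\n" here.
OPTION_2 = re.compile(r"(?:\A|\n{2})(?:.|\n(?!\n))+\.(?=\n{2}|\Z)")

# Made-up sample text: two paragraphs ending in a period around a heading.
text = "First paragraph,\nwrapped over two lines.\n\nHeading\n\nLast paragraph."

# Paragraphs that end in a period, with the leading blank-line separator stripped.
matched = [m.group(0).strip() for m in OPTION_2.finditer(text)]
print(matched)

# With re.DOTALL, "." also consumes "\n\n", so one match swallows everything.
overreach = re.search(OPTION_2.pattern, text, re.DOTALL).group(0)
print(overreach == text)
```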

Option 3

This only works in PCRE or with the PyPI regex package. It is more efficient than the other options above: 21% fewer steps than Option 2 and 39% fewer steps than Option 1. This regex uses the re.DOTALL option.


(?:\A|\n{2})(?:\n{2}(*SKIP)(*FAIL)|.)+?\.(?=\n{2}|\Z)

How it works (again, mostly the same, just explaining the difference):

  • (?:\n{2}(*SKIP)(*FAIL)|.)+? match either of the following one or more times, but as few as possible (+? - lazy quantifier)
    • \n{2}(*SKIP)(*FAIL) match two consecutive newline characters, then fail. (*SKIP) prevents the regex from backtracking past its current position, and (*FAIL) then fails the current match attempt. Simply put, this skips all the characters matched to the left of (*SKIP) (up to and including the \n\n), then continues pattern matching after that position.
ctwheels

Here's a simple solution that doesn't require any modules:

doc = '...'

ps = '\n\n'.join([p for p in doc.split('\n\n') if not p.endswith('.')])

This will result in the exact same format as the original.




If you want a tidier result that also drops empty paragraphs:

ps = '\n\n'.join([p for p in doc.split('\n\n') if not p.endswith('.') and p.strip()])
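
For reference, here is the same idea wrapped in a small function that can be run as-is. The function name, the drop_empty flag and the sample string are made up for this sketch, and the rstrip() call is an addition that assumes paragraphs may carry trailing whitespace after the period.

```python
def drop_period_paragraphs(doc: str, drop_empty: bool = False) -> str:
    """Remove blank-line-separated paragraphs whose last non-space character is a period."""
    kept = []
    for paragraph in doc.split("\n\n"):
        # rstrip() is an assumption beyond the one-liner above: it tolerates
        # trailing spaces or a stray newline after the final period.
        if paragraph.rstrip().endswith("."):
            continue
        # Optionally discard the empty pieces produced by runs of blank lines.
        if drop_empty and not paragraph.strip():
            continue
        kept.append(paragraph)
    return "\n\n".join(kept)


# Made-up sample with an extra run of blank lines after the first paragraph.
sample = "Wrapped paragraph\nending in a period.\n\n\n\nHeading\n\nSingle line.\n\n12 months"

same_layout = drop_period_paragraphs(sample)
tidy = drop_period_paragraphs(sample, drop_empty=True)
print(repr(same_layout))
print(repr(tidy))
```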
Red

((.+\n)*(.*\.\n)) should do the trick.

(.+\n) Capture a line (including its newline) that contains 1 or more characters

(.+\n)* Do it zero or more times

((.+\n)*(.*\.\n)) And also include a following line of zero or more characters that ends in a period then newline
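
A quick way to try this pattern with Python's re module is sketched below; the sample strings and variable names are made up. Note that every line, including the last, needs a trailing newline for the pattern to see it. The second call shows a limitation worth knowing: a line that ends in a period in the middle of a paragraph is removed too, even though the paragraph as a whole does not end in a period.

```python
import re

# The answer's pattern, unchanged; no flags, so "." stops at newlines.
LINE_BASED = re.compile(r"((.+\n)*(.*\.\n))")

# Made-up sample; note the trailing "\n" after the last line.
doc = "wrapped paragraph\nending in a period.\n\nHeading\n\nSingle line.\n"
cleaned = LINE_BASED.sub("", doc)
print(repr(cleaned))

# Limitation: the period on the first line triggers a match by itself.
partial = LINE_BASED.sub("", "Ends here.\nbut the paragraph continues\n")
print(repr(partial))
```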

Ciaran Woodward