4

I have a bunch of documents and I'm interested in finding mentions of clinical trials. These are always denoted by the letters being in all caps (e.g. ASPIRE). I want to match any word in all caps, greater than three letters. I also want the surrounding +- 4 words for context.

Below is what I currently have. It kind of works, but fails the test below.

import re
pattern = r'((?:\w*\s*){,4})\s*([A-Z]{4,})\s*((?:\s*\w*){,4})'
line = r"Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
re.findall(pattern, line)
Chris Daly

4 Answers

2

Would the following regex work for you?

(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}

Tested here: https://regex101.com/r/nTzLue/1/

Allan
  • In OP's input `Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY` this regex doesn't find `IPSUM` – anubhava May 01 '18 at 10:47
2

On the left side you can match one or more word characters (\w+) followed by one or more non-word characters (\W+). Combine the two in a non-capturing group and repeat it exactly 4 times with {4}: (?:\w+\W+){4}

Then capture 3 or more uppercase characters in a group ([A-Z]{3,}).

On the right side, reverse the order used on the left: match the non-word characters first, then the word characters, again repeated 4 times: (?:\W+\w+){4}

(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}

The captured group will contain your uppercase word, and the non-capturing groups will match the surrounding words.
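As a quick check of how this pattern behaves, here is a minimal sketch that runs it against the sample sentence from the question. The variable names are illustrative only. Because the pattern demands exactly four words on each side, only an uppercase word with four words before it and four after it can match.

```python
import re

# Sample sentence from the question.
line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."

# Exactly four words, the uppercase word (captured), then exactly four words.
pattern = re.compile(r'(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}')

# findall returns only the captured group; group(0) is the whole match with context.
captured = pattern.findall(line)
contexts = [m.group(0) for m in pattern.finditer(line)]

print(captured)   # ['DUMMY']
print(contexts)   # ['Lorem IPSUM is simply DUMMY text of the printing']
```

IPSUM has only one word before it and INDUSTRY has none after it, so neither satisfies the fixed {4} repetition.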

The fourth bird
  • In OP's input `Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY` this regex is finding only one match though there are 3 such capital letter words. – anubhava May 01 '18 at 10:46
2

You may use this code in Python, which does it in 2 steps. First we split the input on capital-letter words of 4 or more letters, and then we find up to 4 words on either side of each match.

import re

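text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'

# re1 captures a capital-letter word of 4+ letters; because it is a capturing
# group, re.split keeps those words at the odd indices of the result.
re1 = r'\b([A-Z]{4,})\b'
# re2 matches up to 4 words, starting from the beginning of the string it is applied to.
re2 = r'(?:\s*\w+\b){,4}'

arr = re.split(re1, text)

result = []

for i in range(len(arr)):
    if i % 2:
        # arr[i] is the capital-letter word; arr[i-1] and arr[i+1] are its neighbouring segments.
        result.append( (re.search(re2, arr[i-1]).group(), arr[i], re.search(re2, arr[i+1]).group()) )


print(result)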


Output:

[('Lorem', 'IPSUM', ' is simply'), (' is simply', 'DUMMY', ' text of the printing'), (' text of the printing', 'INDUSTRY', '')]
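Note that in the output above the context stops at the neighbouring capital-letter words, and the left context is taken from the start of the preceding segment (see INDUSTRY). If you would rather have the four words immediately before and after each match, regardless of whether they are themselves in capitals, one possible variation is sketched below. The function name `caps_with_context` and its parameters are hypothetical, not part of the original answer.

```python
import re

def caps_with_context(text, min_len=4, window=4):
    """Return (before, word, after) for each all-caps word of min_len+ letters.

    Assumes window >= 1. Context is the `window` words nearest the match.
    """
    caps = re.compile(r'\b[A-Z]{%d,}\b' % min_len)
    results = []
    for m in caps.finditer(text):
        # Last `window` words before the match, first `window` words after it.
        before = re.findall(r'\w+', text[:m.start()])[-window:]
        after = re.findall(r'\w+', text[m.end():])[:window]
        results.append((' '.join(before), m.group(), ' '.join(after)))
    return results

text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'
rows = caps_with_context(text)
for row in rows:
    print(row)
# ('Lorem', 'IPSUM', 'is simply DUMMY text')
# ('Lorem IPSUM is simply', 'DUMMY', 'text of the printing')
# ('the printing and typesetting', 'INDUSTRY', '')
```

Because each match is located first and the context is sliced afterwards, neighbouring matches can share context words, which a single non-overlapping regex match cannot do.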
anubhava
1

This should do the job:

pattern = r'(?:(\w+ ){4})[A-Z]{3}(\w+ ){5}'
Gaterde