Extract words surrounding a RegEx match using re.findall when there exists an overlapping index

Question

The goal is to extract 100 characters before and after the keyword "bankruptcy".

import re

text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."

pattern = r"(?i)\s*(?:\w|\W){0,100}\b(?:bankruptcy)\b\s*(?:\w|\W){0,100}"

output = re.findall(pattern, text)

Expected output:

['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 
 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Current output: ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Is there a way to resolve overlapping indexes using re.findall?

Does this answer your question? [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) — ggorlen, Jun 25 '20 at 16:41
`(?:\w|\W){0,100}` does not match 0 to 100 *words*, only 0 to 100 chars. Also, regex does not allow matching multiple matches that share the same start position. — Wiktor Stribiżew, Jun 25 '20 at 16:44
Just edited the question @WiktorStribiżew :) 0-100 characters — thecoder, Jun 25 '20 at 16:47

score 2 · Accepted Answer · answered Jun 25 '20 at 16:50

You may use the following solution based on the PyPI regex module (install with pip install regex):

import regex
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
pattern = r"\b(?<=(.{0,100}))(bankruptcy)\b(?=(.{0,100}))"
print( [f"{x}{y}{z}" for x,y,z in regex.findall(pattern, text, flags=regex.I|regex.DOTALL)] )
# => ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

See the Python demo online. Regex details:

\b - a word boundary
(?<=(.{0,100})) - a positive lookbehind that matches a location that is immediately preceded with any 0 to 100 chars (note regex.DOTALL allows the . to match any chars) that are captured into Group 1
(bankruptcy) - Group 2: bankruptcy (matched in a case insensitive way due to regex.I flag)
\b - a word boundary
(?=(.{0,100})) - a positive lookahead that matches a location immediately followed with any 0 to 100 chars, which are captured into Group 3.

Since the lookbehinds and lookaheads do not consume the patterns they match, you may access all the chars on the left and on the right of the word you search for.
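
As a small illustration of that non-consuming behavior, here is a sketch using only the standard re module. The toy string "ababa" is an assumption made for this example and is not from the question. A plain pattern finds one match because the first match consumes the characters the second one would need, while a capturing group inside a lookahead should report both overlapping occurrences:

```python
import re

sample = "ababa"  # hypothetical toy input, chosen so the two matches overlap

# Plain pattern: the first match consumes "aba", leaving only "ba" to scan.
consuming = re.findall(r"aba", sample)

# Lookahead with a capturing group: nothing is consumed, so the scan
# resumes one position later and the overlapping occurrence is also found.
lookahead = re.findall(r"(?=(aba))", sample)

print(consuming)  # ['aba']
print(lookahead)  # ['aba', 'aba']
```

The pattern in the answer applies the same idea on both sides of the keyword, which is why the two 100-character windows can share text.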

Note that the built-in re module cannot be used with this pattern, because re requires lookbehind patterns to have a fixed width.
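
If installing a third-party module is not an option, a different technique may still give the same windows with plain re: find each keyword with re.finditer and slice the original string around the match. This is only a sketch and not the lookaround approach above. The helper name keyword_windows and its width parameter are hypothetical names introduced for illustration.

```python
import re

def keyword_windows(text, keyword, width=100):
    """Return each whole-word match of `keyword` together with up to
    `width` characters on either side (hypothetical helper)."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    windows = []
    for m in pattern.finditer(text):
        # Clamp the slice to the string bounds; matches never overlap each
        # other, but the slices around them are free to.
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        windows.append(text[start:end])
    return windows

text = ("The company announced bankruptcy on jan 1, 1900. "
        "Many more companies announced bankruptcy in 1920s.")

for window in keyword_windows(text, "bankruptcy"):
    print(window)
```

Because the sample text is shorter than 100 characters, both windows here should cover the whole string, matching the expected output in the question.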

@thecoder I do almost all complex regex parsing with the PyPI regex module now. It has proven much quicker and more reliable than `re` (with the exception of a few scenarios). No wonder a lot of Python users did not like regex. If they had used PyPI regex from the start, I think they'd love it. — Wiktor Stribiżew, Jun 25 '20 at 17:02
