The goal is to extract 100 characters before and after the keyword "bankruptcy".

import re

text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."

pattern = r"(?i)\s*(?:\w|\W){0,100}\b(?:bankruptcy)\b\s*(?:\w|\W){0,100}"

output = re.findall(pattern, text)

Expected output:

['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 
 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Current output: ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Is there a way to get overlapping matches using re.findall?

Wiktor Stribiżew
thecoder

1 Answer

You can use the following solution based on the PyPI regex module (install it with pip install regex):

import regex
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
pattern = r"\b(?<=(.{0,100}))(bankruptcy)\b(?=(.{0,100}))"
print( [f"{x}{y}{z}" for x,y,z in regex.findall(pattern, text, flags=regex.I|regex.DOTALL)] )
# => ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

See the Python demo online. Regex details:

  • \b - a word boundary
  • (?<=(.{0,100})) - a positive lookbehind that matches a location immediately preceded by any 0 to 100 chars, which are captured into Group 1 (note that regex.DOTALL lets . match any char, including line breaks)
  • (bankruptcy) - Group 2: bankruptcy (matched case-insensitively due to the regex.I flag)
  • \b - a word boundary
  • (?=(.{0,100})) - a positive lookahead that matches a location immediately followed by any 0 to 100 chars, which are captured into Group 3.

Since the lookbehinds and lookaheads do not consume the patterns they match, you may access all the chars on the left and on the right of the word you search for.
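To illustrate the point about lookarounds not consuming text, here is a minimal sketch using only the standard re module and a fixed-width lookahead. The sample string and variable names are made up for the illustration and are not part of the solution above.

```python
import re

sample = "ababa"

# A consuming pattern: after matching "aba" at index 0, the search resumes
# at index 3, so the overlapping occurrence at index 2 is skipped.
consuming = re.findall(r"aba", sample)

# A zero-width lookahead with a capturing group: the match itself is empty,
# so the search advances one position at a time and finds both occurrences.
zero_width = re.findall(r"(?=(aba))", sample)

print(consuming)   # ['aba']
print(zero_width)  # ['aba', 'aba']
```

The same mechanism is what lets the pattern in this answer return overlapping context windows.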

Note that re cannot be used with this pattern because it only allows fixed-width patterns in lookbehinds.
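The restriction concerns the variable-width lookbehind, not the task itself. If installing regex is not an option, one possible workaround (a different technique from the one above, sketched here under the assumption that slicing the original string is acceptable) is to match only the keyword with re.finditer and cut the context out by index. The helper name keyword_contexts and its width parameter are hypothetical.

```python
import re

def keyword_contexts(text, keyword, width=100):
    """Return each whole-word occurrence of `keyword` together with up to
    `width` characters on either side (hypothetical helper)."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    contexts = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)  # clamp at the start of the text
        end = m.end() + width              # slicing past the end is safe
        contexts.append(text[start:end])
    return contexts

text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
print(keyword_contexts(text, "bankruptcy"))
```

Because only the keyword itself is consumed, the windows around neighbouring occurrences can overlap freely. For this 99-character sample, both windows cover the whole string.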

Wiktor Stribiżew
    @thecoder I now do almost all complex regex parsing with the PyPI regex module. It has proven much quicker and more reliable than `re` (with the exception of a few scenarios). No wonder a lot of Python users did not like regex. If they had used PyPI regex from the start, I think they'd love it. – Wiktor Stribiżew Jun 25 '20 at 17:02